
Presidio Analyzer API Reference

Presidio analyzer package.

AnalysisExplanation

Hold tracing information to explain why PII entities were identified as such.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| recognizer | str | name of recognizer that made the decision | required |
| original_score | float | recognizer's confidence in result | required |
| pattern_name | str | name of pattern (if decision was made by a PatternRecognizer) | None |
| pattern | str | regex pattern that was applied (if PatternRecognizer) | None |
| validation_result | float | result of a validation (e.g. checksum) | None |
| textual_explanation | str | free text describing a decision of a logic or model | None |
Source code in presidio_analyzer/analysis_explanation.py
class AnalysisExplanation:
    """
    Hold tracing information to explain why PII entities were identified as such.

    :param recognizer: name of recognizer that made the decision
    :param original_score: recognizer's confidence in result
    :param pattern_name: name of pattern
            (if decision was made by a PatternRecognizer)
    :param pattern: regex pattern that was applied (if PatternRecognizer)
    :param validation_result: result of a validation (e.g. checksum)
    :param textual_explanation: Free text for describing
            a decision of a logic or model
    """

    def __init__(
        self,
        recognizer: str,
        original_score: float,
        pattern_name: str = None,
        pattern: str = None,
        validation_result: float = None,
        textual_explanation: str = None,
        regex_flags: int = None,
    ):

        self.recognizer = recognizer
        self.pattern_name = pattern_name
        self.pattern = pattern
        self.original_score = original_score
        self.score = original_score
        self.textual_explanation = textual_explanation
        self.score_context_improvement = 0
        self.supportive_context_word = ""
        self.validation_result = validation_result
        self.regex_flags = regex_flags

    def __repr__(self):
        """Create string representation of the object."""
        return str(self.__dict__)

    def set_improved_score(self, score: float) -> None:
        """Update the score and calculate the difference from the original score."""
        self.score = score
        self.score_context_improvement = self.score - self.original_score

    def set_supportive_context_word(self, word: str) -> None:
        """Set the context word which helped increase the score."""
        self.supportive_context_word = word

    def append_textual_explanation_line(self, text: str) -> None:
        """Append a new line to textual_explanation field."""
        if self.textual_explanation is None:
            self.textual_explanation = text
        else:
            self.textual_explanation = "{}\n{}".format(self.textual_explanation, text)

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return self.__dict__
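The score bookkeeping above can be exercised without Presidio installed. The `ExplanationSketch` class below is a hypothetical stand-in that mirrors only the `set_improved_score` logic shown in the source:

```python
# Minimal sketch (no presidio dependency) of AnalysisExplanation's score
# bookkeeping: set_improved_score stores the new score and tracks the delta
# from the original score. ExplanationSketch is an illustrative stand-in.
class ExplanationSketch:
    def __init__(self, recognizer: str, original_score: float):
        self.recognizer = recognizer
        self.original_score = original_score
        self.score = original_score          # starts equal to the original
        self.score_context_improvement = 0.0

    def set_improved_score(self, score: float) -> None:
        self.score = score
        self.score_context_improvement = self.score - self.original_score


exp = ExplanationSketch("PhoneRecognizer", original_score=0.5)
exp.set_improved_score(0.85)
print(round(exp.score_context_improvement, 2))  # 0.35
```

In the real class, a context-aware enhancer calls `set_improved_score` after a supportive context word raises the confidence.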

__repr__()

Create string representation of the object.

Source code in presidio_analyzer/analysis_explanation.py
def __repr__(self):
    """Create string representation of the object."""
    return str(self.__dict__)

append_textual_explanation_line(text)

Append a new line to textual_explanation field.

Source code in presidio_analyzer/analysis_explanation.py
def append_textual_explanation_line(self, text: str) -> None:
    """Append a new line to textual_explanation field."""
    if self.textual_explanation is None:
        self.textual_explanation = text
    else:
        self.textual_explanation = "{}\n{}".format(self.textual_explanation, text)
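The None-vs-append branching of this method can be sketched standalone (no presidio dependency), with the field modeled as a plain value:

```python
# Sketch of append_textual_explanation_line's branching: the first line
# initializes the field, later lines are appended on a new line.
def append_line(current, text):
    """Mirror of the method's logic, operating on a plain value."""
    return text if current is None else "{}\n{}".format(current, text)


explanation = None  # textual_explanation starts as None
explanation = append_line(explanation, "Detected by regex pattern")
explanation = append_line(explanation, "Score boosted by context word")
print(explanation)
```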

set_improved_score(score)

Update the score and calculate the difference from the original score.

Source code in presidio_analyzer/analysis_explanation.py
def set_improved_score(self, score: float) -> None:
    """Update the score and calculate the difference from the original score."""
    self.score = score
    self.score_context_improvement = self.score - self.original_score

set_supportive_context_word(word)

Set the context word which helped increase the score.

Source code in presidio_analyzer/analysis_explanation.py
def set_supportive_context_word(self, word: str) -> None:
    """Set the context word which helped increase the score."""
    self.supportive_context_word = word

to_dict()

Serialize self to dictionary.

Returns:

| Type | Description |
|------|-------------|
| Dict | a dictionary |

Source code in presidio_analyzer/analysis_explanation.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return self.__dict__

AnalyzerEngine

Entry point for Presidio Analyzer.

Orchestrating the detection of PII entities and all related logic.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| registry | RecognizerRegistry | instance of type RecognizerRegistry | None |
| nlp_engine | NlpEngine | instance of type NlpEngine (for example SpacyNlpEngine) | None |
| app_tracer | AppTracer | instance of type AppTracer, used to trace the logic used during each request for interpretability reasons | None |
| log_decision_process | bool | defines whether the decision process within the analyzer should be logged or not | False |
| default_score_threshold | float | minimum confidence value for detected entities to be returned | 0 |
| supported_languages | List[str] | list of possible languages this engine could be run on; used for loading the right NLP models and recognizers for these languages | None |
| context_aware_enhancer | Optional[ContextAwareEnhancer] | instance of type ContextAwareEnhancer for enhancing confidence score based on context words (a LemmaContextAwareEnhancer is created by default if None is passed) | None |
Source code in presidio_analyzer/analyzer_engine.py
class AnalyzerEngine:
    """
    Entry point for Presidio Analyzer.

    Orchestrating the detection of PII entities and all related logic.

    :param registry: instance of type RecognizerRegistry
    :param nlp_engine: instance of type NlpEngine
    (for example SpacyNlpEngine)
    :param app_tracer: instance of type AppTracer, used to trace the logic
    used during each request for interpretability reasons.
    :param log_decision_process: bool,
    defines whether the decision process within the analyzer should be logged or not.
    :param default_score_threshold: Minimum confidence value
    for detected entities to be returned
    :param supported_languages: List of possible languages this engine could be run on.
    Used for loading the right NLP models and recognizers for these languages.
    :param context_aware_enhancer: instance of type ContextAwareEnhancer for enhancing
    confidence score based on context words, (LemmaContextAwareEnhancer will be created
    by default if None passed)
    """

    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None,
    ):
        if not supported_languages:
            supported_languages = ["en"]

        if not nlp_engine:
            logger.info("nlp_engine not provided, creating default.")
            provider = NlpEngineProvider()
            nlp_engine = provider.create_engine()

        if not registry:
            logger.info("registry not provided, creating default.")
            registry = RecognizerRegistry()
        if not app_tracer:
            app_tracer = AppTracer()
        self.app_tracer = app_tracer

        self.supported_languages = supported_languages

        self.nlp_engine = nlp_engine
        if not self.nlp_engine.is_loaded():
            self.nlp_engine.load()

        self.registry = registry

        # load all recognizers
        if not registry.recognizers:
            registry.load_predefined_recognizers(
                nlp_engine=self.nlp_engine, languages=self.supported_languages
            )

        self.log_decision_process = log_decision_process
        self.default_score_threshold = default_score_threshold

        if not context_aware_enhancer:
            logger.debug(
                "context aware enhancer not provided, creating default"
                + " lemma based enhancer."
            )
            context_aware_enhancer = LemmaContextAwareEnhancer()

        self.context_aware_enhancer = context_aware_enhancer

    def get_recognizers(self, language: Optional[str] = None) -> List[EntityRecognizer]:
        """
        Return a list of PII recognizers currently loaded.

        :param language: Return the recognizers supporting a given language.
        :return: List of [Recognizer] as a RecognizersAllResponse
        """
        if not language:
            languages = self.supported_languages
        else:
            languages = [language]

        recognizers = []
        for language in languages:
            logger.info(f"Fetching all recognizers for language {language}")
            recognizers.extend(
                self.registry.get_recognizers(language=language, all_fields=True)
            )

        return list(set(recognizers))

    def get_supported_entities(self, language: Optional[str] = None) -> List[str]:
        """
        Return a list of the entities that can be detected.

        :param language: Return only entities supported in a specific language.
        :return: List of entity names
        """
        recognizers = self.get_recognizers(language=language)
        supported_entities = []
        for recognizer in recognizers:
            supported_entities.extend(recognizer.get_supported_entities())

        return list(set(supported_entities))

    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:
        """
        Find PII entities in text using different PII recognizers for a given language.

        :param text: the text to analyze
        :param language: the language of the text
        :param entities: List of PII entities that should be looked for in the text.
        If entities=None then all entities are looked for.
        :param correlation_id: cross call ID for this request
        :param score_threshold: A minimum value for which
        to return an identified entity
        :param return_decision_process: Whether the analysis decision process steps
        are returned in the response.
        :param ad_hoc_recognizers: List of recognizers which will be used only
        for this specific request.
        :param context: List of context words to enhance confidence score if matched
        with the recognized entity's recognizer context
        :param allow_list: List of words that the user defines as being allowed to keep
        in the text
        :param nlp_artifacts: precomputed NlpArtifacts
        :return: an array of the found entities in the text

        :example:

        >>> from presidio_analyzer import AnalyzerEngine

        >>> # Set up the engine, loads the NLP module (spaCy model by default)
        >>> # and other PII recognizers
        >>> analyzer = AnalyzerEngine()

        >>> # Call analyzer to get results
        >>> results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en') # noqa D501
        >>> print(results)
        [type: PHONE_NUMBER, start: 19, end: 31, score: 0.85]
        """
        all_fields = not entities

        recognizers = self.registry.get_recognizers(
            language=language,
            entities=entities,
            all_fields=all_fields,
            ad_hoc_recognizers=ad_hoc_recognizers,
        )

        if all_fields:
            # Since all_fields=True, list all entities by iterating
            # over all recognizers
            entities = self.get_supported_entities(language=language)

        # run the nlp pipeline over the given text, store the results in
        # a NlpArtifacts instance
        if not nlp_artifacts:
            nlp_artifacts = self.nlp_engine.process_text(text, language)

        if self.log_decision_process:
            self.app_tracer.trace(
                correlation_id, "nlp artifacts:" + nlp_artifacts.to_json()
            )

        results = []
        for recognizer in recognizers:
            # Lazy loading of the relevant recognizers
            if not recognizer.is_loaded:
                recognizer.load()
                recognizer.is_loaded = True

            # analyze using the current recognizer and append the results
            current_results = recognizer.analyze(
                text=text, entities=entities, nlp_artifacts=nlp_artifacts
            )
            if current_results:
                # add recognizer name to recognition metadata inside results
                # if not exists
                self.__add_recognizer_id_if_not_exists(current_results, recognizer)
                results.extend(current_results)

        results = self._enhance_using_context(
            text, results, nlp_artifacts, recognizers, context
        )

        if self.log_decision_process:
            self.app_tracer.trace(
                correlation_id,
                json.dumps([str(result.to_dict()) for result in results]),
            )

        # Remove duplicates or low score results
        results = EntityRecognizer.remove_duplicates(results)
        results = self.__remove_low_scores(results, score_threshold)

        if allow_list:
            results = self._remove_allow_list(results, allow_list, text)

        if not return_decision_process:
            results = self.__remove_decision_process(results)

        return results

    def _enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Enhance confidence score using context words.

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The nlp artifacts contains elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """
        results = []

        for recognizer in recognizers:
            recognizer_results = [
                r
                for r in raw_results
                if r.recognition_metadata[RecognizerResult.RECOGNIZER_IDENTIFIER_KEY]
                == recognizer.id
            ]
            other_recognizer_results = [
                r
                for r in raw_results
                if r.recognition_metadata[RecognizerResult.RECOGNIZER_IDENTIFIER_KEY]
                != recognizer.id
            ]

            # enhance score using context in recognizer level if implemented
            recognizer_results = recognizer.enhance_using_context(
                text=text,
                # each recognizer will get access to all recognizer results
                # to allow related entities context enhancement
                raw_recognizer_results=recognizer_results,
                other_raw_recognizer_results=other_recognizer_results,
                nlp_artifacts=nlp_artifacts,
                context=context,
            )

            results.extend(recognizer_results)

        # Update results in case surrounding words or external context are relevant to
        # the context words.
        results = self.context_aware_enhancer.enhance_using_context(
            text=text,
            raw_results=results,
            nlp_artifacts=nlp_artifacts,
            recognizers=recognizers,
            context=context,
        )

        return results

    def __remove_low_scores(
        self, results: List[RecognizerResult], score_threshold: float = None
    ) -> List[RecognizerResult]:
        """
        Remove results for which the confidence is lower than the threshold.

        :param results: List of RecognizerResult
        :param score_threshold: float value for minimum possible confidence
        :return: List[RecognizerResult]
        """
        if score_threshold is None:
            score_threshold = self.default_score_threshold

        new_results = [result for result in results if result.score >= score_threshold]
        return new_results

    @staticmethod
    def _remove_allow_list(
        results: List[RecognizerResult], allow_list: List[str], text: str
    ) -> List[RecognizerResult]:
        """
        Remove results which are part of the allow list.

        :param results: List of RecognizerResult
        :param allow_list: list of allowed terms
        :param text: the text to analyze
        :return: List[RecognizerResult]
        """
        new_results = []
        for result in results:
            word = text[result.start : result.end]
            # if the word is not specified to be allowed, keep in the PII entities
            if word not in allow_list:
                new_results.append(result)

        return new_results

    @staticmethod
    def __add_recognizer_id_if_not_exists(
        results: List[RecognizerResult], recognizer: EntityRecognizer
    ) -> None:
        """Ensure recognition metadata with recognizer id existence.

        Ensure recognizer result list contains recognizer id inside recognition
        metadata dictionary, and if not create it. recognizer_id is needed
        for context aware enhancement.

        :param results: List of RecognizerResult
        :param recognizer: Entity recognizer
        """
        for result in results:
            if not result.recognition_metadata:
                result.recognition_metadata = dict()
            if (
                RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                not in result.recognition_metadata
            ):
                result.recognition_metadata[
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                ] = recognizer.id
            if RecognizerResult.RECOGNIZER_NAME_KEY not in result.recognition_metadata:
                result.recognition_metadata[
                    RecognizerResult.RECOGNIZER_NAME_KEY
                ] = recognizer.name

    @staticmethod
    def __remove_decision_process(
        results: List[RecognizerResult],
    ) -> List[RecognizerResult]:
        """Remove decision process / analysis explanation from response."""

        for result in results:
            result.analysis_explanation = None

        return results

__add_recognizer_id_if_not_exists(results, recognizer) staticmethod

Ensure recognition metadata contains the recognizer id.

Ensure each recognizer result carries the recognizer id inside its recognition metadata dictionary, creating the entry if missing. The recognizer_id is needed for context-aware enhancement.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| results | List[RecognizerResult] | List of RecognizerResult | required |
| recognizer | EntityRecognizer | Entity recognizer | required |
Source code in presidio_analyzer/analyzer_engine.py
@staticmethod
def __add_recognizer_id_if_not_exists(
    results: List[RecognizerResult], recognizer: EntityRecognizer
) -> None:
    """Ensure recognition metadata with recognizer id existence.

    Ensure recognizer result list contains recognizer id inside recognition
    metadata dictionary, and if not create it. recognizer_id is needed
    for context aware enhancement.

    :param results: List of RecognizerResult
    :param recognizer: Entity recognizer
    """
    for result in results:
        if not result.recognition_metadata:
            result.recognition_metadata = dict()
        if (
            RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
            not in result.recognition_metadata
        ):
            result.recognition_metadata[
                RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
            ] = recognizer.id
        if RecognizerResult.RECOGNIZER_NAME_KEY not in result.recognition_metadata:
            result.recognition_metadata[
                RecognizerResult.RECOGNIZER_NAME_KEY
            ] = recognizer.name
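The "write only if absent" behavior above can be sketched with plain dictionaries. The key names and recognizer values below are illustrative stand-ins for the RecognizerResult constants:

```python
# Sketch of the metadata defaulting: recognizer id and name are written into
# each result's recognition_metadata only when the keys are absent, so values
# set earlier by a recognizer are preserved. Key names are illustrative.
RECOGNIZER_IDENTIFIER_KEY = "recognizer_identifier"
RECOGNIZER_NAME_KEY = "recognizer_name"


def add_recognizer_id_if_not_exists(results, recognizer_id, recognizer_name):
    for result in results:
        metadata = result.setdefault("recognition_metadata", {})
        metadata.setdefault(RECOGNIZER_IDENTIFIER_KEY, recognizer_id)
        metadata.setdefault(RECOGNIZER_NAME_KEY, recognizer_name)


results = [{"entity_type": "PERSON"}]
add_recognizer_id_if_not_exists(results, 42, "ExampleRecognizer")
print(results[0]["recognition_metadata"][RECOGNIZER_NAME_KEY])  # ExampleRecognizer
```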

__remove_decision_process(results) staticmethod

Remove decision process / analysis explanation from response.

Source code in presidio_analyzer/analyzer_engine.py
@staticmethod
def __remove_decision_process(
    results: List[RecognizerResult],
) -> List[RecognizerResult]:
    """Remove decision process / analysis explanation from response."""

    for result in results:
        result.analysis_explanation = None

    return results

__remove_low_scores(results, score_threshold=None)

Remove results for which the confidence is lower than the threshold.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| results | List[RecognizerResult] | List of RecognizerResult | required |
| score_threshold | float | float value for minimum possible confidence | None |

Returns:

| Type | Description |
|------|-------------|
| List[RecognizerResult] | List[RecognizerResult] |

Source code in presidio_analyzer/analyzer_engine.py
def __remove_low_scores(
    self, results: List[RecognizerResult], score_threshold: float = None
) -> List[RecognizerResult]:
    """
    Remove results for which the confidence is lower than the threshold.

    :param results: List of RecognizerResult
    :param score_threshold: float value for minimum possible confidence
    :return: List[RecognizerResult]
    """
    if score_threshold is None:
        score_threshold = self.default_score_threshold

    new_results = [result for result in results if result.score >= score_threshold]
    return new_results
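The filter above can be sketched standalone; results are modeled as plain dicts for illustration, and the default threshold stands in for the engine's `default_score_threshold`:

```python
# Sketch of the low-score filter: results under the threshold are dropped,
# and when no per-call threshold is given the engine-wide default applies.
def remove_low_scores(results, score_threshold, default_threshold=0.0):
    if score_threshold is None:
        score_threshold = default_threshold
    return [r for r in results if r["score"] >= score_threshold]


results = [
    {"entity_type": "PHONE_NUMBER", "score": 0.85},
    {"entity_type": "PERSON", "score": 0.4},
]
print(remove_low_scores(results, 0.5))  # only the 0.85 result survives
```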

analyze(text, language, entities=None, correlation_id=None, score_threshold=None, return_decision_process=False, ad_hoc_recognizers=None, context=None, allow_list=None, nlp_artifacts=None)

Find PII entities in text using different PII recognizers for a given language.

:example:

>>> from presidio_analyzer import AnalyzerEngine

>>> # Set up the engine, loads the NLP module (spaCy model by default)
>>> # and other PII recognizers
>>> analyzer = AnalyzerEngine()

>>> # Call analyzer to get results
>>> results = analyzer.analyze(text='My phone number is 212-555-5555',
...                            entities=['PHONE_NUMBER'], language='en')
>>> print(results)
[type: PHONE_NUMBER, start: 19, end: 31, score: 0.85]

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| text | str | the text to analyze | required |
| language | str | the language of the text | required |
| entities | Optional[List[str]] | List of PII entities that should be looked for in the text. If entities=None then all entities are looked for. | None |
| correlation_id | Optional[str] | cross call ID for this request | None |
| score_threshold | Optional[float] | a minimum value for which to return an identified entity | None |
| return_decision_process | Optional[bool] | whether the analysis decision process steps are returned in the response | False |
| ad_hoc_recognizers | Optional[List[EntityRecognizer]] | list of recognizers which will be used only for this specific request | None |
| context | Optional[List[str]] | list of context words to enhance confidence score if matched with the recognized entity's recognizer context | None |
| allow_list | Optional[List[str]] | list of words that the user defines as being allowed to keep in the text | None |
| nlp_artifacts | Optional[NlpArtifacts] | precomputed NlpArtifacts | None |

Returns:

| Type | Description |
|------|-------------|
| List[RecognizerResult] | an array of the found entities in the text |

Source code in presidio_analyzer/analyzer_engine.py
def analyze(
    self,
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    context: Optional[List[str]] = None,
    allow_list: Optional[List[str]] = None,
    nlp_artifacts: Optional[NlpArtifacts] = None,
) -> List[RecognizerResult]:
    """
    Find PII entities in text using different PII recognizers for a given language.

    :param text: the text to analyze
    :param language: the language of the text
    :param entities: List of PII entities that should be looked for in the text.
    If entities=None then all entities are looked for.
    :param correlation_id: cross call ID for this request
    :param score_threshold: A minimum value for which
    to return an identified entity
    :param return_decision_process: Whether the analysis decision process steps
    are returned in the response.
    :param ad_hoc_recognizers: List of recognizers which will be used only
    for this specific request.
    :param context: List of context words to enhance confidence score if matched
    with the recognized entity's recognizer context
    :param allow_list: List of words that the user defines as being allowed to keep
    in the text
    :param nlp_artifacts: precomputed NlpArtifacts
    :return: an array of the found entities in the text

    :example:

    >>> from presidio_analyzer import AnalyzerEngine

    >>> # Set up the engine, loads the NLP module (spaCy model by default)
    >>> # and other PII recognizers
    >>> analyzer = AnalyzerEngine()

    >>> # Call analyzer to get results
    >>> results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en') # noqa D501
    >>> print(results)
    [type: PHONE_NUMBER, start: 19, end: 31, score: 0.85]
    """
    all_fields = not entities

    recognizers = self.registry.get_recognizers(
        language=language,
        entities=entities,
        all_fields=all_fields,
        ad_hoc_recognizers=ad_hoc_recognizers,
    )

    if all_fields:
        # Since all_fields=True, list all entities by iterating
        # over all recognizers
        entities = self.get_supported_entities(language=language)

    # run the nlp pipeline over the given text, store the results in
    # a NlpArtifacts instance
    if not nlp_artifacts:
        nlp_artifacts = self.nlp_engine.process_text(text, language)

    if self.log_decision_process:
        self.app_tracer.trace(
            correlation_id, "nlp artifacts:" + nlp_artifacts.to_json()
        )

    results = []
    for recognizer in recognizers:
        # Lazy loading of the relevant recognizers
        if not recognizer.is_loaded:
            recognizer.load()
            recognizer.is_loaded = True

        # analyze using the current recognizer and append the results
        current_results = recognizer.analyze(
            text=text, entities=entities, nlp_artifacts=nlp_artifacts
        )
        if current_results:
            # add recognizer name to recognition metadata inside results
            # if not exists
            self.__add_recognizer_id_if_not_exists(current_results, recognizer)
            results.extend(current_results)

    results = self._enhance_using_context(
        text, results, nlp_artifacts, recognizers, context
    )

    if self.log_decision_process:
        self.app_tracer.trace(
            correlation_id,
            json.dumps([str(result.to_dict()) for result in results]),
        )

    # Remove duplicates or low score results
    results = EntityRecognizer.remove_duplicates(results)
    results = self.__remove_low_scores(results, score_threshold)

    if allow_list:
        results = self._remove_allow_list(results, allow_list, text)

    if not return_decision_process:
        results = self.__remove_decision_process(results)

    return results
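The allow_list step near the end of analyze keeps a detection only when its exact surface text is not allow-listed. The sketch below models results as (start, end) spans; the spans are hand-computed for the illustrative string:

```python
# Sketch of the allow-list filter: a detected span whose exact text appears
# in allow_list is dropped from the results.
text = "Contact john@example.com or support@contoso.com"
results = [(8, 24), (28, 47)]          # (start, end) spans of two detections
allow_list = ["support@contoso.com"]   # the user wants to keep this one

kept = [(s, e) for (s, e) in results if text[s:e] not in allow_list]
print(kept)  # [(8, 24)]
```

Note the comparison is on the exact substring, so the allow list is case- and whitespace-sensitive.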

get_recognizers(language=None)

Return a list of PII recognizers currently loaded.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| language | Optional[str] | Return the recognizers supporting a given language. | None |

Returns:

| Type | Description |
|------|-------------|
| List[EntityRecognizer] | List of [Recognizer] as a RecognizersAllResponse |

Source code in presidio_analyzer/analyzer_engine.py
def get_recognizers(self, language: Optional[str] = None) -> List[EntityRecognizer]:
    """
    Return a list of PII recognizers currently loaded.

    :param language: Return the recognizers supporting a given language.
    :return: List of [Recognizer] as a RecognizersAllResponse
    """
    if not language:
        languages = self.supported_languages
    else:
        languages = [language]

    recognizers = []
    for language in languages:
        logger.info(f"Fetching all recognizers for language {language}")
        recognizers.extend(
            self.registry.get_recognizers(language=language, all_fields=True)
        )

    return list(set(recognizers))

get_supported_entities(language=None)

Return a list of the entities that can be detected.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| language | Optional[str] | Return only entities supported in a specific language. | None |

Returns:

| Type | Description |
|------|-------------|
| List[str] | List of entity names |

Source code in presidio_analyzer/analyzer_engine.py
def get_supported_entities(self, language: Optional[str] = None) -> List[str]:
    """
    Return a list of the entities that can be detected.

    :param language: Return only entities supported in a specific language.
    :return: List of entity names
    """
    recognizers = self.get_recognizers(language=language)
    supported_entities = []
    for recognizer in recognizers:
        supported_entities.extend(recognizer.get_supported_entities())

    return list(set(supported_entities))
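The aggregation in `get_supported_entities` — extend a list across recognizers, then deduplicate with `set` — can be reproduced standalone. `StubRecognizer` is a hypothetical stand-in; real recognizers are `EntityRecognizer` subclasses:

```python
class StubRecognizer:
    # Hypothetical stand-in for an EntityRecognizer (illustrative only).
    def __init__(self, entities):
        self._entities = entities

    def get_supported_entities(self):
        return self._entities

def supported_entities(recognizers):
    """Aggregate entities across recognizers, deduplicated as in the engine."""
    entities = []
    for recognizer in recognizers:
        entities.extend(recognizer.get_supported_entities())
    return list(set(entities))

recs = [
    StubRecognizer(["PERSON", "EMAIL_ADDRESS"]),
    StubRecognizer(["PERSON", "PHONE_NUMBER"]),
]
entities = supported_entities(recs)
```

Note that, because of the `set` round-trip, the order of the returned list is not guaranteed.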

AnalyzerRequest

Analyzer request data.

Parameters:

Name Type Description Default
req_data Dict

A request dictionary with the following fields:
text: the text to analyze
language: the language of the text
entities: list of PII entities to look for in the text; if entities=None, all entities are looked for
correlation_id: cross-call ID for this request
score_threshold: a minimum confidence value for which to return an identified entity
log_decision_process: whether the decision points within the analysis should be logged
return_decision_process: whether the decision points within the analysis should be returned as part of the response

required
Source code in presidio_analyzer/analyzer_request.py
class AnalyzerRequest:
    """
    Analyzer request data.

    :param req_data: A request dictionary with the following fields:
        text: the text to analyze
        language: the language of the text
        entities: List of PII entities that should be looked for in the text.
        If entities=None then all entities are looked for.
        correlation_id: cross call ID for this request
        score_threshold: A minimum value for which to return an identified entity
        log_decision_process: Should the decision points within the analysis
        be logged
        return_decision_process: Should the decision points within the analysis
        be returned as part of the response
    """

    def __init__(self, req_data: Dict):
        self.text = req_data.get("text")
        self.language = req_data.get("language")
        self.entities = req_data.get("entities")
        self.correlation_id = req_data.get("correlation_id")
        self.score_threshold = req_data.get("score_threshold")
        self.return_decision_process = req_data.get("return_decision_process")
        ad_hoc_recognizers = req_data.get("ad_hoc_recognizers")
        self.ad_hoc_recognizers = []
        if ad_hoc_recognizers:
            self.ad_hoc_recognizers = [
                PatternRecognizer.from_dict(rec) for rec in ad_hoc_recognizers
            ]
        self.context = req_data.get("context")

BatchAnalyzerEngine

Batch analysis of documents (tables, lists, dicts).

Wrapper class to run Presidio Analyzer Engine on multiple values, either lists/iterators of strings, or dictionaries.

Source code in presidio_analyzer/batch_analyzer_engine.py
class BatchAnalyzerEngine:
    """
    Batch analysis of documents (tables, lists, dicts).

    Wrapper class to run Presidio Analyzer Engine on multiple values,
    either lists/iterators of strings, or dictionaries.

    :param analyzer_engine: AnalyzerEngine instance to use
    for handling the values in those collections.
    """

    def __init__(self, analyzer_engine: Optional[AnalyzerEngine] = None):

        self.analyzer_engine = analyzer_engine
        if not analyzer_engine:
            self.analyzer_engine = AnalyzerEngine()

    def analyze_iterator(
        self,
        texts: Iterable[Union[str, bool, float, int]],
        language: str,
        **kwargs,
    ) -> List[List[RecognizerResult]]:
        """
        Analyze an iterable of strings.

        :param texts: A list of strings to be analyzed.
        :param language: Input language
        :param kwargs: Additional parameters for the `AnalyzerEngine.analyze` method.
        """

        # validate types
        texts = self._validate_types(texts)

        # Process the texts as batch for improved performance
        nlp_artifacts_batch: Iterator[
            Tuple[str, NlpArtifacts]
        ] = self.analyzer_engine.nlp_engine.process_batch(
            texts=texts, language=language
        )

        list_results = []
        for text, nlp_artifacts in nlp_artifacts_batch:
            results = self.analyzer_engine.analyze(
                text=str(text), nlp_artifacts=nlp_artifacts, language=language, **kwargs
            )

            list_results.append(results)

        return list_results

    def analyze_dict(
        self,
        input_dict: Dict[str, Union[Any, Iterable[Any]]],
        language: str,
        keys_to_skip: Optional[List[str]] = None,
        **kwargs,
    ) -> Iterator[DictAnalyzerResult]:
        """
        Analyze a dictionary of keys (strings) and values/iterable of values.

        Non-string values are returned as is.

        :param input_dict: The input dictionary for analysis
        :param language: Input language
        :param keys_to_skip: Keys to ignore during analysis
        :param kwargs: Additional keyword arguments
        for the `AnalyzerEngine.analyze` method.
        Use this to pass arguments to the analyze method,
        such as `ad_hoc_recognizers`, `context`, `return_decision_process`.
        See `AnalyzerEngine.analyze` for the full list.
        """

        context = []
        if "context" in kwargs:
            context = kwargs["context"]
            del kwargs["context"]

        if not keys_to_skip:
            keys_to_skip = []

        for key, value in input_dict.items():
            if not value or key in keys_to_skip:
                yield DictAnalyzerResult(key=key, value=value, recognizer_results=[])
                continue  # skip this key as requested

            # Add the key as an additional context
            specific_context = context[:]
            specific_context.append(key)

            if type(value) in (str, int, bool, float):
                results: List[RecognizerResult] = self.analyzer_engine.analyze(
                    text=str(value), language=language, context=[key], **kwargs
                )
            elif isinstance(value, dict):
                new_keys_to_skip = self._get_nested_keys_to_skip(key, keys_to_skip)
                results = self.analyze_dict(
                    input_dict=value,
                    language=language,
                    context=specific_context,
                    keys_to_skip=new_keys_to_skip,
                    **kwargs,
                )
            elif isinstance(value, Iterable):
                # Analyze each element of the iterable (e.g. a list of strings)

                results: List[List[RecognizerResult]] = self.analyze_iterator(
                    texts=value,
                    language=language,
                    context=specific_context,
                    **kwargs,
                )
            else:
                raise ValueError(f"type {type(value)} is unsupported.")

            yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)

    @staticmethod
    def _validate_types(value_iterator: Iterable[Any]) -> Iterator[Any]:
        for val in value_iterator:
            if val and not type(val) in (int, float, bool, str):
                err_msg = (
                    "Analyzer.analyze_iterator only works "
                    "on primitive types (int, float, bool, str). "
                    "Lists of objects are not yet supported."
                )
                logger.error(err_msg)
                raise ValueError(err_msg)
            yield val

    @staticmethod
    def _get_nested_keys_to_skip(key, keys_to_skip):
        new_keys_to_skip = [
            k.replace(f"{key}.", "") for k in keys_to_skip if k.startswith(key)
        ]
        return new_keys_to_skip
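The `_get_nested_keys_to_skip` helper shown above rewrites dotted key paths when `analyze_dict` recurses into a nested dictionary, so the recursive call sees keys relative to the nested dict. The behavior can be reproduced standalone:

```python
def get_nested_keys_to_skip(key, keys_to_skip):
    """Keep only skip-paths under `key`, with the `key.` prefix stripped,
    mirroring BatchAnalyzerEngine._get_nested_keys_to_skip."""
    return [k.replace(f"{key}.", "") for k in keys_to_skip if k.startswith(key)]

nested = get_nested_keys_to_skip(
    "user", ["user.name", "user.address.city", "other"]
)
```

Here `"other"` is dropped (it does not belong to the `user` subtree) and the remaining paths lose their `user.` prefix, so a further level of nesting can be handled by the same rule.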

analyze_dict(input_dict, language, keys_to_skip=None, **kwargs)

Analyze a dictionary of keys (strings) and values/iterable of values.

Non-string values are returned as is.

Parameters:

Name Type Description Default
input_dict Dict[str, Union[Any, Iterable[Any]]]

The input dictionary for analysis

required
language str

Input language

required
keys_to_skip Optional[List[str]]

Keys to ignore during analysis

None
kwargs

Additional keyword arguments for the AnalyzerEngine.analyze method. Use this to pass arguments to the analyze method, such as ad_hoc_recognizers, context, return_decision_process. See AnalyzerEngine.analyze for the full list.

{}
Source code in presidio_analyzer/batch_analyzer_engine.py
def analyze_dict(
    self,
    input_dict: Dict[str, Union[Any, Iterable[Any]]],
    language: str,
    keys_to_skip: Optional[List[str]] = None,
    **kwargs,
) -> Iterator[DictAnalyzerResult]:
    """
    Analyze a dictionary of keys (strings) and values/iterable of values.

    Non-string values are returned as is.

    :param input_dict: The input dictionary for analysis
    :param language: Input language
    :param keys_to_skip: Keys to ignore during analysis
    :param kwargs: Additional keyword arguments
    for the `AnalyzerEngine.analyze` method.
    Use this to pass arguments to the analyze method,
    such as `ad_hoc_recognizers`, `context`, `return_decision_process`.
    See `AnalyzerEngine.analyze` for the full list.
    """

    context = []
    if "context" in kwargs:
        context = kwargs["context"]
        del kwargs["context"]

    if not keys_to_skip:
        keys_to_skip = []

    for key, value in input_dict.items():
        if not value or key in keys_to_skip:
            yield DictAnalyzerResult(key=key, value=value, recognizer_results=[])
            continue  # skip this key as requested

        # Add the key as an additional context
        specific_context = context[:]
        specific_context.append(key)

        if type(value) in (str, int, bool, float):
            results: List[RecognizerResult] = self.analyzer_engine.analyze(
                text=str(value), language=language, context=[key], **kwargs
            )
        elif isinstance(value, dict):
            new_keys_to_skip = self._get_nested_keys_to_skip(key, keys_to_skip)
            results = self.analyze_dict(
                input_dict=value,
                language=language,
                context=specific_context,
                keys_to_skip=new_keys_to_skip,
                **kwargs,
            )
        elif isinstance(value, Iterable):
            # Analyze each element of the iterable (e.g. a list of strings)

            results: List[List[RecognizerResult]] = self.analyze_iterator(
                texts=value,
                language=language,
                context=specific_context,
                **kwargs,
            )
        else:
            raise ValueError(f"type {type(value)} is unsupported.")

        yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)
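The dispatch logic above routes each value by type: primitives go straight to `analyze`, nested dicts recurse, and other iterables are analyzed element-wise. The order of the checks matters, since `str` and `dict` are themselves iterable. A minimal sketch of the routing, with a made-up `"analyzed:"` tag standing in for the real analyzer call:

```python
from collections.abc import Iterable

def analyze_value(value):
    """Route a value the way analyze_dict does. The f-string below is a
    hypothetical stand-in for AnalyzerEngine.analyze (illustrative only)."""
    # Primitives first: str and dict would otherwise match Iterable below.
    if type(value) in (str, int, bool, float):
        return f"analyzed:{value}"
    elif isinstance(value, dict):
        return {k: analyze_value(v) for k, v in value.items()}
    elif isinstance(value, Iterable):
        return [analyze_value(v) for v in value]
    raise ValueError(f"type {type(value)} is unsupported.")

out = analyze_value(
    {"name": "John", "phones": ["212-555-5555"], "meta": {"age": 40}}
)
```

The nested `meta` dict is recursed into, while the `phones` list is handled element by element, matching the `analyze_iterator` branch in the real method.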

analyze_iterator(texts, language, **kwargs)

Analyze an iterable of strings.

Parameters:

Name Type Description Default
texts Iterable[Union[str, bool, float, int]]

A list of strings to be analyzed.

required
language str

Input language

required
kwargs

Additional parameters for the AnalyzerEngine.analyze method.

{}
Source code in presidio_analyzer/batch_analyzer_engine.py
def analyze_iterator(
    self,
    texts: Iterable[Union[str, bool, float, int]],
    language: str,
    **kwargs,
) -> List[List[RecognizerResult]]:
    """
    Analyze an iterable of strings.

    :param texts: A list of strings to be analyzed.
    :param language: Input language
    :param kwargs: Additional parameters for the `AnalyzerEngine.analyze` method.
    """

    # validate types
    texts = self._validate_types(texts)

    # Process the texts as batch for improved performance
    nlp_artifacts_batch: Iterator[
        Tuple[str, NlpArtifacts]
    ] = self.analyzer_engine.nlp_engine.process_batch(
        texts=texts, language=language
    )

    list_results = []
    for text, nlp_artifacts in nlp_artifacts_batch:
        results = self.analyzer_engine.analyze(
            text=str(text), nlp_artifacts=nlp_artifacts, language=language, **kwargs
        )

        list_results.append(results)

    return list_results
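Before batching, `analyze_iterator` runs its inputs through `_validate_types`, a generator that rejects anything that is not a primitive. The check can be sketched in isolation:

```python
def validate_types(values):
    """Generator mirroring the batch engine's type check:
    only primitives (int, float, bool, str) — or falsy values — pass."""
    for val in values:
        if val and not type(val) in (int, float, bool, str):
            raise ValueError(
                "only primitive types (int, float, bool, str) are supported"
            )
        yield val

ok = list(validate_types(["a", 3, None, 2.5]))
```

Note the `if val and ...` guard: falsy values such as `None` or `""` pass through untouched, while a truthy non-primitive (e.g. an object or a nested list) raises `ValueError` only once the generator is consumed.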

ContextAwareEnhancer

A class representing an abstract context aware enhancer.

Context words can raise the confidence score of a recognized entity. ContextAwareEnhancer is an abstract class to be inherited by context-aware enhancer logic.

Parameters:

Name Type Description Default
context_similarity_factor float

How much to enhance the confidence of a matched entity

required
min_score_with_context_similarity float

Minimum confidence score

required
context_prefix_count int

how many words before the entity to match context

required
context_suffix_count int

how many words after the entity to match context

required
Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
class ContextAwareEnhancer:
    """
    A class representing an abstract context aware enhancer.

    Context words can raise the confidence score of a recognized entity.
    ContextAwareEnhancer is an abstract class to be inherited by
    context-aware enhancer logic.

    :param context_similarity_factor: How much to enhance the confidence of a matched entity
    :param min_score_with_context_similarity: Minimum confidence score
    :param context_prefix_count: how many words before the entity to match context
    :param context_suffix_count: how many words after the entity to match context
    """

    MIN_SCORE = 0
    MAX_SCORE = 1.0

    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int,
    ):

        self.context_similarity_factor = context_similarity_factor
        self.min_score_with_context_similarity = min_score_with_context_similarity
        self.context_prefix_count = context_prefix_count
        self.context_suffix_count = context_suffix_count

    @abstractmethod
    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Update results in case surrounding words are relevant to the context words.

        Using the surrounding words of the actual word matches, look
        for specific strings that, if found, contribute to the score
        of the result, improving the confidence that the match is
        indeed of that PII entity type.

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The nlp artifacts contain elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """
        return raw_results

enhance_using_context(text, raw_results, nlp_artifacts, recognizers, context=None) abstractmethod

Update results in case surrounding words are relevant to the context words.

Using the surrounding words of the actual word matches, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.

Parameters:

Name Type Description Default
text str

The actual text that was analyzed

required
raw_results List[RecognizerResult]

Recognizer results which didn't take context into consideration

required
nlp_artifacts NlpArtifacts

The nlp artifacts contain elements such as lemmatized tokens, for better accuracy of the context enhancement process

required
recognizers List[EntityRecognizer]

the list of recognizers

required
context Optional[List[str]]

list of context words

None
Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
@abstractmethod
def enhance_using_context(
    self,
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """
    Update results in case surrounding words are relevant to the context words.

    Using the surrounding words of the actual word matches, look
    for specific strings that, if found, contribute to the score
    of the result, improving the confidence that the match is
    indeed of that PII entity type.

    :param text: The actual text that was analyzed
    :param raw_results: Recognizer results which didn't take
                        context into consideration
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param recognizers: the list of recognizers
    :param context: list of context words
    """
    return raw_results
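A concrete enhancer scans the words around each match and boosts the score when a context word is found nearby. The sketch below shows the idea with a crude word window around character offsets; it is not the Presidio `LemmaContextAwareEnhancer` implementation (which works on lemmatized tokens from `nlp_artifacts`), and the factor, floor, and window values are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Result:
    # Hypothetical stand-in for RecognizerResult (illustrative only).
    entity_type: str
    start: int
    end: int
    score: float

def enhance_using_context(text, results, context_words,
                          similarity_factor=0.35, min_score=0.4, window=5):
    """Boost scores when a context word appears within `window` words
    of the match (simplified: real enhancers use lemmas, not raw splits)."""
    for result in results:
        prefix = text[:result.start].lower().split()[-window:]
        suffix = text[result.end:].lower().split()[:window]
        if any(w in context_words for w in prefix + suffix):
            boosted = min(result.score + similarity_factor, 1.0)
            result.score = max(boosted, min_score)
    return results

text = "my phone number is 212-555-5555"
results = [Result("PHONE_NUMBER", 19, 31, 0.4)]
enhanced = enhance_using_context(text, results, {"phone"})
```

The word "phone" appears in the five words preceding the match, so the score is raised from 0.4 to 0.75, capped at `MAX_SCORE` (1.0) and floored at `min_score_with_context_similarity`, as in the real enhancer's parameters.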

DictAnalyzerResult dataclass

Data class for holding the output of the Presidio Analyzer on dictionaries.

Parameters:

Name Type Description Default
key str

key in dictionary

required
value Union[str, List[str], dict]

value to run analysis on (either string or list of strings)

required
recognizer_results Union[List[RecognizerResult], List[List[RecognizerResult]], Iterator[DictAnalyzerResult]]

Analyzer output for one value. Could be either:
- A list of recognizer results, if the input is one string
- A list of lists of recognizer results, if the input is a list of strings
- An iterator of DictAnalyzerResult, if the input is a dictionary; in this case recognizer_results is the iterator over the DictAnalyzerResults of the next level in the dictionary

required
Source code in presidio_analyzer/dict_analyzer_result.py
@dataclass
class DictAnalyzerResult:
    """
    Data class for holding the output of the Presidio Analyzer on dictionaries.

    :param key: key in dictionary
    :param value: value to run analysis on (either string or list of strings)
    :param recognizer_results: Analyzer output for one value.
    Could be either:
     - A list of recognizer results if the input is one string
     - A list of lists of recognizer results, if the input is a list of strings.
     - An iterator of DictAnalyzerResult, if the input is a dictionary.
     In this case recognizer_results is the iterator over the
     DictAnalyzerResults of the next level in the dictionary.
    """

    key: str
    value: Union[str, List[str], dict]
    recognizer_results: Union[
        List[RecognizerResult],
        List[List[RecognizerResult]],
        Iterator["DictAnalyzerResult"],
    ]
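The three shapes of `recognizer_results` mirror the three input shapes of `analyze_dict`. A small illustration, using strings as hypothetical stand-ins for `RecognizerResult` objects:

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class DictAnalyzerResult:
    # Structural mirror of the real dataclass; results here are plain
    # strings standing in for RecognizerResult objects (illustrative only).
    key: str
    value: Union[str, List[str], dict]
    recognizer_results: object

# One string value -> one list of results.
single = DictAnalyzerResult(
    key="name", value="John", recognizer_results=["PERSON@0:4"]
)

# A list of strings -> one list of results per element.
multi = DictAnalyzerResult(
    key="phones",
    value=["212-555-5555", "212-555-1234"],
    recognizer_results=[["PHONE_NUMBER@0:12"], ["PHONE_NUMBER@0:12"]],
)
```

For a nested dictionary value, `recognizer_results` would instead be an iterator yielding further `DictAnalyzerResult` objects, one per nested key.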

EntityRecognizer

A class representing an abstract PII entity recognizer.

EntityRecognizer is an abstract class to be inherited by Recognizers which hold the logic for recognizing specific PII entities.

EntityRecognizer exposes a method called enhance_using_context which can be overridden in case a custom context-aware enhancement is needed in a derived recognizer class.

Parameters:

Name Type Description Default
supported_entities List[str]

the entities supported by this recognizer (for example, phone number, address, etc.)

required
supported_language str

the language supported by this recognizer. The supported language code is in ISO 639-1 format

'en'
name str

the name of this recognizer (optional)

None
version str

the recognizer's current version

'0.0.1'
context Optional[List[str]]

a list of words which can help boost confidence score when they appear in context of the matched entity

None
Source code in presidio_analyzer/entity_recognizer.py
class EntityRecognizer:
    """
    A class representing an abstract PII entity recognizer.

    EntityRecognizer is an abstract class to be inherited by
    Recognizers which hold the logic for recognizing specific PII entities.

    EntityRecognizer exposes a method called enhance_using_context which
    can be overridden in case a custom context-aware enhancement is needed
    in a derived recognizer class.

    :param supported_entities: the entities supported by this recognizer
    (for example, phone number, address, etc.)
    :param supported_language: the language supported by this recognizer.
    The supported language code is in ISO 639-1 format
    :param name: the name of this recognizer (optional)
    :param version: the recognizer's current version
    :param context: a list of words which can help boost confidence score
    when they appear in context of the matched entity
    """

    MIN_SCORE = 0
    MAX_SCORE = 1.0

    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
    ):

        self.supported_entities = supported_entities

        if name is None:
            self.name = self.__class__.__name__  # assign class name as name
        else:
            self.name = name

        self._id = f"{self.name}_{id(self)}"

        self.supported_language = supported_language
        self.version = version
        self.is_loaded = False
        self.context = context if context else []

        self.load()
        logger.info("Loaded recognizer: %s", self.name)
        self.is_loaded = True

    @property
    def id(self):
        """Return a unique identifier of this recognizer."""

        return self._id

    @abstractmethod
    def load(self) -> None:
        """
        Initialize the recognizer assets if needed.

        (e.g. machine learning models)
        """

    @abstractmethod
    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyze text to identify entities.

        :param text: The text to be analyzed
        :param entities: The list of entities this recognizer is able to detect
        :param nlp_artifacts: A group of attributes which are the result of
        an NLP process over the input text.
        :return: List of results detected by this recognizer.
        """
        return None

    def enhance_using_context(
        self,
        text: str,
        raw_recognizer_results: List[RecognizerResult],
        other_raw_recognizer_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """Enhance confidence score using context of the entity.

        Override this method in a derived class if custom logic
        is needed; otherwise the return value will be equal to
        raw_recognizer_results.

        If a result score is boosted, the derived class needs to update
        result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

        :param text: The actual text that was analyzed
        :param raw_recognizer_results: This recognizer's results, to be updated
        based on recognizer specific context.
        :param other_raw_recognizer_results: Other recognizer results matched in
        the given text to allow related entity context enhancement
        :param nlp_artifacts: The nlp artifacts contain elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param context: list of context words
        """
        return raw_recognizer_results

    def get_supported_entities(self) -> List[str]:
        """
        Return the list of entities this recognizer can identify.

        :return: A list of the entities supported by this recognizer
        """
        return self.supported_entities

    def get_supported_language(self) -> str:
        """
        Return the language this recognizer can support.

        :return: The language supported by this recognizer
        """
        return self.supported_language

    def get_version(self) -> str:
        """
        Return the version of this recognizer.

        :return: The current version of this recognizer
        """
        return self.version

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return_dict = {
            "supported_entities": self.supported_entities,
            "supported_language": self.supported_language,
            "name": self.name,
            "version": self.version,
        }
        return return_dict

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
        """
        Create EntityRecognizer from a dict input.

        :param entity_recognizer_dict: Dict containing keys and values for instantiation
        """
        return cls(**entity_recognizer_dict)

    @staticmethod
    def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
        """
        Remove duplicate results.

        Remove duplicates in case two results
        have identical start, end, and type.
        :param results: List[RecognizerResult]
        :return: List[RecognizerResult]
        """
        results = list(set(results))
        results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
        filtered_results = []

        for result in results:
            if result.score == 0:
                continue

            to_keep = result not in filtered_results  # equals based comparison
            if to_keep:
                for filtered in filtered_results:
                    # If result is contained in one of the other results
                    if (
                        result.contained_in(filtered)
                        and result.entity_type == filtered.entity_type
                    ):
                        to_keep = False
                        break

            if to_keep:
                filtered_results.append(result)

        return filtered_results
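The filtering above ranks results by descending score (ties broken by earlier start and longer span), drops zero-score results, and discards any result fully contained in an already-kept result of the same entity type. A standalone mirror of that logic, with a minimal `contained_in` assumed to match the real `RecognizerResult` semantics:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    # Hypothetical stand-in for RecognizerResult (illustrative only).
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other):
        # Assumed semantics of RecognizerResult.contained_in: full overlap.
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    """Mirror of the static method: drop zero scores, exact duplicates,
    and same-type results contained in a higher-ranked result."""
    results = sorted(set(results),
                     key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for result in results:
        if result.score == 0:
            continue
        if any(result.contained_in(f) and result.entity_type == f.entity_type
               for f in filtered):
            continue
        filtered.append(result)
    return filtered

kept = remove_duplicates([
    Result("PERSON", 0, 10, 0.85),   # kept: highest score
    Result("PERSON", 0, 5, 0.6),     # dropped: contained, same type
    Result("EMAIL_ADDRESS", 0, 5, 0.5),  # kept: contained but different type
    Result("PERSON", 20, 25, 0.0),   # dropped: zero score
])
```

Overlap across different entity types is deliberately preserved, so the same span can surface as both, say, a PERSON and an EMAIL_ADDRESS candidate.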

id property

Return a unique identifier of this recognizer.

analyze(text, entities, nlp_artifacts) abstractmethod

Analyze text to identify entities.

Parameters:

Name Type Description Default
text str

The text to be analyzed

required
entities List[str]

The list of entities this recognizer is able to detect

required
nlp_artifacts NlpArtifacts

A group of attributes which are the result of an NLP process over the input text.

required

Returns:

Type Description
List[RecognizerResult]

List of results detected by this recognizer.

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def analyze(
    self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
) -> List[RecognizerResult]:
    """
    Analyze text to identify entities.

    :param text: The text to be analyzed
    :param entities: The list of entities this recognizer is able to detect
    :param nlp_artifacts: A group of attributes which are the result of
    an NLP process over the input text.
    :return: List of results detected by this recognizer.
    """
    return None

enhance_using_context(text, raw_recognizer_results, other_raw_recognizer_results, nlp_artifacts, context=None)

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

Parameters:

Name Type Description Default
text str

The actual text that was analyzed

required
raw_recognizer_results List[RecognizerResult]

This recognizer's results, to be updated based on recognizer specific context.

required
other_raw_recognizer_results List[RecognizerResult]

Other recognizer results matched in the given text to allow related entity context enhancement

required
nlp_artifacts NlpArtifacts

The nlp artifacts contain elements such as lemmatized tokens, for better accuracy of the context enhancement process

required
context Optional[List[str]]

list of context words

None
Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

from_dict(entity_recognizer_dict) classmethod

Create EntityRecognizer from a dict input.

Parameters:

Name Type Description Default
entity_recognizer_dict Dict

Dict containing keys and values for instantiation

required
Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)
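As the source shows, from_dict simply unpacks the dictionary into the class constructor via cls(**d). The sketch below illustrates that pattern with a hypothetical DemoRecognizer class, not a presidio class.

```python
# from_dict simply unpacks the dictionary into the constructor, i.e. cls(**d).
# DemoRecognizer below is a hypothetical stand-in, not a presidio class.
class DemoRecognizer:
    def __init__(self, supported_entities, supported_language="en", name=None):
        self.supported_entities = supported_entities
        self.supported_language = supported_language
        self.name = name

    @classmethod
    def from_dict(cls, d):
        # keys must match constructor parameter names exactly;
        # an unexpected key raises TypeError
        return cls(**d)

rec = DemoRecognizer.from_dict(
    {"supported_entities": ["PHONE_NUMBER"], "supported_language": "es"}
)
```

Note that the dictionary keys must match the constructor's parameter names exactly; any unexpected key raises a TypeError.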

get_supported_entities()

Return the list of entities this recognizer can identify.

Returns:

Type Description
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language()

Return the language this recognizer can support.

Returns:

Type Description
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version()

Return the version of this recognizer.

Returns:

Type Description
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

load() abstractmethod

Initialize the recognizer assets if needed.

(e.g. machine learning models)

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def load(self) -> None:
    """
    Initialize the recognizer assets if needed.

    (e.g. machine learning models)
    """

remove_duplicates(results) staticmethod

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

Parameters:

Name Type Description Default
results List[RecognizerResult]

List[RecognizerResult]

required

Returns:

Type Description
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
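The filtering above can be sketched in pure Python, with (start, end, entity_type, score) tuples standing in for RecognizerResult objects; containment here means the span lies inside an already-kept span of the same entity type.

```python
# Pure-Python sketch of the filtering above, with (start, end, entity_type,
# score) tuples standing in for RecognizerResult objects.
def remove_duplicates(results):
    # highest score first; ties broken by earlier start, then longer span
    ordered = sorted(set(results), key=lambda r: (-r[3], r[0], -(r[1] - r[0])))
    kept = []
    for start, end, etype, score in ordered:
        if score == 0:
            continue  # zero-score results are dropped outright
        contained = any(
            ks <= start and end <= ke and etype == kt for ks, ke, kt, _ in kept
        )
        if not contained:
            kept.append((start, end, etype, score))
    return kept

deduped = remove_duplicates([
    (0, 10, "PHONE_NUMBER", 0.9),
    (2, 6, "PHONE_NUMBER", 0.5),   # contained in the first result -> removed
    (2, 6, "DATE_TIME", 0.5),      # different entity type -> kept
])
```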

to_dict()

Serialize self to dictionary.

Returns:

Type Description
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

LemmaContextAwareEnhancer

Bases: ContextAwareEnhancer

A class representing lemma-based context-aware enhancement logic.

Context words might enhance the confidence score of a recognized entity. LemmaContextAwareEnhancer is a lemma-based implementation of the context-aware logic: it compares the spaCy lemmas of each word surrounding the matched entity to the given context and the recognizer's context words, and if a match is found it enhances the recognized entity's confidence score by a given factor.

Parameters:

Name Type Description Default
context_similarity_factor float

How much to enhance the confidence of a matched entity

0.35
min_score_with_context_similarity float

Minimum confidence score

0.4
context_prefix_count int

how many words before the entity to match context

5
context_suffix_count int

how many words after the entity to match context

0
Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    """
    A class representing lemma-based context-aware enhancement logic.

    Context words might enhance the confidence score of a recognized entity.
    LemmaContextAwareEnhancer is a lemma-based implementation of the
    context-aware logic: it compares the spaCy lemmas of each word surrounding
    the matched entity to the given context and the recognizer's context words;
    if matched, it enhances the recognized entity's confidence score by a given factor.

    :param context_similarity_factor: How much to enhance the confidence of a matched entity
    :param min_score_with_context_similarity: Minimum confidence score
    :param context_prefix_count: how many words before the entity to match context
    :param context_suffix_count: how many words after the entity to match context
    """

    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0,
    ):
        super().__init__(
            context_similarity_factor=context_similarity_factor,
            min_score_with_context_similarity=min_score_with_context_similarity,
            context_prefix_count=context_prefix_count,
            context_suffix_count=context_suffix_count,
        )

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Update results in case the lemmas of surrounding words or input context
        words are identical to the context words.

        Using the surrounding words of the actual word matches, look
        for specific strings that if found contribute to the score
        of the result, improving the confidence that the match is
        indeed of that PII entity type

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The NLP artifacts contain elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """  # noqa D205 D400

        # create a deep copy of the results object, so we can manipulate it
        results = copy.deepcopy(raw_results)

        # create recognizer context dictionary
        recognizers_dict = {recognizer.id: recognizer for recognizer in recognizers}

        # Create an empty list if None, or lowercase all context words in the list
        if not context:
            context = []
        else:
            context = [word.lower() for word in context]

        # Sanity
        if nlp_artifacts is None:
            logger.warning("NLP artifacts were not provided")
            return results

        for result in results:
            recognizer = None
            # get recognizer matching the result, if found.
            if (
                result.recognition_metadata
                and RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                in result.recognition_metadata.keys()
            ):
                recognizer = recognizers_dict.get(
                    result.recognition_metadata[
                        RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                    ]
                )

            if not recognizer:
                logger.debug(
                    "Recognizer name not found as part of the "
                    "recognition_metadata dict in the RecognizerResult. "
                )
                continue

            # skip recognizer result if the recognizer doesn't support
            # context enhancement
            if not recognizer.context:
                logger.debug(
                    "recognizer '%s' does not support context enhancement",
                    recognizer.name,
                )
                continue

            # skip context enhancement if already boosted by recognizer level
            if result.recognition_metadata.get(
                RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY
            ):
                logger.debug("result score already boosted, skipping")
                continue

            # extract lemmatized context from the surrounding of the match
            word = text[result.start : result.end]

            surrounding_words = self._extract_surrounding_words(
                nlp_artifacts=nlp_artifacts, word=word, start=result.start
            )

            # combine other sources of context with surrounding words
            surrounding_words.extend(context)

            supportive_context_word = self._find_supportive_word_in_context(
                surrounding_words, recognizer.context
            )
            if supportive_context_word != "":
                result.score += self.context_similarity_factor
                result.score = max(result.score, self.min_score_with_context_similarity)
                result.score = min(result.score, ContextAwareEnhancer.MAX_SCORE)

                # Update the explainability object with context information
                # helped to improve the score
                result.analysis_explanation.set_supportive_context_word(
                    supportive_context_word
                )
                result.analysis_explanation.set_improved_score(result.score)
        return results

    @staticmethod
    def _find_supportive_word_in_context(
        context_list: List[str], recognizer_context_list: List[str]
    ) -> str:
        """
        Find words in the text which are relevant for context evaluation.

        A word is considered a supportive context word if there's exact match
        between a keyword in context_text and any keyword in context_list.

        :param context_list: words before and after the matched entity within
               a specified window size
        :param recognizer_context_list: a list of words considered as
                context keywords manually specified by the recognizer's author
        """
        word = ""
        # If the context list is empty, no need to continue
        if context_list is None or recognizer_context_list is None:
            return word

        for predefined_context_word in recognizer_context_list:
            # result == true only if any of the predefined context words
            # is found exactly or as a substring in any of the collected
            # context words
            result = next(
                (
                    True
                    for keyword in context_list
                    if predefined_context_word in keyword
                ),
                False,
            )
            if result:
                logger.debug("Found context keyword '%s'", predefined_context_word)
                word = predefined_context_word
                break

        return word

    def _extract_surrounding_words(
        self, nlp_artifacts: NlpArtifacts, word: str, start: int
    ) -> List[str]:
        """Extract words surrounding another given word.

        The text from which the context is extracted is given in the nlp
        doc.

        :param nlp_artifacts: An abstraction layer which holds different
                              items which are the result of a NLP pipeline
                              execution on a given text
        :param word: The word to look for context around
        :param start: The start index of the word in the original text
        """
        if not nlp_artifacts.tokens:
            logger.info("Skipping context extraction due to lack of NLP artifacts")
            # if there are no nlp artifacts, this is ok; we cannot
            # extract context, so we return a valid, yet empty,
            # context
            return [""]

        # Get the already prepared words in the given text, in their
        # LEMMATIZED version
        lemmatized_keywords = nlp_artifacts.keywords

        # since the list of tokens is not necessarily aligned
        # with the actual index of the match, we look for the
        # token index which corresponds to the match
        token_index = self._find_index_of_match_token(
            word, start, nlp_artifacts.tokens, nlp_artifacts.tokens_indices
        )

        # index i belongs to the PII entity, take the preceding n words
        # and the succeeding m words into a context list

        backward_context = self._add_n_words_backward(
            token_index,
            self.context_prefix_count,
            nlp_artifacts.lemmas,
            lemmatized_keywords,
        )
        forward_context = self._add_n_words_forward(
            token_index,
            self.context_suffix_count,
            nlp_artifacts.lemmas,
            lemmatized_keywords,
        )

        context_list = []
        context_list.extend(backward_context)
        context_list.extend(forward_context)
        context_list = list(set(context_list))
        logger.debug("Context list is: %s", " ".join(context_list))
        return context_list

    @staticmethod
    def _find_index_of_match_token(
        word: str, start: int, tokens, tokens_indices: List[int]  # noqa ANN001
    ) -> int:
        found = False
        # we use the known start index of the original word to find the actual
        # token at that index, we are not checking for equivalence since the
        # token might be just a substring of that word (e.g. for phone number
        # 555-124564 the first token might be just '555' or for a match like '
        # rocket' the actual token will just be 'rocket' hence the misalignment
        # of indices)
        # Note: we are iterating over the original tokens (not the lemmatized)
        i = -1
        for i, token in enumerate(tokens, 0):
            # Either we found a token with the exact location, or
            # we take a token which its characters indices covers
            # the index we are looking for.
            if (tokens_indices[i] == start) or (start < tokens_indices[i] + len(token)):
                # found the interesting token, the one around which
                # we take n words; we save the matching lemma
                found = True
                break

        if not found:
            raise ValueError(
                "Did not find word '" + word + "' "
                "in the list of tokens although it "
                "is expected to be found"
            )
        return i

    @staticmethod
    def _add_n_words(
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
        is_backward: bool,
    ) -> List[str]:
        """
        Prepare a string of context words.

        Return a list of words which surrounds a lemma at a given index.
        The words will be collected only if exist in the filtered array

        :param index: index of the lemma that its surrounding words we want
        :param n_words: number of words to take
        :param lemmas: array of lemmas
        :param lemmatized_filtered_keywords: the array of filtered
               lemmas from the original sentence,
        :param is_backward: if true, take the preceding words; if false,
                            take the succeeding words
        """
        i = index
        context_words = []
        # The entity itself is of no interest to us...however we want to
        # consider it anyway for cases where it is attached with no spaces
        # to an interesting context word, so we allow it and add 1 to
        # the number of collected words

        # collect at most n words (in lower case)
        remaining = n_words + 1
        while 0 <= i < len(lemmas) and remaining > 0:
            lower_lemma = lemmas[i].lower()
            if lower_lemma in lemmatized_filtered_keywords:
                context_words.append(lower_lemma)
                remaining -= 1
            i = i - 1 if is_backward else i + 1
        return context_words

    def _add_n_words_forward(
        self,
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
    ) -> List[str]:
        return self._add_n_words(
            index, n_words, lemmas, lemmatized_filtered_keywords, False
        )

    def _add_n_words_backward(
        self,
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
    ) -> List[str]:
        return self._add_n_words(
            index, n_words, lemmas, lemmatized_filtered_keywords, True
        )
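The backward/forward collection in _add_n_words can be sketched in isolation: walk outward from the matched token and keep lemmas that appear in the filtered keyword list, up to n_words + 1 of them (the extra slot accounts for the entity itself). The lemma list and keyword set below are illustrative.

```python
# Sketch of the _add_n_words windowing in isolation: walk outward from the
# matched token and collect lemmas that appear in the filtered keyword list,
# up to n_words + 1 of them (the extra one accounts for the entity itself).
def add_n_words(index, n_words, lemmas, keywords, is_backward):
    i, collected, remaining = index, [], n_words + 1
    while 0 <= i < len(lemmas) and remaining > 0:
        lemma = lemmas[i].lower()
        if lemma in keywords:
            collected.append(lemma)
            remaining -= 1
        i = i - 1 if is_backward else i + 1
    return collected

# illustrative lemmas for "my credit card number is 4111..."
lemmas = ["my", "credit", "card", "number", "be", "4111"]
keywords = {"credit", "card", "number"}
backward = add_n_words(5, 2, lemmas, keywords, is_backward=True)
```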

enhance_using_context(text, raw_results, nlp_artifacts, recognizers, context=None)

Update results in case the lemmas of surrounding words or input context words are identical to the context words.

Using the words surrounding the actual matches, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.

Parameters:

Name Type Description Default
text str

The actual text that was analyzed

required
raw_results List[RecognizerResult]

Recognizer results which didn't take context into consideration

required
nlp_artifacts NlpArtifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

required
recognizers List[EntityRecognizer]

the list of recognizers

required
context Optional[List[str]]

list of context words

None
Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
def enhance_using_context(
    self,
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """
    Update results in case the lemmas of surrounding words or input context
    words are identical to the context words.

    Using the surrounding words of the actual word matches, look
    for specific strings that if found contribute to the score
    of the result, improving the confidence that the match is
    indeed of that PII entity type

    :param text: The actual text that was analyzed
    :param raw_results: Recognizer results which didn't take
                        context into consideration
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param recognizers: the list of recognizers
    :param context: list of context words
    """  # noqa D205 D400

    # create a deep copy of the results object, so we can manipulate it
    results = copy.deepcopy(raw_results)

    # create recognizer context dictionary
    recognizers_dict = {recognizer.id: recognizer for recognizer in recognizers}

    # Create an empty list if None, or lowercase all context words in the list
    if not context:
        context = []
    else:
        context = [word.lower() for word in context]

    # Sanity
    if nlp_artifacts is None:
        logger.warning("NLP artifacts were not provided")
        return results

    for result in results:
        recognizer = None
        # get recognizer matching the result, if found.
        if (
            result.recognition_metadata
            and RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
            in result.recognition_metadata.keys()
        ):
            recognizer = recognizers_dict.get(
                result.recognition_metadata[
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                ]
            )

        if not recognizer:
            logger.debug(
                "Recognizer name not found as part of the "
                "recognition_metadata dict in the RecognizerResult. "
            )
            continue

        # skip recognizer result if the recognizer doesn't support
        # context enhancement
        if not recognizer.context:
            logger.debug(
                "recognizer '%s' does not support context enhancement",
                recognizer.name,
            )
            continue

        # skip context enhancement if already boosted by recognizer level
        if result.recognition_metadata.get(
            RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY
        ):
            logger.debug("result score already boosted, skipping")
            continue

        # extract lemmatized context from the surrounding of the match
        word = text[result.start : result.end]

        surrounding_words = self._extract_surrounding_words(
            nlp_artifacts=nlp_artifacts, word=word, start=result.start
        )

        # combine other sources of context with surrounding words
        surrounding_words.extend(context)

        supportive_context_word = self._find_supportive_word_in_context(
            surrounding_words, recognizer.context
        )
        if supportive_context_word != "":
            result.score += self.context_similarity_factor
            result.score = max(result.score, self.min_score_with_context_similarity)
            result.score = min(result.score, ContextAwareEnhancer.MAX_SCORE)

            # Update the explainability object with context information
            # helped to improve the score
            result.analysis_explanation.set_supportive_context_word(
                supportive_context_word
            )
            result.analysis_explanation.set_improved_score(result.score)
    return results
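The score update in the method above can be shown in isolation: add the similarity factor, lift the result to the minimum-with-context floor, and cap it at the maximum score. The values below are the defaults documented for this enhancer (0.35 and 0.4), assuming a 1.0 ceiling for MAX_SCORE.

```python
# The score arithmetic from enhance_using_context, in isolation. The defaults
# documented for this enhancer (0.35 factor, 0.4 floor) are used, and a 1.0
# ceiling is assumed for ContextAwareEnhancer.MAX_SCORE.
def boost(score, factor=0.35, min_with_context=0.4, max_score=1.0):
    score += factor
    score = max(score, min_with_context)   # lift weak matches to the floor
    return min(score, max_score)           # never exceed the ceiling
```

The floor only applies once a supportive context word has been found, which is why calling boost on a very low score still yields at least 0.4.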

LocalRecognizer

Bases: ABC, EntityRecognizer

PII entity recognizer which runs on the same process as the AnalyzerEngine.

Source code in presidio_analyzer/local_recognizer.py
class LocalRecognizer(ABC, EntityRecognizer):
    """PII entity recognizer which runs on the same process as the AnalyzerEngine."""

Pattern

A class that represents a regex pattern.

Parameters:

Name Type Description Default
name str

the name of the pattern

required
regex str

the regex pattern to detect

required
score float

the pattern's strength (value varies between 0 and 1)

required
Source code in presidio_analyzer/pattern.py
class Pattern:
    """
    A class that represents a regex pattern.

    :param name: the name of the pattern
    :param regex: the regex pattern to detect
    :param score: the pattern's strength (value varies between 0 and 1)
    """

    def __init__(self, name: str, regex: str, score: float):

        self.name = name
        self.regex = regex
        self.score = score

    def to_dict(self) -> Dict:
        """
        Turn this instance into a dictionary.

        :return: a dictionary
        """
        return_dict = {"name": self.name, "score": self.score, "regex": self.regex}
        return return_dict

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> "Pattern":
        """
        Load an instance from a dictionary.

        :param pattern_dict: a dictionary holding the pattern's parameters
        :return: a Pattern instance
        """
        return cls(**pattern_dict)

    def __repr__(self):
        """Return string representation of instance."""
        return json.dumps(self.to_dict())

    def __str__(self):
        """Return string representation of instance."""
        return json.dumps(self.to_dict())

__repr__()

Return string representation of instance.

Source code in presidio_analyzer/pattern.py
def __repr__(self):
    """Return string representation of instance."""
    return json.dumps(self.to_dict())

__str__()

Return string representation of instance.

Source code in presidio_analyzer/pattern.py
def __str__(self):
    """Return string representation of instance."""
    return json.dumps(self.to_dict())

from_dict(pattern_dict) classmethod

Load an instance from a dictionary.

Parameters:

Name Type Description Default
pattern_dict Dict

a dictionary holding the pattern's parameters

required

Returns:

Type Description
Pattern

a Pattern instance

Source code in presidio_analyzer/pattern.py
@classmethod
def from_dict(cls, pattern_dict: Dict) -> "Pattern":
    """
    Load an instance from a dictionary.

    :param pattern_dict: a dictionary holding the pattern's parameters
    :return: a Pattern instance
    """
    return cls(**pattern_dict)

to_dict()

Turn this instance into a dictionary.

Returns:

Type Description
Dict

a dictionary

Source code in presidio_analyzer/pattern.py
def to_dict(self) -> Dict:
    """
    Turn this instance into a dictionary.

    :return: a dictionary
    """
    return_dict = {"name": self.name, "score": self.score, "regex": self.regex}
    return return_dict

PatternRecognizer

Bases: LocalRecognizer

PII entity recognizer using regular expressions or deny-lists.

Parameters:

Name Type Description Default
patterns List[Pattern]

A list of patterns to detect

None
deny_list List[str]

A list of words to detect, in case our recognizer uses a predefined list of words (deny list)

None
context List[str]

list of context words

None
deny_list_score float

confidence score for a term identified using a deny-list

1.0
global_regex_flags Optional[int]

regex flags to be used in regex matching, including deny-lists.

DOTALL | MULTILINE | IGNORECASE
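The deny-list handling can be illustrated with plain re, mirroring the regex construction used by the recognizer: each term is escaped, the terms are joined with "|", and the group is wrapped in non-word boundaries. The deny-list terms and sample sentence are illustrative.

```python
import re

# Sketch of how a deny list becomes a single pattern: each term is escaped,
# the terms are joined with "|", and the group is wrapped in non-word
# boundaries (matching the regex construction shown in the source).
deny_list = ["visa", "mastercard"]
escaped = [re.escape(term) for term in deny_list]
regex = r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"

# DOTALL | MULTILINE | IGNORECASE mirrors the default global_regex_flags
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE
match = re.search(regex, "Paid with my Visa card", flags)
```

Because of the IGNORECASE default, "Visa" matches the deny-list term "visa"; the lookarounds prevent matches inside longer words such as "visage".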
Source code in presidio_analyzer/pattern_recognizer.py
class PatternRecognizer(LocalRecognizer):
    """
    PII entity recognizer using regular expressions or deny-lists.

    :param patterns: A list of patterns to detect
    :param deny_list: A list of words to detect,
    in case our recognizer uses a predefined list of words (deny list)
    :param context: list of context words
    :param deny_list_score: confidence score for a term
    identified using a deny-list
    :param global_regex_flags: regex flags to be used in regex matching,
    including deny-lists.
    """

    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        version: str = "0.0.1",
    ):
        if not supported_entity:
            raise ValueError("Pattern recognizer should be initialized with entity")

        if not patterns and not deny_list:
            raise ValueError(
                "Pattern recognizer should be initialized with patterns"
                " or with deny list"
            )

        super().__init__(
            supported_entities=[supported_entity],
            supported_language=supported_language,
            name=name,
            version=version,
        )
        if patterns is None:
            self.patterns = []
        else:
            self.patterns = patterns
        self.context = context
        self.deny_list_score = deny_list_score
        self.global_regex_flags = global_regex_flags

        if deny_list:
            deny_list_pattern = self._deny_list_to_regex(deny_list)
            self.patterns.append(deny_list_pattern)
            self.deny_list = deny_list
        else:
            self.deny_list = []

    def load(self):  # noqa D102
        pass

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[NlpArtifacts] = None,
        regex_flags: Optional[int] = None,
    ) -> List[RecognizerResult]:
        """
        Analyzes text to detect PII using regular expressions or deny-lists.

        :param text: Text to be analyzed
        :param entities: Entities this recognizer can detect
        :param nlp_artifacts: Output values from the NLP engine
        :param regex_flags: regex flags to be used in regex matching
        :return:
        """
        results = []

        if self.patterns:
            pattern_result = self.__analyze_patterns(text, regex_flags)
            results.extend(pattern_result)

        return results

    def _deny_list_to_regex(self, deny_list: List[str]) -> Pattern:
        """
        Convert a list of words to a matching regex.

        To be analyzed by the analyze method as any other regex patterns.

        :param deny_list: the list of words to detect
        :return:the regex of the words for detection
        """

        # Escape deny list elements as preparation for regex
        escaped_deny_list = [re.escape(element) for element in deny_list]
        regex = r"(?:^|(?<=\W))(" + "|".join(escaped_deny_list) + r")(?:(?=\W)|$)"
        return Pattern(name="deny_list", regex=regex, score=self.deny_list_score)

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to be validated.
        Only the part of the text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        return None

    def invalidate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Logic to check for result invalidation by running pruning logic.

        For example, each SSN number group should not consist of all the same digits.

        :param pattern_text: the text to be validated.
        Only the part of the text that was detected by the regex engine
        :return: A bool indicating whether the result is invalidated
        """
        return None

    @staticmethod
    def build_regex_explanation(
        recognizer_name: str,
        pattern_name: str,
        pattern: str,
        original_score: float,
        validation_result: bool,
        regex_flags: int,
    ) -> AnalysisExplanation:
        """
        Construct an explanation for why this entity was detected.

        :param recognizer_name: Name of recognizer detecting the entity
        :param pattern_name: Regex pattern name which detected the entity
        :param pattern: Regex pattern logic
        :param original_score: Score given by the recognizer
        :param validation_result: Whether validation was used and its result
        :param regex_flags: Regex flags used in the regex matching
        :return: Analysis explanation
        """
        explanation = AnalysisExplanation(
            recognizer=recognizer_name,
            original_score=original_score,
            pattern_name=pattern_name,
            pattern=pattern,
            validation_result=validation_result,
            regex_flags=regex_flags,
        )
        return explanation

    def __analyze_patterns(
        self, text: str, flags: int = None
    ) -> List[RecognizerResult]:
        """
        Evaluate all patterns in the provided text.

        Including words in the provided deny-list

        :param text: text to analyze
        :param flags: regex flags
        :return: A list of RecognizerResult
        """
        flags = flags if flags else self.global_regex_flags
        results = []
        for pattern in self.patterns:
            match_start_time = datetime.datetime.now()
            matches = re.finditer(pattern.regex, text, flags=flags)
            match_time = datetime.datetime.now() - match_start_time
            logger.debug(
                "--- match_time[%s]: %s.%s seconds",
                pattern.name,
                match_time.seconds,
                match_time.microseconds,
            )

            for match in matches:
                start, end = match.span()
                current_match = text[start:end]

                # Skip empty results
                if current_match == "":
                    continue

                score = pattern.score

                validation_result = self.validate_result(current_match)
                description = self.build_regex_explanation(
                    self.name,
                    pattern.name,
                    pattern.regex,
                    score,
                    validation_result,
                    flags,
                )
                pattern_result = RecognizerResult(
                    entity_type=self.supported_entities[0],
                    start=start,
                    end=end,
                    score=score,
                    analysis_explanation=description,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                        RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
                    },
                )

                if validation_result is not None:
                    if validation_result:
                        pattern_result.score = EntityRecognizer.MAX_SCORE
                    else:
                        pattern_result.score = EntityRecognizer.MIN_SCORE

                invalidation_result = self.invalidate_result(current_match)
                if invalidation_result is not None and invalidation_result:
                    pattern_result.score = EntityRecognizer.MIN_SCORE

                if pattern_result.score > EntityRecognizer.MIN_SCORE:
                    results.append(pattern_result)

                # Update analysis explanation score following validation or invalidation
                description.score = pattern_result.score

        results = EntityRecognizer.remove_duplicates(results)
        return results

    def to_dict(self) -> Dict:
        """Serialize instance into a dictionary."""
        return_dict = super().to_dict()

        return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
        return_dict["deny_list"] = self.deny_list
        return_dict["context"] = self.context
        return_dict["supported_entity"] = return_dict["supported_entities"][0]
        del return_dict["supported_entities"]

        return return_dict

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
        """Create instance from a serialized dict."""
        patterns = entity_recognizer_dict.get("patterns")
        if patterns:
            patterns_list = [Pattern.from_dict(pat) for pat in patterns]
            entity_recognizer_dict["patterns"] = patterns_list

        return cls(**entity_recognizer_dict)
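The deny-list is compiled into a single regex alternation wrapped in word-boundary lookarounds, so terms only match as whole words. A standalone sketch of the same construction using only the `re` module (the helper name here is illustrative, not part of the presidio API):

```python
import re

def deny_list_to_regex(deny_list):
    """Mirror the construction in _deny_list_to_regex: escape each term
    and anchor the alternation so only whole-word occurrences match."""
    escaped = [re.escape(term) for term in deny_list]
    return r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"

regex = deny_list_to_regex(["Mr.", "Mrs."])
matches = [m.group() for m in re.finditer(regex, "Mr. Smith met Mrs. Jones")]
# matches -> ["Mr.", "Mrs."]
```

Because of the lookarounds, a term such as `"cat"` would not match inside `"concatenate"`, while `re.escape` keeps literal dots in `"Mr."` from acting as wildcards.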

__analyze_patterns(text, flags=None)

Evaluate all patterns in the provided text.

Including words in the provided deny-list

Parameters:

Name Type Description Default
text str

text to analyze

required
flags int

regex flags

None

Returns:

Type Description
List[RecognizerResult]

A list of RecognizerResult

Source code in presidio_analyzer/pattern_recognizer.py
def __analyze_patterns(
    self, text: str, flags: int = None
) -> List[RecognizerResult]:
    """
    Evaluate all patterns in the provided text.

    Including words in the provided deny-list

    :param text: text to analyze
    :param flags: regex flags
    :return: A list of RecognizerResult
    """
    flags = flags if flags else self.global_regex_flags
    results = []
    for pattern in self.patterns:
        match_start_time = datetime.datetime.now()
        matches = re.finditer(pattern.regex, text, flags=flags)
        match_time = datetime.datetime.now() - match_start_time
        logger.debug(
            "--- match_time[%s]: %s.%s seconds",
            pattern.name,
            match_time.seconds,
            match_time.microseconds,
        )

        for match in matches:
            start, end = match.span()
            current_match = text[start:end]

            # Skip empty results
            if current_match == "":
                continue

            score = pattern.score

            validation_result = self.validate_result(current_match)
            description = self.build_regex_explanation(
                self.name,
                pattern.name,
                pattern.regex,
                score,
                validation_result,
                flags,
            )
            pattern_result = RecognizerResult(
                entity_type=self.supported_entities[0],
                start=start,
                end=end,
                score=score,
                analysis_explanation=description,
                recognition_metadata={
                    RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
                },
            )

            if validation_result is not None:
                if validation_result:
                    pattern_result.score = EntityRecognizer.MAX_SCORE
                else:
                    pattern_result.score = EntityRecognizer.MIN_SCORE

            invalidation_result = self.invalidate_result(current_match)
            if invalidation_result is not None and invalidation_result:
                pattern_result.score = EntityRecognizer.MIN_SCORE

            if pattern_result.score > EntityRecognizer.MIN_SCORE:
                results.append(pattern_result)

            # Update analysis explanation score following validation or invalidation
            description.score = pattern_result.score

    results = EntityRecognizer.remove_duplicates(results)
    return results

analyze(text, entities, nlp_artifacts=None, regex_flags=None)

Analyzes text to detect PII using regular expressions or deny-lists.

Parameters:

Name Type Description Default
text str

Text to be analyzed

required
entities List[str]

Entities this recognizer can detect

required
nlp_artifacts Optional[NlpArtifacts]

Output values from the NLP engine

None
regex_flags Optional[int]

regex flags to be used in regex matching

None

Returns:

Type Description
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of RecognizerResult
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

build_regex_explanation(recognizer_name, pattern_name, pattern, original_score, validation_result, regex_flags) staticmethod

Construct an explanation for why this entity was detected.

Parameters:

Name Type Description Default
recognizer_name str

Name of recognizer detecting the entity

required
pattern_name str

Regex pattern name which detected the entity

required
pattern str

Regex pattern logic

required
original_score float

Score given by the recognizer

required
validation_result bool

Whether validation was used and its result

required
regex_flags int

Regex flags used in the regex matching

required

Returns:

Type Description
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
    )
    return explanation

from_dict(entity_recognizer_dict) classmethod

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
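`from_dict` accepts the same shape that `to_dict` produces — note the singular `supported_entity` key rather than `supported_entities`, and that `patterns` entries are converted back into `Pattern` objects before the constructor runs. A hypothetical minimal dict, with illustrative values:

```python
# Hypothetical serialized form accepted by PatternRecognizer.from_dict;
# the keys mirror what to_dict emits.
recognizer_dict = {
    "supported_entity": "TITLE",
    "supported_language": "en",
    "name": "Titles Recognizer",
    "deny_list": ["Mr.", "Mrs."],
    "patterns": [],
    "context": None,
}
# PatternRecognizer.from_dict(recognizer_dict) would rebuild the recognizer;
# since deny_list is non-empty, the empty patterns list is allowed.
```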

invalidate_result(pattern_text)

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

Parameters:

Name Type Description Default
pattern_text str

the text to be validated; only the part of the text that was detected by the regex engine

required

Returns:

Type Description
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
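The `invalidate_result` and `validate_result` hooks feed into `__analyze_patterns`, which overrides the pattern score accordingly. A standalone sketch of just that decision logic, assuming `EntityRecognizer.MAX_SCORE` and `MIN_SCORE` are 1.0 and 0.0:

```python
MAX_SCORE, MIN_SCORE = 1.0, 0.0  # assumed values of EntityRecognizer.MAX_SCORE / MIN_SCORE

def final_score(pattern_score, validation_result, invalidation_result):
    """Combine the raw pattern score with the hook outputs, mirroring
    the score-override logic in __analyze_patterns."""
    score = pattern_score
    if validation_result is not None:
        # Successful validation pins the score to the maximum, failure to the minimum.
        score = MAX_SCORE if validation_result else MIN_SCORE
    if invalidation_result:
        # Invalidation always drops the result to the minimum score.
        score = MIN_SCORE
    return score

final_score(0.5, None, None)   # 0.5: no validation logic defined
final_score(0.5, True, None)   # 1.0: e.g. checksum passed
final_score(0.5, False, None)  # 0.0: checksum failed; result is filtered out
```

Results at `MIN_SCORE` are then excluded from the returned list, which is how a failed checksum suppresses a regex match.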

to_dict()

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

validate_result(pattern_text)

Validate the pattern logic e.g., by running checksum on a detected pattern.

Parameters:

Name Type Description Default
pattern_text str

the text to be validated; only the part of the text that was detected by the regex engine

required

Returns:

Type Description
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

PresidioAnalyzerUtils

Utility functions for Presidio Analyzer.

The class provides a bundle of utility functions that help centralize logic for reusability and maintainability.

Source code in presidio_analyzer/analyzer_utils.py
class PresidioAnalyzerUtils:
    """
    Utility functions for Presidio Analyzer.

    The class provides a bundle of utility functions that help centralize
    logic for reusability and maintainability.
    """

    @staticmethod
    def is_palindrome(text: str, case_insensitive: bool = False):
        """
        Validate if input text is a true palindrome.

        :param text: input text string to check for palindrome
        :param case_insensitive: optional flag to check palindrome with no case
        :return: True / False
        """
        palindrome_text = text
        if case_insensitive:
            palindrome_text = palindrome_text.replace(" ", "").lower()
        return palindrome_text == palindrome_text[::-1]

    @staticmethod
    def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
        """
        Cleanse the input string of the replacement pairs specified as argument.

        :param text: input string
        :param replacement_pairs: pairs of what has to be replaced with which value
        :return: cleansed string
        """
        for search_string, replacement_string in replacement_pairs:
            text = text.replace(search_string, replacement_string)
        return text

    @staticmethod
    def is_verhoeff_number(input_number: int):
        """
        Check whether the input number is a valid Verhoeff number.

        :param input_number:
        :return:
        """
        __d__ = [
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
            [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
            [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
            [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
            [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
            [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
            [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
            [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
            [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
            [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
        ]
        __p__ = [
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
            [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
            [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
            [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
            [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
            [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
            [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
            [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
        ]
        __inv__ = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

        c = 0
        inverted_number = list(map(int, reversed(str(input_number))))
        for i in range(len(inverted_number)):
            c = __d__[c][__p__[i % 8][inverted_number[i]]]
        return __inv__[c] == 0

is_palindrome(text, case_insensitive=False) staticmethod

Validate if input text is a true palindrome.

Parameters:

Name Type Description Default
text str

input text string to check for palindrome

required
case_insensitive bool

optional flag to check palindrome with no case

False

Returns:

Type Description

True / False

Source code in presidio_analyzer/analyzer_utils.py
@staticmethod
def is_palindrome(text: str, case_insensitive: bool = False):
    """
    Validate if input text is a true palindrome.

    :param text: input text string to check for palindrome
    :param case_insensitive: optional flag to check palindrome with no case
    :return: True / False
    """
    palindrome_text = text
    if case_insensitive:
        palindrome_text = palindrome_text.replace(" ", "").lower()
    return palindrome_text == palindrome_text[::-1]
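Note that with `case_insensitive=True` the helper also strips spaces before comparing, so phrase palindromes pass as well. A standalone mirror of the logic (the function name is reused here for illustration):

```python
def is_palindrome(text, case_insensitive=False):
    """Mirror of PresidioAnalyzerUtils.is_palindrome: when case_insensitive
    is set, spaces are removed and the text lowercased before comparing."""
    t = text
    if case_insensitive:
        t = t.replace(" ", "").lower()
    return t == t[::-1]

is_palindrome("Malayalam")                                  # False: 'M' != 'm'
is_palindrome("Malayalam", case_insensitive=True)           # True
is_palindrome("never odd or even", case_insensitive=True)   # True: spaces stripped
```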

is_verhoeff_number(input_number) staticmethod

Check whether the input number is a valid Verhoeff number.

Parameters:

Name Type Description Default
input_number int
required

Returns:

Type Description
Source code in presidio_analyzer/analyzer_utils.py
@staticmethod
def is_verhoeff_number(input_number: int):
    """
    Check whether the input number is a valid Verhoeff number.

    :param input_number:
    :return:
    """
    __d__ = [
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
        [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
        [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
        [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
        [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
        [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
        [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
        [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
        [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
    ]
    __p__ = [
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
        [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
        [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
        [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
        [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
        [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
        [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
    ]
    __inv__ = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

    c = 0
    inverted_number = list(map(int, reversed(str(input_number))))
    for i in range(len(inverted_number)):
        c = __d__[c][__p__[i % 8][inverted_number[i]]]
    return __inv__[c] == 0
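The Verhoeff scheme folds the reversed digits through the multiplication and permutation tables of the dihedral group D5; the number is valid when the final checksum maps to 0. A standalone sketch with the same tables as the method above:

```python
# Tables copied from PresidioAnalyzerUtils.is_verhoeff_number (dihedral group D5).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

def is_verhoeff_number(input_number):
    """Standalone mirror of the method above: fold the reversed digits
    through the D/P tables; the number is valid when the checksum is 0."""
    c = 0
    for i, digit in enumerate(map(int, reversed(str(input_number)))):
        c = D[c][P[i % 8][digit]]
    return INV[c] == 0

is_verhoeff_number(2363)  # True: 236 with Verhoeff check digit 3
is_verhoeff_number(2364)  # False: wrong check digit
```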

sanitize_value(text, replacement_pairs) staticmethod

Cleanse the input string of the replacement pairs specified as argument.

Parameters:

Name Type Description Default
text str

input string

required
replacement_pairs List[Tuple[str, str]]

pairs of what has to be replaced with which value

required

Returns:

Type Description
str

cleansed string

Source code in presidio_analyzer/analyzer_utils.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
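The replacement pairs are applied sequentially, left to right. A standalone mirror of the logic, with an illustrative use of stripping separators before pattern matching:

```python
def sanitize_value(text, replacement_pairs):
    """Mirror of PresidioAnalyzerUtils.sanitize_value: apply each
    (search, replacement) pair in order, left to right."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Strip dashes and spaces from a card-like string before pattern matching:
sanitize_value("4095-2609-9393-4932", [("-", ""), (" ", "")])
# -> "4095260993934932"
```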

RecognizerRegistry

Detect, register and hold all recognizers to be used by the analyzer.

Parameters:

Name Type Description Default
recognizers Optional[Iterable[EntityRecognizer]]

An optional list of recognizers, that will be available instead of the predefined recognizers

None
global_regex_flags

regex flags to be used in regex matching, including deny-lists

re.DOTALL | re.MULTILINE | re.IGNORECASE
Source code in presidio_analyzer/recognizer_registry.py
class RecognizerRegistry:
    """
    Detect, register and hold all recognizers to be used by the analyzer.

    :param recognizers: An optional list of recognizers,
    that will be available instead of the predefined recognizers
    :param global_regex_flags: regex flags to be used in regex matching,
    including deny-lists

    """

    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
    ):
        if recognizers:
            self.recognizers = recognizers
        else:
            self.recognizers = []
        self.global_regex_flags = global_regex_flags

    def load_predefined_recognizers(
        self, languages: Optional[List[str]] = None, nlp_engine: NlpEngine = None
    ) -> None:
        """
        Load the existing recognizers into memory.

        :param languages: List of languages for which to load recognizers
        :param nlp_engine: The NLP engine to use.
        :return: None
        """
        if not languages:
            languages = ["en"]

        nlp_recognizer = self._get_nlp_recognizer(nlp_engine)

        recognizers_map = {
            "en": [
                UsBankRecognizer,
                UsLicenseRecognizer,
                UsItinRecognizer,
                UsPassportRecognizer,
                UsSsnRecognizer,
                NhsRecognizer,
                SgFinRecognizer,
                AuAbnRecognizer,
                AuAcnRecognizer,
                AuTfnRecognizer,
                AuMedicareRecognizer,
                InPanRecognizer,
                InAadhaarRecognizer,
            ],
            "es": [EsNifRecognizer],
            "it": [
                ItDriverLicenseRecognizer,
                ItFiscalCodeRecognizer,
                ItVatCodeRecognizer,
                ItIdentityCardRecognizer,
                ItPassportRecognizer,
            ],
            "pl": [PlPeselRecognizer],
            "ALL": [
                CreditCardRecognizer,
                CryptoRecognizer,
                DateRecognizer,
                EmailRecognizer,
                IbanRecognizer,
                IpRecognizer,
                MedicalLicenseRecognizer,
                PhoneRecognizer,
                UrlRecognizer,
            ],
        }
        for lang in languages:
            lang_recognizers = [
                self.__instantiate_recognizer(
                    recognizer_class=rc, supported_language=lang
                )
                for rc in recognizers_map.get(lang, [])
            ]
            self.recognizers.extend(lang_recognizers)
            all_recognizers = [
                self.__instantiate_recognizer(
                    recognizer_class=rc, supported_language=lang
                )
                for rc in recognizers_map.get("ALL", [])
            ]
            self.recognizers.extend(all_recognizers)
            if nlp_engine:
                nlp_recognizer_inst = nlp_recognizer(
                    supported_language=lang,
                    supported_entities=nlp_engine.get_supported_entities(),
                )
            else:
                nlp_recognizer_inst = nlp_recognizer(supported_language=lang)
            self.recognizers.append(nlp_recognizer_inst)

    @staticmethod
    def _get_nlp_recognizer(
        nlp_engine: NlpEngine,
    ) -> Type[SpacyRecognizer]:
        """Return the recognizer leveraging the selected NLP Engine."""

        if isinstance(nlp_engine, StanzaNlpEngine):
            return StanzaRecognizer
        if isinstance(nlp_engine, TransformersNlpEngine):
            return TransformersRecognizer
        if not nlp_engine or isinstance(nlp_engine, SpacyNlpEngine):
            return SpacyRecognizer
        else:
            logger.warning(
                "nlp engine should be either SpacyNlpEngine,"
                "StanzaNlpEngine or TransformersNlpEngine"
            )
            # Returning default
            return SpacyRecognizer

    def get_recognizers(
        self,
        language: str,
        entities: Optional[List[str]] = None,
        all_fields: bool = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    ) -> List[EntityRecognizer]:
        """
        Return a list of recognizers which supports the specified name and language.

        :param entities: the requested entities
        :param language: the requested language
        :param all_fields: a flag to return all fields of a requested language.
        :param ad_hoc_recognizers: Additional recognizers provided by the user
        as part of the request
        :return: A list of the recognizers which supports the supplied entities
        and language
        """
        if language is None:
            raise ValueError("No language provided")

        if entities is None and all_fields is False:
            raise ValueError("No entities provided")

        all_possible_recognizers = copy.copy(self.recognizers)
        if ad_hoc_recognizers:
            all_possible_recognizers.extend(ad_hoc_recognizers)

        # filter out unwanted recognizers
        to_return = set()
        if all_fields:
            to_return = [
                rec
                for rec in all_possible_recognizers
                if language == rec.supported_language
            ]
        else:
            for entity in entities:
                subset = [
                    rec
                    for rec in all_possible_recognizers
                    if entity in rec.supported_entities
                    and language == rec.supported_language
                ]

                if not subset:
                    logger.warning(
                        "Entity %s doesn't have the corresponding"
                        " recognizer in language : %s",
                        entity,
                        language,
                    )
                else:
                    to_return.update(set(subset))

        logger.debug(
            "Returning a total of %s recognizers",
            str(len(to_return)),
        )

        if not to_return:
            raise ValueError("No matching recognizers were found to serve the request.")

        return list(to_return)

    def add_recognizer(self, recognizer: EntityRecognizer) -> None:
        """
        Add a new recognizer to the list of recognizers.

        :param recognizer: Recognizer to add
        """
        if not isinstance(recognizer, EntityRecognizer):
            raise ValueError("Input is not of type EntityRecognizer")

        self.recognizers.append(recognizer)

    def remove_recognizer(self, recognizer_name: str) -> None:
        """
        Remove a recognizer based on its name.

        :param recognizer_name: Name of recognizer to remove
        """
        new_recognizers = [
            rec for rec in self.recognizers if rec.name != recognizer_name
        ]
        logger.info(
            "Removed %s recognizers which had the name %s",
            str(len(self.recognizers) - len(new_recognizers)),
            recognizer_name,
        )
        self.recognizers = new_recognizers

    def add_pattern_recognizer_from_dict(self, recognizer_dict: Dict) -> None:
        """
        Load a pattern recognizer from a Dict into the recognizer registry.

        :param recognizer_dict: Dict holding a serialization of a PatternRecognizer

        :example:
        >>> registry = RecognizerRegistry()
        >>> recognizer = { "name": "Titles Recognizer", "supported_language": "de","supported_entity": "TITLE", "deny_list": ["Mr.","Mrs."]} # noqa: E501
        >>> registry.add_pattern_recognizer_from_dict(recognizer)
        """

        recognizer = PatternRecognizer.from_dict(recognizer_dict)
        self.add_recognizer(recognizer)

    def add_recognizers_from_yaml(self, yml_path: Union[str, Path]) -> None:
        r"""
        Read YAML file and load recognizers into the recognizer registry.

        See example yaml file here:
        https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/example_recognizers.yaml

        :example:
        >>> yaml_file = "recognizers.yaml"
        >>> registry = RecognizerRegistry()
        >>> registry.add_recognizers_from_yaml(yaml_file)

        """

        try:
            with open(yml_path, "r") as stream:
                yaml_recognizers = yaml.safe_load(stream)

            for yaml_recognizer in yaml_recognizers["recognizers"]:
                self.add_pattern_recognizer_from_dict(yaml_recognizer)
        except IOError as io_error:
            print(f"Error reading file {yml_path}")
            raise io_error
        except yaml.YAMLError as yaml_error:
            print(f"Failed to parse file {yml_path}")
            raise yaml_error
        except TypeError as yaml_error:
            print(f"Failed to parse file {yml_path}")
            raise yaml_error

    def __instantiate_recognizer(
        self, recognizer_class: Type[EntityRecognizer], supported_language: str
    ):
        """
        Instantiate a recognizer class given type and input.

        :param recognizer_class: Class object of the recognizer
        :param supported_language: Language this recognizer should support
        """

        inst = recognizer_class(supported_language=supported_language)
        if isinstance(inst, PatternRecognizer):
            inst.global_regex_flags = self.global_regex_flags
        return inst

    def _get_supported_languages(self) -> List[str]:
        languages = []
        for rec in self.recognizers:
            languages.append(rec.supported_language)

        return list(set(languages))

    def get_supported_entities(
        self, languages: Optional[List[str]] = None
    ) -> List[str]:
        """
        Return the supported entities by the set of recognizers loaded.

        :param languages: The languages to get the supported entities for.
        If languages=None, returns all entities for all languages.
        """
        if not languages:
            languages = self._get_supported_languages()

        supported_entities = []
        for language in languages:
            recognizers = self.get_recognizers(language=language, all_fields=True)

            for recognizer in recognizers:
                supported_entities.extend(recognizer.get_supported_entities())

        return list(set(supported_entities))

__instantiate_recognizer(recognizer_class, supported_language)

Instantiate a recognizer class given type and input.

Parameters:

Name Type Description Default
recognizer_class Type[EntityRecognizer]

Class object of the recognizer

required
supported_language str

Language this recognizer should support

required
Source code in presidio_analyzer/recognizer_registry.py
def __instantiate_recognizer(
    self, recognizer_class: Type[EntityRecognizer], supported_language: str
):
    """
    Instantiate a recognizer class given type and input.

    :param recognizer_class: Class object of the recognizer
    :param supported_language: Language this recognizer should support
    """

    inst = recognizer_class(supported_language=supported_language)
    if isinstance(inst, PatternRecognizer):
        inst.global_regex_flags = self.global_regex_flags
    return inst

add_pattern_recognizer_from_dict(recognizer_dict)

Load a pattern recognizer from a Dict into the recognizer registry.

:example:

>>> registry = RecognizerRegistry()
>>> recognizer = {"name": "Titles Recognizer", "supported_language": "de", "supported_entity": "TITLE", "deny_list": ["Mr.", "Mrs."]}
>>> registry.add_pattern_recognizer_from_dict(recognizer)

Parameters:

Name Type Description Default
recognizer_dict Dict

Dict holding a serialization of a PatternRecognizer

required
Source code in presidio_analyzer/recognizer_registry.py
def add_pattern_recognizer_from_dict(self, recognizer_dict: Dict) -> None:
    """
    Load a pattern recognizer from a Dict into the recognizer registry.

    :param recognizer_dict: Dict holding a serialization of a PatternRecognizer

    :example:
    >>> registry = RecognizerRegistry()
    >>> recognizer = { "name": "Titles Recognizer", "supported_language": "de","supported_entity": "TITLE", "deny_list": ["Mr.","Mrs."]} # noqa: E501
    >>> registry.add_pattern_recognizer_from_dict(recognizer)
    """

    recognizer = PatternRecognizer.from_dict(recognizer_dict)
    self.add_recognizer(recognizer)

add_recognizer(recognizer)

Add a new recognizer to the list of recognizers.

Parameters:

Name Type Description Default
recognizer EntityRecognizer

Recognizer to add

required
Source code in presidio_analyzer/recognizer_registry.py
def add_recognizer(self, recognizer: EntityRecognizer) -> None:
    """
    Add a new recognizer to the list of recognizers.

    :param recognizer: Recognizer to add
    """
    if not isinstance(recognizer, EntityRecognizer):
        raise ValueError("Input is not of type EntityRecognizer")

    self.recognizers.append(recognizer)

add_recognizers_from_yaml(yml_path)

Read YAML file and load recognizers into the recognizer registry.

See example yaml file here: https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/example_recognizers.yaml

:example:

>>> yaml_file = "recognizers.yaml"
>>> registry = RecognizerRegistry()
>>> registry.add_recognizers_from_yaml(yaml_file)

Source code in presidio_analyzer/recognizer_registry.py
def add_recognizers_from_yaml(self, yml_path: Union[str, Path]) -> None:
    r"""
    Read YAML file and load recognizers into the recognizer registry.

    See example yaml file here:
    https://github.com/microsoft/presidio/blob/main/presidio-analyzer/conf/example_recognizers.yaml

    :example:
    >>> yaml_file = "recognizers.yaml"
    >>> registry = RecognizerRegistry()
    >>> registry.add_recognizers_from_yaml(yaml_file)

    """

    try:
        with open(yml_path, "r") as stream:
            yaml_recognizers = yaml.safe_load(stream)

        for yaml_recognizer in yaml_recognizers["recognizers"]:
            self.add_pattern_recognizer_from_dict(yaml_recognizer)
    except IOError as io_error:
        print(f"Error reading file {yml_path}")
        raise io_error
    except yaml.YAMLError as yaml_error:
        print(f"Failed to parse file {yml_path}")
        raise yaml_error
    except TypeError as yaml_error:
        print(f"Failed to parse file {yml_path}")
        raise yaml_error
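For reference, a minimal recognizers file follows the format of the example file linked above. This sketch is illustrative: the recognizer names, entities, patterns, and scores below are assumptions, not part of Presidio's defaults.

```yaml
recognizers:
  - name: "Titles Recognizer"
    supported_language: "de"
    supported_entity: "TITLE"
    deny_list:
      - "Mr."
      - "Mrs."
  - name: "Zip code Recognizer"
    supported_language: "en"
    supported_entity: "ZIP"
    patterns:
      - name: "zip code (weak)"
        regex: "(\\b\\d{5}(?:\\-\\d{4})?\\b)"
        score: 0.01
    context:
      - zip
      - code
```

Each entry is passed to `add_pattern_recognizer_from_dict`, so the keys mirror `PatternRecognizer.from_dict`'s expected fields.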

get_recognizers(language, entities=None, all_fields=False, ad_hoc_recognizers=None)

Return a list of recognizers which support the specified entities and language.

Parameters:

Name Type Description Default
entities Optional[List[str]]

the requested entities

None
language str

the requested language

required
all_fields bool

a flag to return all fields of a requested language.

False
ad_hoc_recognizers Optional[List[EntityRecognizer]]

Additional recognizers provided by the user as part of the request

None

Returns:

Type Description
List[EntityRecognizer]

A list of the recognizers which support the supplied entities and language

Source code in presidio_analyzer/recognizer_registry.py
def get_recognizers(
    self,
    language: str,
    entities: Optional[List[str]] = None,
    all_fields: bool = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
) -> List[EntityRecognizer]:
    """
    Return a list of recognizers which support the specified entities and language.

    :param entities: the requested entities
    :param language: the requested language
    :param all_fields: a flag to return all fields of a requested language.
    :param ad_hoc_recognizers: Additional recognizers provided by the user
    as part of the request
    :return: A list of the recognizers which support the supplied entities
    and language
    """
    if language is None:
        raise ValueError("No language provided")

    if entities is None and all_fields is False:
        raise ValueError("No entities provided")

    all_possible_recognizers = copy.copy(self.recognizers)
    if ad_hoc_recognizers:
        all_possible_recognizers.extend(ad_hoc_recognizers)

    # filter out unwanted recognizers
    to_return = set()
    if all_fields:
        to_return = [
            rec
            for rec in all_possible_recognizers
            if language == rec.supported_language
        ]
    else:
        for entity in entities:
            subset = [
                rec
                for rec in all_possible_recognizers
                if entity in rec.supported_entities
                and language == rec.supported_language
            ]

            if not subset:
                logger.warning(
                    "Entity %s doesn't have the corresponding"
                    " recognizer in language : %s",
                    entity,
                    language,
                )
            else:
                to_return.update(set(subset))

    logger.debug(
        "Returning a total of %s recognizers",
        str(len(to_return)),
    )

    if not to_return:
        raise ValueError("No matching recognizers were found to serve the request.")

    return list(to_return)
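The filtering above can be sketched on its own with stdlib stand-ins; the `Rec` class and `filter_recognizers` helper below are hypothetical minimal substitutes, not part of the Presidio API.

```python
class Rec:
    """Hypothetical minimal stand-in for an EntityRecognizer."""
    def __init__(self, supported_language, supported_entities):
        self.supported_language = supported_language
        self.supported_entities = supported_entities

def filter_recognizers(recognizers, language, entities=None, all_fields=False):
    """Mimic get_recognizers' filtering rules."""
    if all_fields:
        # all_fields: every recognizer of the requested language qualifies
        return [r for r in recognizers if r.supported_language == language]
    selected = set()
    for entity in entities or []:
        # a recognizer qualifies if it supports both the entity and the language
        selected.update(
            r for r in recognizers
            if entity in r.supported_entities
            and r.supported_language == language
        )
    return list(selected)

email_en = Rec("en", ["EMAIL_ADDRESS"])
nif_es = Rec("es", ["ES_NIF"])
```

Note that, as in the source, entity-based selection deduplicates via a set, so one recognizer supporting several requested entities is returned once.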

get_supported_entities(languages=None)

Return the supported entities by the set of recognizers loaded.

Parameters:

Name Type Description Default
languages Optional[List[str]]

The languages to get the supported entities for. If languages=None, returns all entities for all languages.

None
Source code in presidio_analyzer/recognizer_registry.py
def get_supported_entities(
    self, languages: Optional[List[str]] = None
) -> List[str]:
    """
    Return the supported entities by the set of recognizers loaded.

    :param languages: The languages to get the supported entities for.
    If languages=None, returns all entities for all languages.
    """
    if not languages:
        languages = self._get_supported_languages()

    supported_entities = []
    for language in languages:
        recognizers = self.get_recognizers(language=language, all_fields=True)

        for recognizer in recognizers:
            supported_entities.extend(recognizer.get_supported_entities())

    return list(set(supported_entities))

load_predefined_recognizers(languages=None, nlp_engine=None)

Load the existing recognizers into memory.

Parameters:

Name Type Description Default
languages Optional[List[str]]

List of languages for which to load recognizers

None
nlp_engine NlpEngine

The NLP engine to use.

None

Returns:

Type Description
None

None

Source code in presidio_analyzer/recognizer_registry.py
def load_predefined_recognizers(
    self, languages: Optional[List[str]] = None, nlp_engine: NlpEngine = None
) -> None:
    """
    Load the existing recognizers into memory.

    :param languages: List of languages for which to load recognizers
    :param nlp_engine: The NLP engine to use.
    :return: None
    """
    if not languages:
        languages = ["en"]

    nlp_recognizer = self._get_nlp_recognizer(nlp_engine)

    recognizers_map = {
        "en": [
            UsBankRecognizer,
            UsLicenseRecognizer,
            UsItinRecognizer,
            UsPassportRecognizer,
            UsSsnRecognizer,
            NhsRecognizer,
            SgFinRecognizer,
            AuAbnRecognizer,
            AuAcnRecognizer,
            AuTfnRecognizer,
            AuMedicareRecognizer,
            InPanRecognizer,
            InAadhaarRecognizer,
        ],
        "es": [EsNifRecognizer],
        "it": [
            ItDriverLicenseRecognizer,
            ItFiscalCodeRecognizer,
            ItVatCodeRecognizer,
            ItIdentityCardRecognizer,
            ItPassportRecognizer,
        ],
        "pl": [PlPeselRecognizer],
        "ALL": [
            CreditCardRecognizer,
            CryptoRecognizer,
            DateRecognizer,
            EmailRecognizer,
            IbanRecognizer,
            IpRecognizer,
            MedicalLicenseRecognizer,
            PhoneRecognizer,
            UrlRecognizer,
        ],
    }
    for lang in languages:
        lang_recognizers = [
            self.__instantiate_recognizer(
                recognizer_class=rc, supported_language=lang
            )
            for rc in recognizers_map.get(lang, [])
        ]
        self.recognizers.extend(lang_recognizers)
        all_recognizers = [
            self.__instantiate_recognizer(
                recognizer_class=rc, supported_language=lang
            )
            for rc in recognizers_map.get("ALL", [])
        ]
        self.recognizers.extend(all_recognizers)
        if nlp_engine:
            nlp_recognizer_inst = nlp_recognizer(
                supported_language=lang,
                supported_entities=nlp_engine.get_supported_entities(),
            )
        else:
            nlp_recognizer_inst = nlp_recognizer(supported_language=lang)
        self.recognizers.append(nlp_recognizer_inst)
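The loop above pairs each requested language with its language-specific recognizer classes plus the language-agnostic "ALL" group. A minimal sketch of that expansion, with class names replaced by plain strings (the helper name and demo map are hypothetical):

```python
def expand_recognizers(recognizers_map, languages):
    """Pair every requested language with its language-specific
    recognizers plus the language-agnostic "ALL" group, mirroring
    load_predefined_recognizers' instantiation loop."""
    pairs = []
    for lang in languages:
        for rc in recognizers_map.get(lang, []) + recognizers_map.get("ALL", []):
            pairs.append((rc, lang))
    return pairs

# strings standing in for recognizer classes
demo_map = {"en": ["UsSsn"], "es": ["EsNif"], "ALL": ["Email", "Phone"]}
```

The "ALL" recognizers are instantiated once per requested language, which is why they appear in every language's pair list.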

remove_recognizer(recognizer_name)

Remove a recognizer based on its name.

Parameters:

Name Type Description Default
recognizer_name str

Name of recognizer to remove

required
Source code in presidio_analyzer/recognizer_registry.py
def remove_recognizer(self, recognizer_name: str) -> None:
    """
    Remove a recognizer based on its name.

    :param recognizer_name: Name of recognizer to remove
    """
    new_recognizers = [
        rec for rec in self.recognizers if rec.name != recognizer_name
    ]
    logger.info(
        "Removed %s recognizers which had the name %s",
        str(len(self.recognizers) - len(new_recognizers)),
        recognizer_name,
    )
    self.recognizers = new_recognizers

RecognizerResult

Recognizer Result represents the findings of the detected entity.

Result of a recognizer analyzing the text.

Parameters:

Name Type Description Default
entity_type str

the type of the entity

required
start int

the start location of the detected entity

required
end int

the end location of the detected entity

required
score float

the score of the detection

required
analysis_explanation AnalysisExplanation

contains the explanation of why this entity was identified

None
recognition_metadata Dict

a dictionary of metadata to be used in recognizer-specific cases, for example specific recognized context words and recognizer name

None
Source code in presidio_analyzer/recognizer_result.py
class RecognizerResult:
    """
    Recognizer Result represents the findings of the detected entity.

    Result of a recognizer analyzing the text.

    :param entity_type: the type of the entity
    :param start: the start location of the detected entity
    :param end: the end location of the detected entity
    :param score: the score of the detection
    :param analysis_explanation: contains the explanation of why this
                                 entity was identified
    :param recognition_metadata: a dictionary of metadata to be used in
    recognizer specific cases, for example specific recognized context words
    and recognizer name
    """

    # Keys for recognizer metadata
    RECOGNIZER_NAME_KEY = "recognizer_name"
    RECOGNIZER_IDENTIFIER_KEY = "recognizer_identifier"

    # Key of a flag inside recognition_metadata dictionary
    # which is set to true if the result was enhanced by context
    IS_SCORE_ENHANCED_BY_CONTEXT_KEY = "is_score_enhanced_by_context"

    logger = logging.getLogger("presidio-analyzer")

    def __init__(
        self,
        entity_type: str,
        start: int,
        end: int,
        score: float,
        analysis_explanation: AnalysisExplanation = None,
        recognition_metadata: Dict = None,
    ):

        self.entity_type = entity_type
        self.start = start
        self.end = end
        self.score = score
        self.analysis_explanation = analysis_explanation

        if not recognition_metadata:
            self.logger.debug(
                "recognition_metadata should be passed, "
                "containing a recognizer_name value"
            )

        self.recognition_metadata = recognition_metadata

    def append_analysis_explanation_text(self, text: str) -> None:
        """Add text to the analysis explanation."""
        if self.analysis_explanation:
            self.analysis_explanation.append_textual_explanation_line(text)

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return self.__dict__

    @classmethod
    def from_json(cls, data: Dict) -> "RecognizerResult":
        """
        Create RecognizerResult from json.

        :param data: e.g. {
            "start": 24,
            "end": 32,
            "score": 0.8,
            "entity_type": "NAME"
        }
        :return: RecognizerResult
        """
        score = data.get("score")
        entity_type = data.get("entity_type")
        start = data.get("start")
        end = data.get("end")
        return cls(entity_type, start, end, score)

    def __repr__(self) -> str:
        """Return a string representation of the instance."""
        return self.__str__()

    def intersects(self, other: "RecognizerResult") -> int:
        """
        Check if self intersects with a different RecognizerResult.

        :return: If intersecting, returns the number of
        intersecting characters.
        If not, returns 0
        """
        # if they do not overlap the intersection is 0
        if self.end < other.start or other.end < self.start:
            return 0

        # otherwise the intersection is min(end) - max(start)
        return min(self.end, other.end) - max(self.start, other.start)

    def contained_in(self, other: "RecognizerResult") -> bool:
        """
        Check if self is contained in a different RecognizerResult.

        :return: true if contained
        """
        return self.start >= other.start and self.end <= other.end

    def contains(self, other: "RecognizerResult") -> bool:
        """
        Check if one result contains or is equal to another result.

        :param other: another RecognizerResult
        :return: bool
        """
        return self.start <= other.start and self.end >= other.end

    def equal_indices(self, other: "RecognizerResult") -> bool:
        """
        Check if the indices are equal between two results.

        :param other: another RecognizerResult
        :return:
        """
        return self.start == other.start and self.end == other.end

    def __gt__(self, other: "RecognizerResult") -> bool:
        """
        Check if one result is greater by using the results indices in the text.

        :param other: another RecognizerResult
        :return: bool
        """
        if self.start == other.start:
            return self.end > other.end
        return self.start > other.start

    def __eq__(self, other: "RecognizerResult") -> bool:
        """
        Check two results are equal by using all class fields.

        :param other: another RecognizerResult
        :return: bool
        """
        equal_type = self.entity_type == other.entity_type
        equal_score = self.score == other.score
        return self.equal_indices(other) and equal_type and equal_score

    def __hash__(self):
        """
        Hash the result data by using all class fields.

        :return: int
        """
        return hash(
            f"{str(self.start)} {str(self.end)} {str(self.score)} {self.entity_type}"
        )

    def __str__(self) -> str:
        """Return a string representation of the instance."""
        return (
            f"type: {self.entity_type}, "
            f"start: {self.start}, "
            f"end: {self.end}, "
            f"score: {self.score}"
        )

    def has_conflict(self, other: "RecognizerResult") -> bool:
        """
        Check if two recognizer results are conflicted or not.

        I have a conflict if:
        1. My indices are the same as the other's and my score is lower.
        2. My indices are contained in another's.

        :param other: RecognizerResult
        :return:
        """
        if self.equal_indices(other):
            return self.score <= other.score
        return other.contains(self)

__eq__(other)

Check two results are equal by using all class fields.

Parameters:

Name Type Description Default
other RecognizerResult

another RecognizerResult

required

Returns:

Type Description
bool

bool

Source code in presidio_analyzer/recognizer_result.py
def __eq__(self, other: "RecognizerResult") -> bool:
    """
    Check two results are equal by using all class fields.

    :param other: another RecognizerResult
    :return: bool
    """
    equal_type = self.entity_type == other.entity_type
    equal_score = self.score == other.score
    return self.equal_indices(other) and equal_type and equal_score

__gt__(other)

Check if one result is greater by using the results indices in the text.

Parameters:

Name Type Description Default
other RecognizerResult

another RecognizerResult

required

Returns:

Type Description
bool

bool

Source code in presidio_analyzer/recognizer_result.py
def __gt__(self, other: "RecognizerResult") -> bool:
    """
    Check if one result is greater by using the results indices in the text.

    :param other: another RecognizerResult
    :return: bool
    """
    if self.start == other.start:
        return self.end > other.end
    return self.start > other.start

__hash__()

Hash the result data by using all class fields.

Returns:

Type Description

int

Source code in presidio_analyzer/recognizer_result.py
def __hash__(self):
    """
    Hash the result data by using all class fields.

    :return: int
    """
    return hash(
        f"{str(self.start)} {str(self.end)} {str(self.score)} {self.entity_type}"
    )

__repr__()

Return a string representation of the instance.

Source code in presidio_analyzer/recognizer_result.py
def __repr__(self) -> str:
    """Return a string representation of the instance."""
    return self.__str__()

__str__()

Return a string representation of the instance.

Source code in presidio_analyzer/recognizer_result.py
def __str__(self) -> str:
    """Return a string representation of the instance."""
    return (
        f"type: {self.entity_type}, "
        f"start: {self.start}, "
        f"end: {self.end}, "
        f"score: {self.score}"
    )

append_analysis_explanation_text(text)

Add text to the analysis explanation.

Source code in presidio_analyzer/recognizer_result.py
def append_analysis_explanation_text(self, text: str) -> None:
    """Add text to the analysis explanation."""
    if self.analysis_explanation:
        self.analysis_explanation.append_textual_explanation_line(text)

contained_in(other)

Check if self is contained in a different RecognizerResult.

Returns:

Type Description
bool

true if contained

Source code in presidio_analyzer/recognizer_result.py
def contained_in(self, other: "RecognizerResult") -> bool:
    """
    Check if self is contained in a different RecognizerResult.

    :return: true if contained
    """
    return self.start >= other.start and self.end <= other.end

contains(other)

Check if one result contains or is equal to another result.

Parameters:

Name Type Description Default
other RecognizerResult

another RecognizerResult

required

Returns:

Type Description
bool

bool

Source code in presidio_analyzer/recognizer_result.py
def contains(self, other: "RecognizerResult") -> bool:
    """
    Check if one result contains or is equal to another result.

    :param other: another RecognizerResult
    :return: bool
    """
    return self.start <= other.start and self.end >= other.end

equal_indices(other)

Check if the indices are equal between two results.

Parameters:

Name Type Description Default
other RecognizerResult

another RecognizerResult

required

Returns:

Type Description
bool
Source code in presidio_analyzer/recognizer_result.py
def equal_indices(self, other: "RecognizerResult") -> bool:
    """
    Check if the indices are equal between two results.

    :param other: another RecognizerResult
    :return:
    """
    return self.start == other.start and self.end == other.end

from_json(data) classmethod

Create RecognizerResult from json.

Parameters:

Name Type Description Default
data Dict

e.g. { "start": 24, "end": 32, "score": 0.8, "entity_type": "NAME" }

required

Returns:

Type Description
RecognizerResult

RecognizerResult

Source code in presidio_analyzer/recognizer_result.py
@classmethod
def from_json(cls, data: Dict) -> "RecognizerResult":
    """
    Create RecognizerResult from json.

    :param data: e.g. {
        "start": 24,
        "end": 32,
        "score": 0.8,
        "entity_type": "NAME"
    }
    :return: RecognizerResult
    """
    score = data.get("score")
    entity_type = data.get("entity_type")
    start = data.get("start")
    end = data.get("end")
    return cls(entity_type, start, end, score)

has_conflict(other)

Check if two recognizer results are conflicted or not.

I have a conflict if: 1. My indices are the same as the other's and my score is lower. 2. My indices are contained in another's.

Parameters:

Name Type Description Default
other RecognizerResult

RecognizerResult

required

Returns:

Type Description
bool
Source code in presidio_analyzer/recognizer_result.py
def has_conflict(self, other: "RecognizerResult") -> bool:
    """
    Check if two recognizer results are conflicted or not.

    I have a conflict if:
    1. My indices are the same as the other's and my score is lower.
    2. My indices are contained in another's.

    :param other: RecognizerResult
    :return:
    """
    if self.equal_indices(other):
        return self.score <= other.score
    return other.contains(self)
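The two conflict rules can be checked on plain spans; the helper below is a hypothetical sketch that mirrors the logic above, not part of the API.

```python
def spans_conflict(start, end, score, other_start, other_end, other_score):
    """Mirror has_conflict on plain spans: conflict when the indices
    are equal and the score is lower or equal, or when this span is
    contained in the other span."""
    if (start, end) == (other_start, other_end):
        return score <= other_score
    # containment check, as in other.contains(self)
    return other_start <= start and other_end >= end
```

Note the asymmetry: with equal indices the result with the strictly higher score "wins" and reports no conflict.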

intersects(other)

Check if self intersects with a different RecognizerResult.

Returns:

Type Description
int

If intersecting, returns the number of intersecting characters. If not, returns 0

Source code in presidio_analyzer/recognizer_result.py
def intersects(self, other: "RecognizerResult") -> int:
    """
    Check if self intersects with a different RecognizerResult.

    :return: If intersecting, returns the number of
    intersecting characters.
    If not, returns 0
    """
    # if they do not overlap the intersection is 0
    if self.end < other.start or other.end < self.start:
        return 0

    # otherwise the intersection is min(end) - max(start)
    return min(self.end, other.end) - max(self.start, other.start)
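The overlap computation is the standard min/max interval formula; a standalone sketch (the helper name is hypothetical):

```python
def overlap_length(start, end, other_start, other_end):
    """Number of overlapping characters between two spans,
    computed as in RecognizerResult.intersects."""
    if end < other_start or other_end < start:
        return 0  # disjoint spans share no characters
    # overlapping region runs from the later start to the earlier end
    return min(end, other_end) - max(start, other_start)
```

Adjacent spans (one's end equal to the other's start) yield 0, since `min(end) - max(start)` collapses to zero.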

to_dict()

Serialize self to dictionary.

Returns:

Type Description
Dict

a dictionary

Source code in presidio_analyzer/recognizer_result.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return self.__dict__

RemoteRecognizer

Bases: ABC, EntityRecognizer

A configuration for a recognizer that runs on a different process / remote machine.

Parameters:

Name Type Description Default
supported_entities List[str]

A list of entities this recognizer can identify

required
name Optional[str]

name of recognizer

required
supported_language str

The language this recognizer can detect entities in

required
version str

Version of this recognizer

required
Source code in presidio_analyzer/remote_recognizer.py
class RemoteRecognizer(ABC, EntityRecognizer):
    """
    A configuration for a recognizer that runs on a different process / remote machine.

    :param supported_entities: A list of entities this recognizer can identify
    :param name: name of recognizer
    :param supported_language: The language this recognizer can detect entities in
    :param version: Version of this recognizer
    """

    def __init__(
        self,
        supported_entities: List[str],
        name: Optional[str],
        supported_language: str,
        version: str,
        context: Optional[List[str]] = None,
    ):
        super().__init__(
            supported_entities=supported_entities,
            name=name,
            supported_language=supported_language,
            version=version,
            context=context,
        )

    def load(self):  # noqa D102
        pass

    @abstractmethod
    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ):  # noqa ANN201
        """
        Call an external service for PII detection.

        :param text: text to be analyzed
        :param entities: Entities that should be looked for
        :param nlp_artifacts: Additional metadata from the NLP engine
        :return: List of identified PII entities
        """

        # 1. Call the external service.
        # 2. Translate results into List[RecognizerResult]
        pass

    @abstractmethod
    def get_supported_entities(self) -> List[str]:  # noqa D102
        pass

analyze(text, entities, nlp_artifacts) abstractmethod

Call an external service for PII detection.

Parameters:

Name Type Description Default
text str

text to be analyzed

required
entities List[str]

Entities that should be looked for

required
nlp_artifacts NlpArtifacts

Additional metadata from the NLP engine

required

Returns:

Type Description

List of identified PII entities

Source code in presidio_analyzer/remote_recognizer.py
@abstractmethod
def analyze(
    self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
):  # noqa ANN201
    """
    Call an external service for PII detection.

    :param text: text to be analyzed
    :param entities: Entities that should be looked for
    :param nlp_artifacts: Additional metadata from the NLP engine
    :return: List of identified PII entities
    """

    # 1. Call the external service.
    # 2. Translate results into List[RecognizerResult]
    pass
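Step 2 in the comments above (translating the service response into results) might look like the following sketch. The response shape (`detections`, `label`, `begin`, `length`, `confidence`) is entirely hypothetical, since every remote service defines its own; plain dicts stand in for RecognizerResult.

```python
def translate_service_response(response):
    """Map a hypothetical remote PII service response into
    RecognizerResult-like dicts (entity_type/start/end/score)."""
    return [
        {
            "entity_type": item["label"],
            "start": item["begin"],
            "end": item["begin"] + item["length"],
            "score": item.get("confidence", 0.5),  # default score is an assumption
        }
        for item in response.get("detections", [])
    ]
```

In a real subclass, `analyze` would perform the HTTP call, run a translation like this, and build `RecognizerResult` objects from the mapped fields.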