
Presidio Analyzer API Reference

Objects at the top of the presidio-analyzer package

presidio_analyzer.AnalyzerEngine

Entry point for Presidio Analyzer.

Orchestrates the detection of PII entities and all related logic.

PARAMETER DESCRIPTION
registry

instance of type RecognizerRegistry

TYPE: RecognizerRegistry DEFAULT: None

nlp_engine

instance of type NlpEngine (for example SpacyNlpEngine)

TYPE: NlpEngine DEFAULT: None

app_tracer

instance of type AppTracer, used to trace the logic used during each request for interpretability reasons.

TYPE: AppTracer DEFAULT: None

log_decision_process

Whether the decision process within the analyzer should be logged.

TYPE: bool DEFAULT: False

default_score_threshold

Minimum confidence value for detected entities to be returned

TYPE: float DEFAULT: 0

supported_languages

List of possible languages this engine could be run on. Used for loading the right NLP models and recognizers for these languages.

TYPE: List[str] DEFAULT: None

context_aware_enhancer

instance of type ContextAwareEnhancer for enhancing confidence scores based on context words (a LemmaContextAwareEnhancer is created by default if None is passed)

TYPE: Optional[ContextAwareEnhancer] DEFAULT: None
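The idea behind context-aware enhancement can be sketched in a few lines. This is an illustration only, not the LemmaContextAwareEnhancer implementation: the `Detection` type and `enhance_with_context` helper are made up, and a simple token window stands in for the lemmatized NLP artifacts the real enhancer uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:  # made-up stand-in for a recognizer result
    start: int
    end: int
    score: float


def enhance_with_context(text: str, det: Detection, context_words: set,
                         window: int = 5, boost: float = 0.25,
                         max_score: float = 1.0) -> Detection:
    """Boost the detection score when a context word (e.g. "phone")
    appears within `window` tokens of the detected entity."""
    tokens = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    # index of the first token inside the detected span
    ent_idx = next(i for i, (_, pos) in enumerate(tokens) if pos >= det.start)
    nearby = [w for w, _ in tokens[max(0, ent_idx - window): ent_idx + window]]
    if any(w in context_words for w in nearby):
        det.score = min(max_score, det.score + boost)
    return det


text = "My phone number is 2125555555"
det = enhance_with_context(text, Detection(start=19, end=29, score=0.5),
                           context_words={"phone", "telephone"})
print(det.score)  # 0.75
```

The real enhancer works on lemmas and recognizer-specific context words, but the net effect is the same: a nearby context word raises the confidence of an otherwise weak match.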

METHOD DESCRIPTION
get_recognizers

Return a list of PII recognizers currently loaded.

get_supported_entities

Return a list of the entities that can be detected.

analyze

Find PII entities in text using different PII recognizers for a given language.

Source code in presidio_analyzer/analyzer_engine.py
class AnalyzerEngine:
    """
    Entry point for Presidio Analyzer.

    Orchestrating the detection of PII entities and all related logic.

    :param registry: instance of type RecognizerRegistry
    :param nlp_engine: instance of type NlpEngine
    (for example SpacyNlpEngine)
    :param app_tracer: instance of type AppTracer, used to trace the logic
    used during each request for interpretability reasons.
    :param log_decision_process: bool,
    defines whether the decision process within the analyzer should be logged or not.
    :param default_score_threshold: Minimum confidence value
    for detected entities to be returned
    :param supported_languages: List of possible languages this engine could be run on.
    Used for loading the right NLP models and recognizers for these languages.
    :param context_aware_enhancer: instance of type ContextAwareEnhancer for enhancing
    confidence score based on context words, (LemmaContextAwareEnhancer will be created
    by default if None passed)
    """

    def __init__(
        self,
        registry: RecognizerRegistry = None,
        nlp_engine: NlpEngine = None,
        app_tracer: AppTracer = None,
        log_decision_process: bool = False,
        default_score_threshold: float = 0,
        supported_languages: List[str] = None,
        context_aware_enhancer: Optional[ContextAwareEnhancer] = None,
    ):
        if not supported_languages:
            supported_languages = ["en"]

        if not nlp_engine:
            logger.info("nlp_engine not provided, creating default.")
            provider = NlpEngineProvider()
            nlp_engine = provider.create_engine()

        if not app_tracer:
            app_tracer = AppTracer()
        self.app_tracer = app_tracer

        self.supported_languages = supported_languages

        self.nlp_engine = nlp_engine
        if not self.nlp_engine.is_loaded():
            self.nlp_engine.load()

        if not registry:
            logger.info("registry not provided, creating default.")
            provider = RecognizerRegistryProvider(
                registry_configuration={"supported_languages": self.supported_languages}
            )
            registry = provider.create_recognizer_registry()
            registry.add_nlp_recognizer(nlp_engine=self.nlp_engine)
        else:
            if Counter(registry.supported_languages) != Counter(
                self.supported_languages
            ):
                raise ValueError(
                    f"Misconfigured engine, supported languages have to be consistent. "
                    f"registry.supported_languages: {registry.supported_languages}, "
                    f"analyzer_engine.supported_languages: {self.supported_languages}"
                )

        # added to support the previous interface
        if not registry.recognizers:
            registry.load_predefined_recognizers(
                nlp_engine=self.nlp_engine, languages=self.supported_languages
            )

        self.registry = registry

        self.log_decision_process = log_decision_process
        self.default_score_threshold = default_score_threshold

        if not context_aware_enhancer:
            logger.debug(
                "context aware enhancer not provided, creating default"
                + " lemma based enhancer."
            )
            context_aware_enhancer = LemmaContextAwareEnhancer()

        self.context_aware_enhancer = context_aware_enhancer

    def get_recognizers(self, language: Optional[str] = None) -> List[EntityRecognizer]:
        """
        Return a list of PII recognizers currently loaded.

        :param language: Return the recognizers supporting a given language.
        :return: List of [Recognizer] as a RecognizersAllResponse
        """
        if not language:
            languages = self.supported_languages
        else:
            languages = [language]

        recognizers = []
        for language in languages:
            logger.info(f"Fetching all recognizers for language {language}")
            recognizers.extend(
                self.registry.get_recognizers(language=language, all_fields=True)
            )

        return list(set(recognizers))

    def get_supported_entities(self, language: Optional[str] = None) -> List[str]:
        """
        Return a list of the entities that can be detected.

        :param language: Return only entities supported in a specific language.
        :return: List of entity names
        """
        recognizers = self.get_recognizers(language=language)
        supported_entities = []
        for recognizer in recognizers:
            supported_entities.extend(recognizer.get_supported_entities())

        return list(set(supported_entities))

    def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
        context: Optional[List[str]] = None,
        allow_list: Optional[List[str]] = None,
        allow_list_match: Optional[str] = "exact",
        regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:
        """
        Find PII entities in text using different PII recognizers for a given language.

        :param text: the text to analyze
        :param language: the language of the text
        :param entities: List of PII entities that should be looked for in the text.
        If entities=None then all entities are looked for.
        :param correlation_id: cross call ID for this request
        :param score_threshold: A minimum value for which
        to return an identified entity
        :param return_decision_process: Whether the analysis decision process steps
        returned in the response.
        :param ad_hoc_recognizers: List of recognizers which will be used only
        for this specific request.
        :param context: List of context words to enhance confidence score if matched
        with the recognized entity's recognizer context
        :param allow_list: List of words that the user defines as being allowed to keep
        in the text
        :param allow_list_match: How the allow_list should be interpreted; either as "exact" or as "regex".
        - If `regex`, results which match with any regex condition in the allow_list would be allowed and not be returned as potential PII.
        - if `exact`, results which exactly match any value in the allow_list would be allowed and not be returned as potential PII.
        :param regex_flags: regex flags to be used for when allow_list_match is "regex"
        :param nlp_artifacts: precomputed NlpArtifacts
        :return: an array of the found entities in the text

        :Example:

        ```python
        from presidio_analyzer import AnalyzerEngine

        # Set up the engine, loads the NLP module (spaCy model by default)
        # and other PII recognizers
        analyzer = AnalyzerEngine()

        # Call analyzer to get results
        results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en')
        print(results)
        ```

        """  # noqa: E501

        all_fields = not entities

        recognizers = self.registry.get_recognizers(
            language=language,
            entities=entities,
            all_fields=all_fields,
            ad_hoc_recognizers=ad_hoc_recognizers,
        )

        if all_fields:
            # Since all_fields=True, list all entities by iterating
            # over all recognizers
            entities = self.get_supported_entities(language=language)

        # run the nlp pipeline over the given text, store the results in
        # a NlpArtifacts instance
        if not nlp_artifacts:
            nlp_artifacts = self.nlp_engine.process_text(text, language)

        if self.log_decision_process:
            self.app_tracer.trace(
                correlation_id, "nlp artifacts:" + nlp_artifacts.to_json()
            )

        results = []
        for recognizer in recognizers:
            # Lazy loading of the relevant recognizers
            if not recognizer.is_loaded:
                recognizer.load()
                recognizer.is_loaded = True

            # analyze using the current recognizer and append the results
            current_results = recognizer.analyze(
                text=text, entities=entities, nlp_artifacts=nlp_artifacts
            )
            if current_results:
                # add recognizer name to recognition metadata inside results
                # if not exists
                self.__add_recognizer_id_if_not_exists(current_results, recognizer)
                results.extend(current_results)

        results = self._enhance_using_context(
            text, results, nlp_artifacts, recognizers, context
        )

        if self.log_decision_process:
            self.app_tracer.trace(
                correlation_id,
                json.dumps([str(result.to_dict()) for result in results]),
            )

        # Remove duplicates or low score results
        results = EntityRecognizer.remove_duplicates(results)
        results = self.__remove_low_scores(results, score_threshold)

        if allow_list:
            results = self._remove_allow_list(
                results, allow_list, text, regex_flags, allow_list_match
            )

        if not return_decision_process:
            results = self.__remove_decision_process(results)

        return results

    def _enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Enhance confidence score using context words.

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The nlp artifacts contains elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """
        results = []

        for recognizer in recognizers:
            recognizer_results = [
                r
                for r in raw_results
                if r.recognition_metadata[RecognizerResult.RECOGNIZER_IDENTIFIER_KEY]
                == recognizer.id
            ]
            other_recognizer_results = [
                r
                for r in raw_results
                if r.recognition_metadata[RecognizerResult.RECOGNIZER_IDENTIFIER_KEY]
                != recognizer.id
            ]

            # enhance score using context in recognizer level if implemented
            recognizer_results = recognizer.enhance_using_context(
                text=text,
                # each recognizer will get access to all recognizer results
                # to allow related entities context enhancement
                raw_recognizer_results=recognizer_results,
                other_raw_recognizer_results=other_recognizer_results,
                nlp_artifacts=nlp_artifacts,
                context=context,
            )

            results.extend(recognizer_results)

        # Update results in case surrounding words or external context are relevant to
        # the context words.
        results = self.context_aware_enhancer.enhance_using_context(
            text=text,
            raw_results=results,
            nlp_artifacts=nlp_artifacts,
            recognizers=recognizers,
            context=context,
        )

        return results

    def __remove_low_scores(
        self, results: List[RecognizerResult], score_threshold: float = None
    ) -> List[RecognizerResult]:
        """
        Remove results for which the confidence is lower than the threshold.

        :param results: List of RecognizerResult
        :param score_threshold: float value for minimum possible confidence
        :return: List[RecognizerResult]
        """
        if score_threshold is None:
            score_threshold = self.default_score_threshold

        new_results = [result for result in results if result.score >= score_threshold]
        return new_results

    @staticmethod
    def _remove_allow_list(
        results: List[RecognizerResult],
        allow_list: List[str],
        text: str,
        regex_flags: Optional[int],
        allow_list_match: str,
    ) -> List[RecognizerResult]:
        """
        Remove results which are part of the allow list.

        :param results: List of RecognizerResult
        :param allow_list: list of allowed terms
        :param text: the text to analyze
        :param regex_flags: regex flags to be used for when allow_list_match is "regex"
        :param allow_list_match: How the allow_list
        should be interpreted; either as "exact" or as "regex"
        :return: List[RecognizerResult]
        """
        new_results = []
        if allow_list_match == "regex":
            pattern = "|".join(allow_list)
            re_compiled = re.compile(pattern, flags=regex_flags)

            for result in results:
                word = text[result.start : result.end]

                # if the word is not specified to be allowed, keep in the PII entities
                if not re_compiled.search(word):
                    new_results.append(result)
        elif allow_list_match == "exact":
            for result in results:
                word = text[result.start : result.end]

                # if the word is not specified to be allowed, keep in the PII entities
                if word not in allow_list:
                    new_results.append(result)
        else:
            raise ValueError(
                "allow_list_match must either be set to 'exact' or 'regex'."
            )

        return new_results

    @staticmethod
    def __add_recognizer_id_if_not_exists(
        results: List[RecognizerResult], recognizer: EntityRecognizer
    ) -> None:
        """Ensure recognition metadata with recognizer id existence.

        Ensure recognizer result list contains recognizer id inside recognition
        metadata dictionary, and if not create it. recognizer_id is needed
        for context aware enhancement.

        :param results: List of RecognizerResult
        :param recognizer: Entity recognizer
        """
        for result in results:
            if not result.recognition_metadata:
                result.recognition_metadata = dict()
            if (
                RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                not in result.recognition_metadata
            ):
                result.recognition_metadata[
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                ] = recognizer.id
            if RecognizerResult.RECOGNIZER_NAME_KEY not in result.recognition_metadata:
                result.recognition_metadata[RecognizerResult.RECOGNIZER_NAME_KEY] = (
                    recognizer.name
                )

    @staticmethod
    def __remove_decision_process(
        results: List[RecognizerResult],
    ) -> List[RecognizerResult]:
        """Remove decision process / analysis explanation from response."""

        for result in results:
            result.analysis_explanation = None

        return results

get_recognizers

get_recognizers(language: Optional[str] = None) -> List[EntityRecognizer]

Return a list of PII recognizers currently loaded.

PARAMETER DESCRIPTION
language

Return the recognizers supporting a given language.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
List[EntityRecognizer]

List of [Recognizer] as a RecognizersAllResponse

Source code in presidio_analyzer/analyzer_engine.py
def get_recognizers(self, language: Optional[str] = None) -> List[EntityRecognizer]:
    """
    Return a list of PII recognizers currently loaded.

    :param language: Return the recognizers supporting a given language.
    :return: List of [Recognizer] as a RecognizersAllResponse
    """
    if not language:
        languages = self.supported_languages
    else:
        languages = [language]

    recognizers = []
    for language in languages:
        logger.info(f"Fetching all recognizers for language {language}")
        recognizers.extend(
            self.registry.get_recognizers(language=language, all_fields=True)
        )

    return list(set(recognizers))

get_supported_entities

get_supported_entities(language: Optional[str] = None) -> List[str]

Return a list of the entities that can be detected.

PARAMETER DESCRIPTION
language

Return only entities supported in a specific language.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
List[str]

List of entity names

Source code in presidio_analyzer/analyzer_engine.py
def get_supported_entities(self, language: Optional[str] = None) -> List[str]:
    """
    Return a list of the entities that can be detected.

    :param language: Return only entities supported in a specific language.
    :return: List of entity names
    """
    recognizers = self.get_recognizers(language=language)
    supported_entities = []
    for recognizer in recognizers:
        supported_entities.extend(recognizer.get_supported_entities())

    return list(set(supported_entities))

analyze

analyze(
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    context: Optional[List[str]] = None,
    allow_list: Optional[List[str]] = None,
    allow_list_match: Optional[str] = "exact",
    regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
    nlp_artifacts: Optional[NlpArtifacts] = None,
) -> List[RecognizerResult]

Find PII entities in text using different PII recognizers for a given language.

:Example:

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en')
print(results)
PARAMETER DESCRIPTION
text

the text to analyze

TYPE: str

language

the language of the text

TYPE: str

entities

List of PII entities that should be looked for in the text. If entities=None then all entities are looked for.

TYPE: Optional[List[str]] DEFAULT: None

correlation_id

cross call ID for this request

TYPE: Optional[str] DEFAULT: None

score_threshold

A minimum value for which to return an identified entity

TYPE: Optional[float] DEFAULT: None

return_decision_process

Whether the analysis decision process steps should be returned in the response.

TYPE: Optional[bool] DEFAULT: False

ad_hoc_recognizers

List of recognizers which will be used only for this specific request.

TYPE: Optional[List[EntityRecognizer]] DEFAULT: None

context

List of context words to enhance confidence score if matched with the recognized entity's recognizer context

TYPE: Optional[List[str]] DEFAULT: None

allow_list

List of words that the user defines as being allowed to keep in the text

TYPE: Optional[List[str]] DEFAULT: None

allow_list_match

How the allow_list should be interpreted; either as "exact" or as "regex".

- If `regex`, results that match any regex condition in the allow_list are allowed and not returned as potential PII.
- If `exact`, results that exactly match any value in the allow_list are allowed and not returned as potential PII.

TYPE: Optional[str] DEFAULT: 'exact'

regex_flags

regex flags to be used when allow_list_match is "regex"

TYPE: Optional[int] DEFAULT: DOTALL | MULTILINE | IGNORECASE

nlp_artifacts

precomputed NlpArtifacts

TYPE: Optional[NlpArtifacts] DEFAULT: None
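The allow_list / allow_list_match semantics described above can be sketched as a stand-alone function. This is an illustration of the documented behavior, not the library's internal implementation; the `filter_allow_list` helper name is made up.

```python
import re
from typing import List, Tuple


def filter_allow_list(text: str, spans: List[Tuple[int, int]],
                      allow_list: List[str], allow_list_match: str = "exact",
                      regex_flags: int = re.DOTALL | re.MULTILINE | re.IGNORECASE):
    """Drop detected (start, end) spans whose matched text is allowed."""
    kept = []
    for start, end in spans:
        word = text[start:end]
        if allow_list_match == "regex":
            # any regex condition in the allow_list marks the word as allowed
            pattern = re.compile("|".join(allow_list), flags=regex_flags)
            allowed = bool(pattern.search(word))
        elif allow_list_match == "exact":
            allowed = word in allow_list
        else:
            raise ValueError("allow_list_match must be 'exact' or 'regex'.")
        if not allowed:
            kept.append((start, end))
    return kept


text = "Contact Bill at bill@example.com"
spans = [(8, 12), (16, 32)]  # "Bill", "bill@example.com"
# exact matching is case-sensitive: "Bill" != "bill", so both spans are kept
print(filter_allow_list(text, spans, ["bill"], "exact"))
# regex matching honors the default IGNORECASE flag, so "Bill" is dropped
print(filter_allow_list(text, spans, [r"^bill$"], "regex"))
```

Note the practical difference: `exact` compares raw strings, while `regex` inherits the default IGNORECASE / MULTILINE / DOTALL flags unless `regex_flags` is overridden.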

RETURNS DESCRIPTION
List[RecognizerResult]

an array of the found entities in the text
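The returned results reference the analyzed text by offset, so a common follow-up step is to slice or redact those spans. A minimal sketch: `Result` here is a hypothetical stand-in for presidio's RecognizerResult (whose start, end, and score fields the source code also uses), and `redact` is a made-up helper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:  # stand-in for presidio's RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float


def redact(text: str, results: List[Result], threshold: float = 0.0) -> str:
    """Replace each detected span with <ENTITY_TYPE>, processing the
    highest offsets first so earlier offsets stay valid."""
    for r in sorted(results, key=lambda r: r.start, reverse=True):
        if r.score >= threshold:
            text = text[:r.start] + f"<{r.entity_type}>" + text[r.end:]
    return text


text = "My phone number is 212-555-5555"
results = [Result("PHONE_NUMBER", 19, 31, 0.75)]
print(redact(text, results))  # My phone number is <PHONE_NUMBER>
```

For production anonymization, Presidio provides a dedicated anonymizer package; this sketch only shows how the offsets in the analyzer's results map back to the input text.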

Source code in presidio_analyzer/analyzer_engine.py
def analyze(
    self,
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    context: Optional[List[str]] = None,
    allow_list: Optional[List[str]] = None,
    allow_list_match: Optional[str] = "exact",
    regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
    nlp_artifacts: Optional[NlpArtifacts] = None,
) -> List[RecognizerResult]:
    """
    Find PII entities in text using different PII recognizers for a given language.

    :param text: the text to analyze
    :param language: the language of the text
    :param entities: List of PII entities that should be looked for in the text.
    If entities=None then all entities are looked for.
    :param correlation_id: cross call ID for this request
    :param score_threshold: A minimum value for which
    to return an identified entity
    :param return_decision_process: Whether the analysis decision process steps
    returned in the response.
    :param ad_hoc_recognizers: List of recognizers which will be used only
    for this specific request.
    :param context: List of context words to enhance confidence score if matched
    with the recognized entity's recognizer context
    :param allow_list: List of words that the user defines as being allowed to keep
    in the text
    :param allow_list_match: How the allow_list should be interpreted; either as "exact" or as "regex".
    - If `regex`, results which match with any regex condition in the allow_list would be allowed and not be returned as potential PII.
    - if `exact`, results which exactly match any value in the allow_list would be allowed and not be returned as potential PII.
    :param regex_flags: regex flags to be used for when allow_list_match is "regex"
    :param nlp_artifacts: precomputed NlpArtifacts
    :return: an array of the found entities in the text

    :Example:

    ```python
    from presidio_analyzer import AnalyzerEngine

    # Set up the engine, loads the NLP module (spaCy model by default)
    # and other PII recognizers
    analyzer = AnalyzerEngine()

    # Call analyzer to get results
    results = analyzer.analyze(text='My phone number is 212-555-5555', entities=['PHONE_NUMBER'], language='en')
    print(results)
    ```

    """  # noqa: E501

    all_fields = not entities

    recognizers = self.registry.get_recognizers(
        language=language,
        entities=entities,
        all_fields=all_fields,
        ad_hoc_recognizers=ad_hoc_recognizers,
    )

    if all_fields:
        # Since all_fields=True, list all entities by iterating
        # over all recognizers
        entities = self.get_supported_entities(language=language)

    # run the nlp pipeline over the given text, store the results in
    # a NlpArtifacts instance
    if not nlp_artifacts:
        nlp_artifacts = self.nlp_engine.process_text(text, language)

    if self.log_decision_process:
        self.app_tracer.trace(
            correlation_id, "nlp artifacts:" + nlp_artifacts.to_json()
        )

    results = []
    for recognizer in recognizers:
        # Lazy loading of the relevant recognizers
        if not recognizer.is_loaded:
            recognizer.load()
            recognizer.is_loaded = True

        # analyze using the current recognizer and append the results
        current_results = recognizer.analyze(
            text=text, entities=entities, nlp_artifacts=nlp_artifacts
        )
        if current_results:
            # add recognizer name to recognition metadata inside results
            # if not exists
            self.__add_recognizer_id_if_not_exists(current_results, recognizer)
            results.extend(current_results)

    results = self._enhance_using_context(
        text, results, nlp_artifacts, recognizers, context
    )

    if self.log_decision_process:
        self.app_tracer.trace(
            correlation_id,
            json.dumps([str(result.to_dict()) for result in results]),
        )

    # Remove duplicates or low score results
    results = EntityRecognizer.remove_duplicates(results)
    results = self.__remove_low_scores(results, score_threshold)

    if allow_list:
        results = self._remove_allow_list(
            results, allow_list, text, regex_flags, allow_list_match
        )

    if not return_decision_process:
        results = self.__remove_decision_process(results)

    return results

presidio_analyzer.analyzer_engine_provider.AnalyzerEngineProvider

Utility class for loading the Presidio Analyzer.

Use this class to load the Presidio analyzer engine from a YAML file.

PARAMETER DESCRIPTION
analyzer_engine_conf_file

the path to the analyzer configuration file

TYPE: Optional[Union[Path, str]] DEFAULT: None

nlp_engine_conf_file

the path to the nlp engine configuration file

TYPE: Optional[Union[Path, str]] DEFAULT: None

recognizer_registry_conf_file

the path to the recognizer registry configuration file

TYPE: Optional[Union[Path, str]] DEFAULT: None

METHOD DESCRIPTION
get_configuration

Retrieve the analyzer engine configuration from the provided file.

create_engine

Load Presidio Analyzer from yaml configuration file.
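A minimal configuration sketch for reference. The top-level keys (`supported_languages`, `default_score_threshold`, `recognizer_registry`) are the ones create_engine reads in the source below; the exact file layout, and additional keys such as NLP engine settings, should be checked against Presidio's customization documentation.

```yaml
# Hypothetical analyzer configuration file for AnalyzerEngineProvider
supported_languages:
  - en
default_score_threshold: 0.4
recognizer_registry:
  supported_languages:
    - en
```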

Source code in presidio_analyzer/analyzer_engine_provider.py
class AnalyzerEngineProvider:
    """
    Utility function for loading Presidio Analyzer.

    Use this class to load presidio analyzer engine from a yaml file

    :param analyzer_engine_conf_file: the path to the analyzer configuration file
    :param nlp_engine_conf_file: the path to the nlp engine configuration file
    :param recognizer_registry_conf_file: the path to the recognizer
    registry configuration file
    """

    def __init__(
        self,
        analyzer_engine_conf_file: Optional[Union[Path, str]] = None,
        nlp_engine_conf_file: Optional[Union[Path, str]] = None,
        recognizer_registry_conf_file: Optional[Union[Path, str]] = None,
    ):
        self.configuration = self.get_configuration(conf_file=analyzer_engine_conf_file)
        self.nlp_engine_conf_file = nlp_engine_conf_file
        self.recognizer_registry_conf_file = recognizer_registry_conf_file

    def get_configuration(
        self, conf_file: Optional[Union[Path, str]]
    ) -> Union[Dict[str, Any]]:
        """Retrieve the analyzer engine configuration from the provided file."""

        if not conf_file:
            default_conf_file = self._get_full_conf_path()
            with open(default_conf_file) as file:
                configuration = yaml.safe_load(file)
            logger.info(
                f"Analyzer Engine configuration file "
                f"not provided. Using {default_conf_file}."
            )
        else:
            try:
                logger.info(f"Reading analyzer configuration from {conf_file}")
                with open(conf_file) as file:
                    configuration = yaml.safe_load(file)
            except OSError:
                logger.warning(
                    f"configuration file {conf_file} not found.  "
                    f"Using default config."
                )
                with open(self._get_full_conf_path()) as file:
                    configuration = yaml.safe_load(file)
            except Exception:
                print(f"Failed to parse file {conf_file}, resorting to default")
                with open(self._get_full_conf_path()) as file:
                    configuration = yaml.safe_load(file)

        return configuration

    def create_engine(self) -> AnalyzerEngine:
        """
        Load Presidio Analyzer from yaml configuration file.

        :return: analyzer engine initialized with yaml configuration
        """

        nlp_engine = self._load_nlp_engine()
        supported_languages = self.configuration.get("supported_languages", ["en"])
        default_score_threshold = self.configuration.get("default_score_threshold", 0)

        registry = self._load_recognizer_registry(
            supported_languages=supported_languages, nlp_engine=nlp_engine
        )

        analyzer = AnalyzerEngine(
            nlp_engine=nlp_engine,
            registry=registry,
            supported_languages=supported_languages,
            default_score_threshold=default_score_threshold,
        )

        return analyzer

    def _load_recognizer_registry(
        self,
        supported_languages: List[str],
        nlp_engine: NlpEngine,
    ) -> RecognizerRegistry:
        if self.recognizer_registry_conf_file:
            logger.info(
                f"Reading recognizer registry "
                f"configuration from {self.recognizer_registry_conf_file}"
            )
            provider = RecognizerRegistryProvider(
                conf_file=self.recognizer_registry_conf_file
            )
        elif "recognizer_registry" in self.configuration:
            registry_configuration = self.configuration["recognizer_registry"]
            provider = RecognizerRegistryProvider(
                registry_configuration={
                    **registry_configuration,
                    "supported_languages": supported_languages,
                }
            )
        else:
            logger.warning(
                "configuration file is missing for 'recognizer_registry'. "
                "Using default configuration for recognizer registry"
            )
            registry_configuration = self.configuration.get("recognizer_registry", {})
            provider = RecognizerRegistryProvider(
                registry_configuration={
                    **registry_configuration,
                    "supported_languages": supported_languages,
                }
            )
        registry = provider.create_recognizer_registry()
        if nlp_engine:
            registry.add_nlp_recognizer(nlp_engine)
        return registry

    def _load_nlp_engine(self) -> NlpEngine:
        if self.nlp_engine_conf_file:
            logger.info(f"Reading nlp configuration from {self.nlp_engine_conf_file}")
            provider = NlpEngineProvider(conf_file=self.nlp_engine_conf_file)
        elif "nlp_configuration" in self.configuration:
            nlp_configuration = self.configuration["nlp_configuration"]
            provider = NlpEngineProvider(nlp_configuration=nlp_configuration)
        else:
            logger.warning(
                "configuration file is missing for 'nlp_configuration'."
                "Using default configuration for nlp engine"
            )
            provider = NlpEngineProvider()

        return provider.create_engine()

    @staticmethod
    def _get_full_conf_path(
        default_conf_file: Union[Path, str] = "default_analyzer.yaml",
    ) -> Path:
        """Return a Path to the default conf file."""
        return Path(Path(__file__).parent, "conf", default_conf_file)

get_configuration

get_configuration(
    conf_file: Optional[Union[Path, str]],
) -> Union[Dict[str, Any]]

Retrieve the analyzer engine configuration from the provided file.

Source code in presidio_analyzer/analyzer_engine_provider.py
def get_configuration(
    self, conf_file: Optional[Union[Path, str]]
) -> Union[Dict[str, Any]]:
    """Retrieve the analyzer engine configuration from the provided file."""

    if not conf_file:
        default_conf_file = self._get_full_conf_path()
        with open(default_conf_file) as file:
            configuration = yaml.safe_load(file)
        logger.info(
            f"Analyzer Engine configuration file "
            f"not provided. Using {default_conf_file}."
        )
    else:
        try:
            logger.info(f"Reading analyzer configuration from {conf_file}")
            with open(conf_file) as file:
                configuration = yaml.safe_load(file)
        except OSError:
            logger.warning(
                f"configuration file {conf_file} not found.  "
                f"Using default config."
            )
            with open(self._get_full_conf_path()) as file:
                configuration = yaml.safe_load(file)
        except Exception:
            print(f"Failed to parse file {conf_file}, resorting to default")
            with open(self._get_full_conf_path()) as file:
                configuration = yaml.safe_load(file)

    return configuration

create_engine

create_engine() -> AnalyzerEngine

Load Presidio Analyzer from yaml configuration file.

RETURNS DESCRIPTION
AnalyzerEngine

analyzer engine initialized with yaml configuration

Source code in presidio_analyzer/analyzer_engine_provider.py
def create_engine(self) -> AnalyzerEngine:
    """
    Load Presidio Analyzer from yaml configuration file.

    :return: analyzer engine initialized with yaml configuration
    """

    nlp_engine = self._load_nlp_engine()
    supported_languages = self.configuration.get("supported_languages", ["en"])
    default_score_threshold = self.configuration.get("default_score_threshold", 0)

    registry = self._load_recognizer_registry(
        supported_languages=supported_languages, nlp_engine=nlp_engine
    )

    analyzer = AnalyzerEngine(
        nlp_engine=nlp_engine,
        registry=registry,
        supported_languages=supported_languages,
        default_score_threshold=default_score_threshold,
    )

    return analyzer
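
As a reference, here is a minimal configuration fragment matching the keys `create_engine` reads (`supported_languages`, `default_score_threshold`, `nlp_configuration`, `recognizer_registry`). The top-level key names come from the code above; the nested values are illustrative assumptions, not the shipped `default_analyzer.yaml`:

```yaml
supported_languages:
  - en
default_score_threshold: 0.4
nlp_configuration:
  nlp_engine_name: spacy
  models:
    - lang_code: en
      model_name: en_core_web_lg
recognizer_registry:
  recognizers:
    - name: CreditCardRecognizer
      supported_languages: [en]
```

Any key omitted here falls back to the defaults shown in `create_engine` (for example, `supported_languages` defaults to `["en"]` and `default_score_threshold` to `0`).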

presidio_analyzer.analysis_explanation.AnalysisExplanation

Hold tracing information to explain why PII entities were identified as such.

PARAMETER DESCRIPTION
recognizer

name of recognizer that made the decision

TYPE: str

original_score

recognizer's confidence in result

TYPE: float

pattern_name

name of pattern (if decision was made by a PatternRecognizer)

TYPE: str DEFAULT: None

pattern

regex pattern that was applied (if PatternRecognizer)

TYPE: str DEFAULT: None

validation_result

result of a validation (e.g. checksum)

TYPE: float DEFAULT: None

textual_explanation

Free text describing the decision made by the logic or model

TYPE: str DEFAULT: None

METHOD DESCRIPTION
set_improved_score

Update the score and calculate the difference from the original score.

set_supportive_context_word

Set the context word which helped increase the score.

append_textual_explanation_line

Append a new line to textual_explanation field.

to_dict

Serialize self to dictionary.

Source code in presidio_analyzer/analysis_explanation.py
class AnalysisExplanation:
    """
    Hold tracing information to explain why PII entities were identified as such.

    :param recognizer: name of recognizer that made the decision
    :param original_score: recognizer's confidence in result
    :param pattern_name: name of pattern
            (if decision was made by a PatternRecognizer)
    :param pattern: regex pattern that was applied (if PatternRecognizer)
    :param validation_result: result of a validation (e.g. checksum)
    :param textual_explanation: Free text for describing
            a decision of a logic or model
    """

    def __init__(
        self,
        recognizer: str,
        original_score: float,
        pattern_name: str = None,
        pattern: str = None,
        validation_result: float = None,
        textual_explanation: str = None,
        regex_flags: int = None,
    ):
        self.recognizer = recognizer
        self.pattern_name = pattern_name
        self.pattern = pattern
        self.original_score = original_score
        self.score = original_score
        self.textual_explanation = textual_explanation
        self.score_context_improvement = 0
        self.supportive_context_word = ""
        self.validation_result = validation_result
        self.regex_flags = regex_flags

    def __repr__(self):
        """Create string representation of the object."""
        return str(self.__dict__)

    def set_improved_score(self, score: float) -> None:
        """Update the score and calculate the difference from the original score."""
        self.score = score
        self.score_context_improvement = self.score - self.original_score

    def set_supportive_context_word(self, word: str) -> None:
        """Set the context word which helped increase the score."""
        self.supportive_context_word = word

    def append_textual_explanation_line(self, text: str) -> None:
        """Append a new line to textual_explanation field."""
        if self.textual_explanation is None:
            self.textual_explanation = text
        else:
            self.textual_explanation = f"{self.textual_explanation}\n{text}"

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return self.__dict__
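
The score-improvement bookkeeping above can be sketched standalone. This is an illustrative re-implementation of the logic shown in the source listing, not the `presidio_analyzer` class itself:

```python
# Minimal re-implementation of the AnalysisExplanation scoring logic
# shown above (illustrative sketch, not the presidio_analyzer class).
class Explanation:
    def __init__(self, recognizer: str, original_score: float):
        self.recognizer = recognizer
        self.original_score = original_score
        self.score = original_score
        self.score_context_improvement = 0.0
        self.textual_explanation = None

    def set_improved_score(self, score: float) -> None:
        # Keep the delta so the context contribution stays visible
        self.score = score
        self.score_context_improvement = self.score - self.original_score

    def append_textual_explanation_line(self, text: str) -> None:
        if self.textual_explanation is None:
            self.textual_explanation = text
        else:
            self.textual_explanation = f"{self.textual_explanation}\n{text}"

exp = Explanation("PhoneRecognizer", 0.6)
exp.set_improved_score(0.85)
print(round(exp.score_context_improvement, 2))  # 0.25
```

Storing both `original_score` and the context delta (rather than overwriting the score in place) is what lets the decision-process trace show how much a context word contributed.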

set_improved_score

set_improved_score(score: float) -> None

Update the score and calculate the difference from the original score.

Source code in presidio_analyzer/analysis_explanation.py
def set_improved_score(self, score: float) -> None:
    """Update the score and calculate the difference from the original score."""
    self.score = score
    self.score_context_improvement = self.score - self.original_score

set_supportive_context_word

set_supportive_context_word(word: str) -> None

Set the context word which helped increase the score.

Source code in presidio_analyzer/analysis_explanation.py
def set_supportive_context_word(self, word: str) -> None:
    """Set the context word which helped increase the score."""
    self.supportive_context_word = word

append_textual_explanation_line

append_textual_explanation_line(text: str) -> None

Append a new line to textual_explanation field.

Source code in presidio_analyzer/analysis_explanation.py
def append_textual_explanation_line(self, text: str) -> None:
    """Append a new line to textual_explanation field."""
    if self.textual_explanation is None:
        self.textual_explanation = text
    else:
        self.textual_explanation = f"{self.textual_explanation}\n{text}"

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/analysis_explanation.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return self.__dict__

presidio_analyzer.recognizer_result.RecognizerResult

Recognizer Result represents the findings of the detected entity.

Result of a recognizer analyzing the text.

PARAMETER DESCRIPTION
entity_type

the type of the entity

TYPE: str

start

the start location of the detected entity

TYPE: int

end

the end location of the detected entity

TYPE: int

score

the score of the detection

TYPE: float

analysis_explanation

contains the explanation of why this entity was identified

TYPE: AnalysisExplanation DEFAULT: None

recognition_metadata

a dictionary of metadata used in recognizer-specific cases, for example specific recognized context words and the recognizer name

TYPE: Dict DEFAULT: None

METHOD DESCRIPTION
append_analysis_explanation_text

Add text to the analysis explanation.

to_dict

Serialize self to dictionary.

from_json

Create RecognizerResult from json.

intersects

Check if self intersects with a different RecognizerResult.

contained_in

Check if self is contained in a different RecognizerResult.

contains

Check if one result is contained or equal to another result.

equal_indices

Check if the indices are equal between two results.

has_conflict

Check if two recognizer results are conflicted or not.

Source code in presidio_analyzer/recognizer_result.py
class RecognizerResult:
    """
    Recognizer Result represents the findings of the detected entity.

    Result of a recognizer analyzing the text.

    :param entity_type: the type of the entity
    :param start: the start location of the detected entity
    :param end: the end location of the detected entity
    :param score: the score of the detection
    :param analysis_explanation: contains the explanation of why this
                                 entity was identified
    :param recognition_metadata: a dictionary of metadata to be used in
    recognizer specific cases, for example specific recognized context words
    and recognizer name
    """

    # Keys for recognizer metadata
    RECOGNIZER_NAME_KEY = "recognizer_name"
    RECOGNIZER_IDENTIFIER_KEY = "recognizer_identifier"

    # Key of a flag inside recognition_metadata dictionary
    # which is set to true if the result enhanced by context
    IS_SCORE_ENHANCED_BY_CONTEXT_KEY = "is_score_enhanced_by_context"

    logger = logging.getLogger("presidio-analyzer")

    def __init__(
        self,
        entity_type: str,
        start: int,
        end: int,
        score: float,
        analysis_explanation: AnalysisExplanation = None,
        recognition_metadata: Dict = None,
    ):
        self.entity_type = entity_type
        self.start = start
        self.end = end
        self.score = score
        self.analysis_explanation = analysis_explanation

        if not recognition_metadata:
            self.logger.debug(
                "recognition_metadata should be passed, "
                "containing a recognizer_name value"
            )

        self.recognition_metadata = recognition_metadata

    def append_analysis_explanation_text(self, text: str) -> None:
        """Add text to the analysis explanation."""
        if self.analysis_explanation:
            self.analysis_explanation.append_textual_explanation_line(text)

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return self.__dict__

    @classmethod
    def from_json(cls, data: Dict) -> "RecognizerResult":
        """
        Create RecognizerResult from json.

        :param data: e.g. {
            "start": 24,
            "end": 32,
            "score": 0.8,
            "entity_type": "NAME"
        }
        :return: RecognizerResult
        """
        score = data.get("score")
        entity_type = data.get("entity_type")
        start = data.get("start")
        end = data.get("end")
        return cls(entity_type, start, end, score)

    def __repr__(self) -> str:
        """Return a string representation of the instance."""
        return self.__str__()

    def intersects(self, other: "RecognizerResult") -> int:
        """
        Check if self intersects with a different RecognizerResult.

        :return: If intersecting, returns the number of
        intersecting characters.
        If not, returns 0
        """
        # if they do not overlap the intersection is 0
        if self.end < other.start or other.end < self.start:
            return 0

        # otherwise the intersection is min(end) - max(start)
        return min(self.end, other.end) - max(self.start, other.start)

    def contained_in(self, other: "RecognizerResult") -> bool:
        """
        Check if self is contained in a different RecognizerResult.

        :return: true if contained
        """
        return self.start >= other.start and self.end <= other.end

    def contains(self, other: "RecognizerResult") -> bool:
        """
        Check if one result is contained or equal to another result.

        :param other: another RecognizerResult
        :return: bool
        """
        return self.start <= other.start and self.end >= other.end

    def equal_indices(self, other: "RecognizerResult") -> bool:
        """
        Check if the indices are equal between two results.

        :param other: another RecognizerResult
        :return:
        """
        return self.start == other.start and self.end == other.end

    def __gt__(self, other: "RecognizerResult") -> bool:
        """
        Check if one result is greater by using the results indices in the text.

        :param other: another RecognizerResult
        :return: bool
        """
        if self.start == other.start:
            return self.end > other.end
        return self.start > other.start

    def __eq__(self, other: "RecognizerResult") -> bool:
        """
        Check two results are equal by using all class fields.

        :param other: another RecognizerResult
        :return: bool
        """
        equal_type = self.entity_type == other.entity_type
        equal_score = self.score == other.score
        return self.equal_indices(other) and equal_type and equal_score

    def __hash__(self):
        """
        Hash the result data by using all class fields.

        :return: int
        """
        return hash(
            f"{str(self.start)} {str(self.end)} {str(self.score)} {self.entity_type}"
        )

    def __str__(self) -> str:
        """Return a string representation of the instance."""
        return (
            f"type: {self.entity_type}, "
            f"start: {self.start}, "
            f"end: {self.end}, "
            f"score: {self.score}"
        )

    def has_conflict(self, other: "RecognizerResult") -> bool:
        """
        Check if two recognizer results are conflicted or not.

        I have a conflict if:
        1. My indices are the same as the other and my score is lower.
        2. If my indices are contained in another.

        :param other: RecognizerResult
        :return:
        """
        if self.equal_indices(other):
            return self.score <= other.score
        return other.contains(self)
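
The overlap and conflict rules above can be exercised with a toy stand-in. The field names and comparisons mirror the source listing; `Span` itself is an illustrative sketch, not the `presidio_analyzer` class:

```python
from dataclasses import dataclass

# Toy stand-in mirroring the overlap rules of RecognizerResult above
# (illustrative sketch, not the presidio_analyzer class itself).
@dataclass
class Span:
    start: int
    end: int
    score: float = 0.0

    def intersects(self, other: "Span") -> int:
        # No overlap -> 0; otherwise min(end) - max(start)
        if self.end < other.start or other.end < self.start:
            return 0
        return min(self.end, other.end) - max(self.start, other.start)

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end

    def contains(self, other: "Span") -> bool:
        return self.start <= other.start and self.end >= other.end

    def equal_indices(self, other: "Span") -> bool:
        return self.start == other.start and self.end == other.end

    def has_conflict(self, other: "Span") -> bool:
        # Same indices: the result with the lower (or equal) score loses;
        # otherwise there is a conflict only if this span is contained in the other.
        if self.equal_indices(other):
            return self.score <= other.score
        return other.contains(self)

print(Span(0, 10).intersects(Span(5, 15)))   # 5 overlapping characters
print(Span(3, 7).contained_in(Span(0, 10)))  # True
```

Note that `has_conflict` is asymmetric by design: a contained result conflicts with its container, but not vice versa, which is how overlapping detections are pruned down to the strongest ones.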

append_analysis_explanation_text

append_analysis_explanation_text(text: str) -> None

Add text to the analysis explanation.

Source code in presidio_analyzer/recognizer_result.py
def append_analysis_explanation_text(self, text: str) -> None:
    """Add text to the analysis explanation."""
    if self.analysis_explanation:
        self.analysis_explanation.append_textual_explanation_line(text)

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/recognizer_result.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return self.__dict__

from_json classmethod

from_json(data: Dict) -> RecognizerResult

Create RecognizerResult from json.

PARAMETER DESCRIPTION
data

e.g. { "start": 24, "end": 32, "score": 0.8, "entity_type": "NAME" }

TYPE: Dict

RETURNS DESCRIPTION
RecognizerResult

RecognizerResult

Source code in presidio_analyzer/recognizer_result.py
@classmethod
def from_json(cls, data: Dict) -> "RecognizerResult":
    """
    Create RecognizerResult from json.

    :param data: e.g. {
        "start": 24,
        "end": 32,
        "score": 0.8,
        "entity_type": "NAME"
    }
    :return: RecognizerResult
    """
    score = data.get("score")
    entity_type = data.get("entity_type")
    start = data.get("start")
    end = data.get("end")
    return cls(entity_type, start, end, score)

intersects

intersects(other: RecognizerResult) -> int

Check if self intersects with a different RecognizerResult.

RETURNS DESCRIPTION
int

If intersecting, returns the number of intersecting characters. If not, returns 0

Source code in presidio_analyzer/recognizer_result.py
def intersects(self, other: "RecognizerResult") -> int:
    """
    Check if self intersects with a different RecognizerResult.

    :return: If intersecting, returns the number of
    intersecting characters.
    If not, returns 0
    """
    # if they do not overlap the intersection is 0
    if self.end < other.start or other.end < self.start:
        return 0

    # otherwise the intersection is min(end) - max(start)
    return min(self.end, other.end) - max(self.start, other.start)

contained_in

contained_in(other: RecognizerResult) -> bool

Check if self is contained in a different RecognizerResult.

RETURNS DESCRIPTION
bool

true if contained

Source code in presidio_analyzer/recognizer_result.py
def contained_in(self, other: "RecognizerResult") -> bool:
    """
    Check if self is contained in a different RecognizerResult.

    :return: true if contained
    """
    return self.start >= other.start and self.end <= other.end

contains

contains(other: RecognizerResult) -> bool

Check if one result is contained or equal to another result.

PARAMETER DESCRIPTION
other

another RecognizerResult

TYPE: RecognizerResult

RETURNS DESCRIPTION
bool

bool

Source code in presidio_analyzer/recognizer_result.py
def contains(self, other: "RecognizerResult") -> bool:
    """
    Check if one result is contained or equal to another result.

    :param other: another RecognizerResult
    :return: bool
    """
    return self.start <= other.start and self.end >= other.end

equal_indices

equal_indices(other: RecognizerResult) -> bool

Check if the indices are equal between two results.

PARAMETER DESCRIPTION
other

another RecognizerResult

TYPE: RecognizerResult

RETURNS DESCRIPTION
bool
Source code in presidio_analyzer/recognizer_result.py
def equal_indices(self, other: "RecognizerResult") -> bool:
    """
    Check if the indices are equal between two results.

    :param other: another RecognizerResult
    :return:
    """
    return self.start == other.start and self.end == other.end

has_conflict

has_conflict(other: RecognizerResult) -> bool

Check if two recognizer results are conflicted or not.

A result conflicts with another if: 1. its indices are the same as the other's and its score is lower or equal, or 2. its indices are contained within the other's.

PARAMETER DESCRIPTION
other

RecognizerResult

TYPE: RecognizerResult

RETURNS DESCRIPTION
bool
Source code in presidio_analyzer/recognizer_result.py
def has_conflict(self, other: "RecognizerResult") -> bool:
    """
    Check if two recognizer results are conflicted or not.

    I have a conflict if:
    1. My indices are the same as the other and my score is lower.
    2. If my indices are contained in another.

    :param other: RecognizerResult
    :return:
    """
    if self.equal_indices(other):
        return self.score <= other.score
    return other.contains(self)

Batch modules

presidio_analyzer.batch_analyzer_engine.BatchAnalyzerEngine

Batch analysis of documents (tables, lists, dicts).

Wrapper class to run Presidio Analyzer Engine on multiple values, either lists/iterators of strings, or dictionaries.

PARAMETER DESCRIPTION
analyzer_engine

AnalyzerEngine instance to use for handling the values in those collections.

TYPE: Optional[AnalyzerEngine] DEFAULT: None

METHOD DESCRIPTION
analyze_iterator

Analyze an iterable of strings.

analyze_dict

Analyze a dictionary of keys (strings) and values/iterable of values.

Source code in presidio_analyzer/batch_analyzer_engine.py
class BatchAnalyzerEngine:
    """
    Batch analysis of documents (tables, lists, dicts).

    Wrapper class to run Presidio Analyzer Engine on multiple values,
    either lists/iterators of strings, or dictionaries.

    :param analyzer_engine: AnalyzerEngine instance to use
    for handling the values in those collections.
    """

    def __init__(self, analyzer_engine: Optional[AnalyzerEngine] = None):
        self.analyzer_engine = analyzer_engine
        if not analyzer_engine:
            self.analyzer_engine = AnalyzerEngine()

    def analyze_iterator(
        self,
        texts: Iterable[Union[str, bool, float, int]],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs,
    ) -> List[List[RecognizerResult]]:
        """
        Analyze an iterable of strings.

        :param texts: An list containing strings to be analyzed.
        :param language: Input language
        :param batch_size: Batch size to process in a single iteration
        :param n_process: Number of processors to use. Defaults to `1`
        :param kwargs: Additional parameters for the `AnalyzerEngine.analyze` method.
        (default value depends on the nlp engine implementation)
        """

        # validate types
        texts = self._validate_types(texts)

        # Process the texts as batch for improved performance
        nlp_artifacts_batch: Iterator[Tuple[str, NlpArtifacts]] = (
            self.analyzer_engine.nlp_engine.process_batch(
                texts=texts,
                language=language,
                batch_size=batch_size,
                n_process=n_process,
            )
        )

        list_results = []
        for text, nlp_artifacts in nlp_artifacts_batch:
            results = self.analyzer_engine.analyze(
                text=str(text), nlp_artifacts=nlp_artifacts, language=language, **kwargs
            )

            list_results.append(results)

        return list_results

    def analyze_dict(
        self,
        input_dict: Dict[str, Union[Any, Iterable[Any]]],
        language: str,
        keys_to_skip: Optional[List[str]] = None,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs,
    ) -> Iterator[DictAnalyzerResult]:
        """
        Analyze a dictionary of keys (strings) and values/iterable of values.

        Non-string values are returned as is.

        :param input_dict: The input dictionary for analysis
        :param language: Input language
        :param keys_to_skip: Keys to ignore during analysis
        :param batch_size: Batch size to process in a single iteration
        :param n_process: Number of processors to use. Defaults to `1`

        :param kwargs: Additional keyword arguments
        for the `AnalyzerEngine.analyze` method.
        Use this to pass arguments to the analyze method,
        such as `ad_hoc_recognizers`, `context`, `return_decision_process`.
        See `AnalyzerEngine.analyze` for the full list.
        """

        context = []
        if "context" in kwargs:
            context = kwargs["context"]
            del kwargs["context"]

        if not keys_to_skip:
            keys_to_skip = []

        for key, value in input_dict.items():
            if not value or key in keys_to_skip:
                yield DictAnalyzerResult(key=key, value=value, recognizer_results=[])
                continue  # skip this key as requested

            # Add the key as an additional context
            specific_context = context[:]
            specific_context.append(key)

            if type(value) in (str, int, bool, float):
                results: List[RecognizerResult] = self.analyzer_engine.analyze(
                    text=str(value), language=language, context=[key], **kwargs
                )
            elif isinstance(value, dict):
                new_keys_to_skip = self._get_nested_keys_to_skip(key, keys_to_skip)
                results = self.analyze_dict(
                    input_dict=value,
                    language=language,
                    context=specific_context,
                    keys_to_skip=new_keys_to_skip,
                    **kwargs,
                )
            elif isinstance(value, Iterable):
                # Recursively iterate nested dicts

                results: List[List[RecognizerResult]] = self.analyze_iterator(
                    texts=value,
                    language=language,
                    context=specific_context,
                    n_process=n_process,
                    batch_size=batch_size,
                    **kwargs,
                )
            else:
                raise ValueError(f"type {type(value)} is unsupported.")

            yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)

    @staticmethod
    def _validate_types(value_iterator: Iterable[Any]) -> Iterator[Any]:
        for val in value_iterator:
            if val and type(val) not in (int, float, bool, str):
                err_msg = (
                    "Analyzer.analyze_iterator only works "
                    "on primitive types (int, float, bool, str). "
                    "Lists of objects are not yet supported."
                )
                logger.error(err_msg)
                raise ValueError(err_msg)
            yield val

    @staticmethod
    def _get_nested_keys_to_skip(key, keys_to_skip):
        new_keys_to_skip = [
            k.replace(f"{key}.", "") for k in keys_to_skip if k.startswith(key)
        ]
        return new_keys_to_skip
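
The `_get_nested_keys_to_skip` helper above implies a dotted-path convention for `keys_to_skip`: `"user.ssn"` means "skip the `ssn` key inside the nested `user` dictionary". A standalone sketch of that transformation:

```python
# Mirrors BatchAnalyzerEngine._get_nested_keys_to_skip (illustrative sketch):
# when recursing into the nested dict under `key`, strip the "key." prefix
# so the remaining path applies one level down.
def nested_keys_to_skip(key, keys_to_skip):
    return [k.replace(f"{key}.", "") for k in keys_to_skip if k.startswith(key)]

print(nested_keys_to_skip("user", ["user.ssn", "user.address.zip", "id"]))
# ['ssn', 'address.zip']
```

One caveat of the prefix test shown in the source: `startswith(key)` also matches sibling keys that merely share a prefix (e.g. `"username"` starts with `"user"`), so distinct top-level key names are safest.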

analyze_iterator

analyze_iterator(
    texts: Iterable[Union[str, bool, float, int]],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs
) -> List[List[RecognizerResult]]

Analyze an iterable of strings.

PARAMETER DESCRIPTION
texts

A list containing strings to be analyzed.

TYPE: Iterable[Union[str, bool, float, int]]

language

Input language

TYPE: str

batch_size

Batch size to process in a single iteration

TYPE: int DEFAULT: 1

n_process

Number of processors to use. Defaults to 1

TYPE: int DEFAULT: 1

kwargs

Additional parameters for the AnalyzerEngine.analyze method. (default value depends on the nlp engine implementation)

DEFAULT: {}

Source code in presidio_analyzer/batch_analyzer_engine.py
def analyze_iterator(
    self,
    texts: Iterable[Union[str, bool, float, int]],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs,
) -> List[List[RecognizerResult]]:
    """
    Analyze an iterable of strings.

    :param texts: A list of strings to be analyzed.
    :param language: Input language
    :param batch_size: Batch size to process in a single iteration
    :param n_process: Number of processors to use. Defaults to `1`
    :param kwargs: Additional parameters for the `AnalyzerEngine.analyze` method.
    (default value depends on the nlp engine implementation)
    """

    # validate types
    texts = self._validate_types(texts)

    # Process the texts as batch for improved performance
    nlp_artifacts_batch: Iterator[Tuple[str, NlpArtifacts]] = (
        self.analyzer_engine.nlp_engine.process_batch(
            texts=texts,
            language=language,
            batch_size=batch_size,
            n_process=n_process,
        )
    )

    list_results = []
    for text, nlp_artifacts in nlp_artifacts_batch:
        results = self.analyzer_engine.analyze(
            text=str(text), nlp_artifacts=nlp_artifacts, language=language, **kwargs
        )

        list_results.append(results)

    return list_results

analyze_dict

analyze_dict(
    input_dict: Dict[str, Union[Any, Iterable[Any]]],
    language: str,
    keys_to_skip: Optional[List[str]] = None,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs
) -> Iterator[DictAnalyzerResult]

Analyze a dictionary whose keys are strings and whose values are single values or iterables of values.

Non-string values are returned as is.

PARAMETER DESCRIPTION
input_dict

The input dictionary for analysis

TYPE: Dict[str, Union[Any, Iterable[Any]]]

language

Input language

TYPE: str

keys_to_skip

Keys to ignore during analysis

TYPE: Optional[List[str]] DEFAULT: None

batch_size

Batch size to process in a single iteration

TYPE: int DEFAULT: 1

n_process

Number of processors to use. Defaults to 1

TYPE: int DEFAULT: 1

kwargs

Additional keyword arguments for the AnalyzerEngine.analyze method. Use this to pass arguments to the analyze method, such as ad_hoc_recognizers, context, return_decision_process. See AnalyzerEngine.analyze for the full list.

DEFAULT: {}
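`analyze_dict` routes each value by type: primitives go to `AnalyzerEngine.analyze`, nested dictionaries recurse into `analyze_dict`, and other iterables go to `analyze_iterator`. A minimal sketch of that dispatch (not the real implementation, which also threads context and keys_to_skip through each branch):

```python
from collections.abc import Iterable

def dispatch(value):
    # Mirror analyze_dict's type routing
    if type(value) in (str, int, bool, float):
        return "analyze"           # single primitive value
    if isinstance(value, dict):
        return "analyze_dict"      # nested dictionary, handled recursively
    if isinstance(value, Iterable):
        return "analyze_iterator"  # list/tuple of primitive values
    raise ValueError(f"type {type(value)} is unsupported.")

dispatch("John Smith")   # "analyze"
dispatch({"name": "x"})  # "analyze_dict"
dispatch(["a", "b"])     # "analyze_iterator"
```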

Source code in presidio_analyzer/batch_analyzer_engine.py
def analyze_dict(
    self,
    input_dict: Dict[str, Union[Any, Iterable[Any]]],
    language: str,
    keys_to_skip: Optional[List[str]] = None,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs,
) -> Iterator[DictAnalyzerResult]:
    """
    Analyze a dictionary of keys (strings) and values/iterable of values.

    Non-string values are returned as is.

    :param input_dict: The input dictionary for analysis
    :param language: Input language
    :param keys_to_skip: Keys to ignore during analysis
    :param batch_size: Batch size to process in a single iteration
    :param n_process: Number of processors to use. Defaults to `1`

    :param kwargs: Additional keyword arguments
    for the `AnalyzerEngine.analyze` method.
    Use this to pass arguments to the analyze method,
    such as `ad_hoc_recognizers`, `context`, `return_decision_process`.
    See `AnalyzerEngine.analyze` for the full list.
    """

    context = []
    if "context" in kwargs:
        context = kwargs["context"]
        del kwargs["context"]

    if not keys_to_skip:
        keys_to_skip = []

    for key, value in input_dict.items():
        if not value or key in keys_to_skip:
            yield DictAnalyzerResult(key=key, value=value, recognizer_results=[])
            continue  # skip this key as requested

        # Add the key as an additional context
        specific_context = context[:]
        specific_context.append(key)

        if type(value) in (str, int, bool, float):
            results: List[RecognizerResult] = self.analyzer_engine.analyze(
                text=str(value), language=language, context=[key], **kwargs
            )
        elif isinstance(value, dict):
            new_keys_to_skip = self._get_nested_keys_to_skip(key, keys_to_skip)
            results = self.analyze_dict(
                input_dict=value,
                language=language,
                context=specific_context,
                keys_to_skip=new_keys_to_skip,
                **kwargs,
            )
        elif isinstance(value, Iterable):
        # Analyze other iterables (e.g. lists) as a batch of texts

            results: List[List[RecognizerResult]] = self.analyze_iterator(
                texts=value,
                language=language,
                context=specific_context,
                n_process=n_process,
                batch_size=batch_size,
                **kwargs,
            )
        else:
            raise ValueError(f"type {type(value)} is unsupported.")

        yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)

presidio_analyzer.dict_analyzer_result.DictAnalyzerResult dataclass

Data class for holding the output of the Presidio Analyzer on dictionaries.

PARAMETER DESCRIPTION
key

key in dictionary

TYPE: str

value

value to run analysis on (either string or list of strings)

TYPE: Union[str, List[str], dict]

recognizer_results

Analyzer output for one value. Could be either:
- A list of recognizer results, if the input is one string
- A list of lists of recognizer results, if the input is a list of strings
- An iterator of DictAnalyzerResult, if the input is a dictionary; in this case recognizer_results is the iterator over the DictAnalyzerResults of the next level in the dictionary

TYPE: Union[List[RecognizerResult], List[List[RecognizerResult]], Iterator[DictAnalyzerResult]]

Source code in presidio_analyzer/dict_analyzer_result.py
@dataclass
class DictAnalyzerResult:
    """
    Data class for holding the output of the Presidio Analyzer on dictionaries.

    :param key: key in dictionary
    :param value: value to run analysis on (either string or list of strings)
    :param recognizer_results: Analyzer output for one value.
    Could be either:
     - A list of recognizer results if the input is one string
     - A list of lists of recognizer results, if the input is a list of strings.
     - An iterator of a DictAnalyzerResult, if the input is a dictionary.
     In this case the recognizer_results would be the iterator
     of the DictAnalyzerResults next level in the dictionary.
    """

    key: str
    value: Union[str, List[str], dict]
    recognizer_results: Union[
        List[RecognizerResult],
        List[List[RecognizerResult]],
        Iterator["DictAnalyzerResult"],
    ]

Recognizers and patterns

presidio_analyzer.entity_recognizer.EntityRecognizer

A class representing an abstract PII entity recognizer.

EntityRecognizer is an abstract class to be inherited by Recognizers which hold the logic for recognizing specific PII entities.

EntityRecognizer exposes a method called enhance_using_context which can be overridden in case a custom context aware enhancement is needed in derived class of a recognizer.

PARAMETER DESCRIPTION
supported_entities

the entities supported by this recognizer (for example, phone number, address, etc.)

TYPE: List[str]

supported_language

the language supported by this recognizer. The supported language code is in ISO 639-1 format

TYPE: str DEFAULT: 'en'

name

the name of this recognizer (optional)

TYPE: str DEFAULT: None

version

the recognizer current version

TYPE: str DEFAULT: '0.0.1'

context

a list of words which can help boost confidence score when they appear in context of the matched entity

TYPE: Optional[List[str]] DEFAULT: None

METHOD DESCRIPTION
load

Initialize the recognizer assets if needed.

analyze

Analyze text to identify entities.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.
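A custom recognizer implements `load` and `analyze` on top of this base class. The following is a toy illustration of that shape using a stand-in `Result` class instead of the real `RecognizerResult`, and without inheriting from `EntityRecognizer` (hypothetical names; not Presidio's built-in zip-code support):

```python
import re
from dataclasses import dataclass

@dataclass
class Result:  # stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

class ZipCodeRecognizer:
    """Illustrates the load/analyze shape of an EntityRecognizer subclass."""

    def __init__(self):
        self.supported_entities = ["US_ZIP_CODE"]
        self.supported_language = "en"
        self.load()

    def load(self):
        # Compile assets once, mirroring EntityRecognizer.load()
        self.pattern = re.compile(r"\b\d{5}\b")

    def analyze(self, text, entities, nlp_artifacts=None):
        if "US_ZIP_CODE" not in entities:
            return []
        return [
            Result("US_ZIP_CODE", m.start(), m.end(), 0.4)
            for m in self.pattern.finditer(text)
        ]

ZipCodeRecognizer().analyze("Ship to 98052, WA", ["US_ZIP_CODE"])
# one Result covering characters 8..13
```

A real recognizer would subclass `EntityRecognizer` (or `PatternRecognizer` for regex-based detection) so the registry and context enhancement apply.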

Source code in presidio_analyzer/entity_recognizer.py
class EntityRecognizer:
    """
    A class representing an abstract PII entity recognizer.

    EntityRecognizer is an abstract class to be inherited by
    Recognizers which hold the logic for recognizing specific PII entities.

    EntityRecognizer exposes a method called enhance_using_context which
    can be overridden in case a custom context aware enhancement is needed
    in derived class of a recognizer.

    :param supported_entities: the entities supported by this recognizer
    (for example, phone number, address, etc.)
    :param supported_language: the language supported by this recognizer.
    The supported language code is in ISO 639-1 format
    :param name: the name of this recognizer (optional)
    :param version: the recognizer current version
    :param context: a list of words which can help boost confidence score
    when they appear in context of the matched entity
    """

    MIN_SCORE = 0
    MAX_SCORE = 1.0

    def __init__(
        self,
        supported_entities: List[str],
        name: str = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
    ):
        self.supported_entities = supported_entities

        if name is None:
            self.name = self.__class__.__name__  # assign class name as name
        else:
            self.name = name

        self._id = f"{self.name}_{id(self)}"

        self.supported_language = supported_language
        self.version = version
        self.is_loaded = False
        self.context = context if context else []

        self.load()
        logger.info("Loaded recognizer: %s", self.name)
        self.is_loaded = True

    @property
    def id(self):
        """Return a unique identifier of this recognizer."""

        return self._id

    @abstractmethod
    def load(self) -> None:
        """
        Initialize the recognizer assets if needed.

        (e.g. machine learning models)
        """

    @abstractmethod
    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyze text to identify entities.

        :param text: The text to be analyzed
        :param entities: The list of entities this recognizer is able to detect
        :param nlp_artifacts: A group of attributes which are the result of
        an NLP process over the input text.
        :return: List of results detected by this recognizer.
        """
        return None

    def enhance_using_context(
        self,
        text: str,
        raw_recognizer_results: List[RecognizerResult],
        other_raw_recognizer_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """Enhance confidence score using context of the entity.

        Override this method in derived class in case a custom logic
        is needed, otherwise return value will be equal to
        raw_results.

        in case a result score is boosted, derived class need to update
        result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

        :param text: The actual text that was analyzed
        :param raw_recognizer_results: This recognizer's results, to be updated
        based on recognizer specific context.
        :param other_raw_recognizer_results: Other recognizer results matched in
        the given text to allow related entity context enhancement
        :param nlp_artifacts: The nlp artifacts contains elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param context: list of context words
        """
        return raw_recognizer_results

    def get_supported_entities(self) -> List[str]:
        """
        Return the list of entities this recognizer can identify.

        :return: A list of the supported entities by this recognizer
        """
        return self.supported_entities

    def get_supported_language(self) -> str:
        """
        Return the language this recognizer can support.

        :return: The language supported by this recognizer
        """
        return self.supported_language

    def get_version(self) -> str:
        """
        Return the version of this recognizer.

        :return: The current version of this recognizer
        """
        return self.version

    def to_dict(self) -> Dict:
        """
        Serialize self to dictionary.

        :return: a dictionary
        """
        return_dict = {
            "supported_entities": self.supported_entities,
            "supported_language": self.supported_language,
            "name": self.name,
            "version": self.version,
        }
        return return_dict

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
        """
        Create EntityRecognizer from a dict input.

        :param entity_recognizer_dict: Dict containing keys and values for instantiation
        """
        return cls(**entity_recognizer_dict)

    @staticmethod
    def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
        """
        Remove duplicate results.

        Remove duplicates when two results
        have identical start, end, and entity type.
        :param results: List[RecognizerResult]
        :return: List[RecognizerResult]
        """
        results = list(set(results))
        results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
        filtered_results = []

        for result in results:
            if result.score == 0:
                continue

            to_keep = result not in filtered_results  # equals based comparison
            if to_keep:
                for filtered in filtered_results:
                    # If result is contained in one of the other results
                    if (
                        result.contained_in(filtered)
                        and result.entity_type == filtered.entity_type
                    ):
                        to_keep = False
                        break

            if to_keep:
                filtered_results.append(result)

        return filtered_results

    @staticmethod
    def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
        """
        Cleanse the input string of the replacement pairs specified as argument.

        :param text: input string
        :param replacement_pairs: pairs of what has to be replaced with which value
        :return: cleansed string
        """
        for search_string, replacement_string in replacement_pairs:
            text = text.replace(search_string, replacement_string)
        return text

id property

id

Return a unique identifier of this recognizer.

load abstractmethod

load() -> None

Initialize the recognizer assets if needed.

(e.g. machine learning models)

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def load(self) -> None:
    """
    Initialize the recognizer assets if needed.

    (e.g. machine learning models)
    """

analyze abstractmethod

analyze(
    text: str, entities: List[str], nlp_artifacts: NlpArtifacts
) -> List[RecognizerResult]

Analyze text to identify entities.

PARAMETER DESCRIPTION
text

The text to be analyzed

TYPE: str

entities

The list of entities this recognizer is able to detect

TYPE: List[str]

nlp_artifacts

A group of attributes which are the result of an NLP process over the input text.

TYPE: NlpArtifacts

RETURNS DESCRIPTION
List[RecognizerResult]

List of results detected by this recognizer.

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def analyze(
    self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
) -> List[RecognizerResult]:
    """
    Analyze text to identify entities.

    :param text: The text to be analyzed
    :param entities: The list of entities this recognizer is able to detect
    :param nlp_artifacts: A group of attributes which are the result of
    an NLP process over the input text.
    :return: List of results detected by this recognizer.
    """
    return None

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom context-aware logic is needed; otherwise the return value is equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None
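The default enhancer (`LemmaContextAwareEnhancer`) boosts a result's score when a context word appears near the matched entity. A naive sketch of the idea, assuming already-tokenized text and a fixed look-behind window (the real enhancer works on lemmas from `nlp_artifacts`, and the window/boost values here are made up):

```python
def boost_score(tokens, entity_token_index, score, context_words, window=5, boost=0.35):
    # Look at the tokens preceding the entity; if any is a context word, boost
    start = max(0, entity_token_index - window)
    preceding = tokens[start:entity_token_index]
    if any(tok.lower() in context_words for tok in preceding):
        return min(1.0, score + boost)
    return score

tokens = ["my", "phone", "number", "is", "212-555-5555"]
boost_score(tokens, 4, 0.4, {"phone", "telephone"})
# approximately 0.75 ("phone" appears within the window)
```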

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]
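The deduplication keeps the highest-scoring result and drops any same-type result fully contained in one already kept. A self-contained sketch with a stand-in `Span` class (the real method operates on `RecognizerResult`, whose `contained_in` does the same span check):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:  # stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other):
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    # Highest score first; longer spans win ties, mirroring the real sort key
    results = sorted(set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    kept = []
    for r in results:
        if r.score == 0:
            continue
        if not any(r.contained_in(k) and r.entity_type == k.entity_type for k in kept):
            kept.append(r)
    return kept

spans = [Span("PERSON", 0, 10, 0.9), Span("PERSON", 2, 8, 0.5), Span("PHONE", 2, 8, 0.5)]
remove_duplicates(spans)
# keeps the full PERSON span and the PHONE span; drops the contained PERSON span
```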

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string
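Pattern-based recognizers typically use this to normalize a candidate string before validation, e.g. stripping separators from a phone number. A quick standalone illustration:

```python
def sanitize_value(text, replacement_pairs):
    # Apply each (search, replacement) pair in order
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

sanitize_value("(212) 555-5555", [("(", ""), (")", ""), ("-", ""), (" ", "")])
# "2125555555"
```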

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

presidio_analyzer.local_recognizer.LocalRecognizer

Bases: ABC, EntityRecognizer

PII entity recognizer which runs on the same process as the AnalyzerEngine.

METHOD DESCRIPTION
load

Initialize the recognizer assets if needed.

analyze

Analyze text to identify entities.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

Source code in presidio_analyzer/local_recognizer.py
class LocalRecognizer(ABC, EntityRecognizer):
    """PII entity recognizer which runs on the same process as the AnalyzerEngine."""

id property

id

Return a unique identifier of this recognizer.

load abstractmethod

load() -> None

Initialize the recognizer assets if needed.

(e.g. machine learning models)

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def load(self) -> None:
    """
    Initialize the recognizer assets if needed.

    (e.g. machine learning models)
    """

analyze abstractmethod

analyze(
    text: str, entities: List[str], nlp_artifacts: NlpArtifacts
) -> List[RecognizerResult]

Analyze text to identify entities.

PARAMETER DESCRIPTION
text

The text to be analyzed

TYPE: str

entities

The list of entities this recognizer is able to detect

TYPE: List[str]

nlp_artifacts

A group of attributes which are the result of an NLP process over the input text.

TYPE: NlpArtifacts

RETURNS DESCRIPTION
List[RecognizerResult]

List of results detected by this recognizer.

Source code in presidio_analyzer/entity_recognizer.py
@abstractmethod
def analyze(
    self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
) -> List[RecognizerResult]:
    """
    Analyze text to identify entities.

    :param text: The text to be analyzed
    :param entities: The list of entities this recognizer is able to detect
    :param nlp_artifacts: A group of attributes which are the result of
    an NLP process over the input text.
    :return: List of results detected by this recognizer.
    """
    return None

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom context-aware logic is needed; otherwise the return value is equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
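To see the containment filtering in action, the logic above can be exercised with a minimal stand-in for `RecognizerResult` (the real class lives in `presidio_analyzer`; this sketch only reproduces the fields and the `contained_in` check the method relies on):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:  # minimal stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    # Same ordering as the source above: highest score first, then earliest, then longest
    results = sorted(set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for r in results:
        if r.score == 0:
            continue
        # Drop results contained in an already-kept result of the same type
        if any(r.contained_in(f) and r.entity_type == f.entity_type for f in filtered):
            continue
        filtered.append(r)
    return filtered

results = [
    Result("PHONE_NUMBER", 10, 22, 0.75),
    Result("PHONE_NUMBER", 13, 22, 0.6),   # contained in the first -> dropped
    Result("EMAIL_ADDRESS", 30, 45, 0.0),  # zero score -> dropped
]
print(remove_duplicates(results))
```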

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
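The loop above is simple enough to try directly. For instance, a checksum recognizer might strip dashes and spaces from a candidate number before validating it (standalone copy of the method body, with a made-up input):

```python
def sanitize_value(text: str, replacement_pairs) -> str:
    # Apply each (search, replacement) pair in order
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

cleansed = sanitize_value("4095-2609 9393-4932", [("-", ""), (" ", "")])
print(cleansed)  # "4095260993934932"
```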

presidio_analyzer.pattern.Pattern

A class that represents a regex pattern.

PARAMETER DESCRIPTION
name

the name of the pattern

TYPE: str

regex

the regex pattern to detect

TYPE: str

score

the pattern's strength (values vary between 0 and 1)

TYPE: float

METHOD DESCRIPTION
to_dict

Turn this instance into a dictionary.

from_dict

Load an instance from a dictionary.

Source code in presidio_analyzer/pattern.py
class Pattern:
    """
    A class that represents a regex pattern.

    :param name: the name of the pattern
    :param regex: the regex pattern to detect
    :param score: the pattern's strength (values vary between 0 and 1)
    """

    def __init__(self, name: str, regex: str, score: float):
        self.name = name
        self.regex = regex
        self.score = score
        self.compiled_regex = None
        self.compiled_with_flags = None

    def to_dict(self) -> Dict:
        """
        Turn this instance into a dictionary.

        :return: a dictionary
        """
        return_dict = {"name": self.name, "score": self.score, "regex": self.regex}
        return return_dict

    @classmethod
    def from_dict(cls, pattern_dict: Dict) -> "Pattern":
        """
        Load an instance from a dictionary.

        :param pattern_dict: a dictionary holding the pattern's parameters
        :return: a Pattern instance
        """
        return cls(**pattern_dict)

    def __repr__(self):
        """Return string representation of instance."""
        return json.dumps(self.to_dict())

    def __str__(self):
        """Return string representation of instance."""
        return json.dumps(self.to_dict())

to_dict

to_dict() -> Dict

Turn this instance into a dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/pattern.py
def to_dict(self) -> Dict:
    """
    Turn this instance into a dictionary.

    :return: a dictionary
    """
    return_dict = {"name": self.name, "score": self.score, "regex": self.regex}
    return return_dict

from_dict classmethod

from_dict(pattern_dict: Dict) -> Pattern

Load an instance from a dictionary.

PARAMETER DESCRIPTION
pattern_dict

a dictionary holding the pattern's parameters

TYPE: Dict

RETURNS DESCRIPTION
Pattern

a Pattern instance

Source code in presidio_analyzer/pattern.py
@classmethod
def from_dict(cls, pattern_dict: Dict) -> "Pattern":
    """
    Load an instance from a dictionary.

    :param pattern_dict: a dictionary holding the pattern's parameters
    :return: a Pattern instance
    """
    return cls(**pattern_dict)
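A condensed, standalone copy of the class above shows the to_dict/from_dict round trip (the real class is `presidio_analyzer.Pattern`; the `zip_code` pattern is illustrative):

```python
class Pattern:
    """Condensed from the source shown above."""

    def __init__(self, name: str, regex: str, score: float):
        self.name, self.regex, self.score = name, regex, score

    def to_dict(self) -> dict:
        return {"name": self.name, "score": self.score, "regex": self.regex}

    @classmethod
    def from_dict(cls, pattern_dict: dict) -> "Pattern":
        return cls(**pattern_dict)

p = Pattern(name="zip_code", regex=r"\b\d{5}\b", score=0.4)
restored = Pattern.from_dict(p.to_dict())
print(restored.to_dict())
```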

presidio_analyzer.pattern_recognizer.PatternRecognizer

Bases: LocalRecognizer

PII entity recognizer using regular expressions or deny-lists.

PARAMETER DESCRIPTION
patterns

A list of patterns to detect

TYPE: List[Pattern] DEFAULT: None

deny_list

A list of words to detect, in case the recognizer uses a predefined list of words (deny list)

TYPE: List[str] DEFAULT: None

context

list of context words

TYPE: List[str] DEFAULT: None

deny_list_score

confidence score for a term identified using a deny-list

TYPE: float DEFAULT: 1.0

global_regex_flags

regex flags to be used in regex matching, including deny-lists.

TYPE: Optional[int] DEFAULT: DOTALL | MULTILINE | IGNORECASE

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

analyze

Analyzes text to detect PII using regular expressions or deny-lists.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
class PatternRecognizer(LocalRecognizer):
    """
    PII entity recognizer using regular expressions or deny-lists.

    :param patterns: A list of patterns to detect
    :param deny_list: A list of words to detect,
    in case our recognizer uses a predefined list of words (deny list)
    :param context: list of context words
    :param deny_list_score: confidence score for a term
    identified using a deny-list
    :param global_regex_flags: regex flags to be used in regex matching,
    including deny-lists.
    """

    def __init__(
        self,
        supported_entity: str,
        name: str = None,
        supported_language: str = "en",
        patterns: List[Pattern] = None,
        deny_list: List[str] = None,
        context: List[str] = None,
        deny_list_score: float = 1.0,
        global_regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        version: str = "0.0.1",
    ):
        if not supported_entity:
            raise ValueError("Pattern recognizer should be initialized with entity")

        if not patterns and not deny_list:
            raise ValueError(
                "Pattern recognizer should be initialized with patterns"
                " or with deny list"
            )

        super().__init__(
            supported_entities=[supported_entity],
            supported_language=supported_language,
            name=name,
            version=version,
        )
        if patterns is None:
            self.patterns = []
        else:
            self.patterns = patterns
        self.context = context
        self.deny_list_score = deny_list_score
        self.global_regex_flags = global_regex_flags

        if deny_list:
            deny_list_pattern = self._deny_list_to_regex(deny_list)
            self.patterns.append(deny_list_pattern)
            self.deny_list = deny_list
        else:
            self.deny_list = []

    def load(self):  # noqa D102
        pass

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[NlpArtifacts] = None,
        regex_flags: Optional[int] = None,
    ) -> List[RecognizerResult]:
        """
        Analyzes text to detect PII using regular expressions or deny-lists.

        :param text: Text to be analyzed
        :param entities: Entities this recognizer can detect
        :param nlp_artifacts: Output values from the NLP engine
        :param regex_flags: regex flags to be used in regex matching
        :return:
        """
        results = []

        if self.patterns:
            pattern_result = self.__analyze_patterns(text, regex_flags)
            results.extend(pattern_result)

        return results

    def _deny_list_to_regex(self, deny_list: List[str]) -> Pattern:
        """
        Convert a list of words to a matching regex.

        To be analyzed by the analyze method as any other regex patterns.

        :param deny_list: the list of words to detect
        :return:the regex of the words for detection
        """

        # Escape deny list elements as preparation for regex
        escaped_deny_list = [re.escape(element) for element in deny_list]
        regex = r"(?:^|(?<=\W))(" + "|".join(escaped_deny_list) + r")(?:(?=\W)|$)"
        return Pattern(name="deny_list", regex=regex, score=self.deny_list_score)

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to be validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        return None

    def invalidate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Logic to check for result invalidation by running pruning logic.

        For example, each SSN number group should not consist of all the same digits.

        :param pattern_text: the text to be validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the result is invalidated
        """
        return None

    @staticmethod
    def build_regex_explanation(
        recognizer_name: str,
        pattern_name: str,
        pattern: str,
        original_score: float,
        validation_result: bool,
        regex_flags: int,
    ) -> AnalysisExplanation:
        """
        Construct an explanation for why this entity was detected.

        :param recognizer_name: Name of recognizer detecting the entity
        :param pattern_name: Regex pattern name which detected the entity
        :param pattern: Regex pattern logic
        :param original_score: Score given by the recognizer
        :param validation_result: Whether validation was used and its result
        :param regex_flags: Regex flags used in the regex matching
        :return: Analysis explanation
        """
        textual_explanation = (
            f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
        )

        explanation = AnalysisExplanation(
            recognizer=recognizer_name,
            original_score=original_score,
            pattern_name=pattern_name,
            pattern=pattern,
            validation_result=validation_result,
            regex_flags=regex_flags,
            textual_explanation=textual_explanation,
        )
        return explanation

    def __analyze_patterns(
        self, text: str, flags: int = None
    ) -> List[RecognizerResult]:
        """
        Evaluate all patterns in the provided text.

        Including words in the provided deny-list

        :param text: text to analyze
        :param flags: regex flags
        :return: A list of RecognizerResult
        """
        flags = flags if flags else self.global_regex_flags
        results = []
        for pattern in self.patterns:
            match_start_time = datetime.datetime.now()

            # Compile regex if flags differ from flags the regex was compiled with
            if not pattern.compiled_regex or pattern.compiled_with_flags != flags:
                pattern.compiled_with_flags = flags
                pattern.compiled_regex = re.compile(pattern.regex, flags=flags)

            matches = pattern.compiled_regex.finditer(text)
            match_time = datetime.datetime.now() - match_start_time
            logger.debug(
                "--- match_time[%s]: %.6f seconds",
                pattern.name,
                match_time.total_seconds()
            )

            for match in matches:
                start, end = match.span()
                current_match = text[start:end]

                # Skip empty results
                if current_match == "":
                    continue

                score = pattern.score

                validation_result = self.validate_result(current_match)
                description = self.build_regex_explanation(
                    self.name,
                    pattern.name,
                    pattern.regex,
                    score,
                    validation_result,
                    flags,
                )
                pattern_result = RecognizerResult(
                    entity_type=self.supported_entities[0],
                    start=start,
                    end=end,
                    score=score,
                    analysis_explanation=description,
                    recognition_metadata={
                        RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                        RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
                    },
                )

                if validation_result is not None:
                    if validation_result:
                        pattern_result.score = EntityRecognizer.MAX_SCORE
                    else:
                        pattern_result.score = EntityRecognizer.MIN_SCORE

                invalidation_result = self.invalidate_result(current_match)
                if invalidation_result is not None and invalidation_result:
                    pattern_result.score = EntityRecognizer.MIN_SCORE

                if pattern_result.score > EntityRecognizer.MIN_SCORE:
                    results.append(pattern_result)

                # Update analysis explanation score following validation or invalidation
                description.score = pattern_result.score

        results = EntityRecognizer.remove_duplicates(results)
        return results

    def to_dict(self) -> Dict:
        """Serialize instance into a dictionary."""
        return_dict = super().to_dict()

        return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
        return_dict["deny_list"] = self.deny_list
        return_dict["context"] = self.context
        return_dict["supported_entity"] = return_dict["supported_entities"][0]
        del return_dict["supported_entities"]

        return return_dict

    @classmethod
    def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
        """Create instance from a serialized dict."""
        patterns = entity_recognizer_dict.get("patterns")
        if patterns:
            patterns_list = [Pattern.from_dict(pat) for pat in patterns]
            entity_recognizer_dict["patterns"] = patterns_list

        return cls(**entity_recognizer_dict)
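The deny-list handling in `__init__` hinges on `_deny_list_to_regex`. The construction can be reproduced standalone to see what the generated pattern matches (the deny list here is a made-up example; the word-boundary wrapping mirrors the source above):

```python
import re

def deny_list_to_regex(deny_list):
    # Escape each term, then anchor it between non-word characters (or string edges)
    escaped = [re.escape(element) for element in deny_list]
    return r"(?:^|(?<=\W))(" + "|".join(escaped) + r")(?:(?=\W)|$)"

regex = deny_list_to_regex(["Mr.", "Mrs.", "Dr."])
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE  # the default global_regex_flags
matches = [m.group() for m in re.finditer(regex, "Dr. Smith met mr. Jones.", flags=flags)]
print(matches)  # ['Dr.', 'mr.']
```

Note that `re.escape` keeps the literal dots in "Mr." from matching arbitrary characters, and the `IGNORECASE` flag makes the deny list case-insensitive by default.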

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
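Under the hood, `__analyze_patterns` runs each compiled pattern over the text and records the match spans together with the pattern's score. A stripped-down sketch of that inner loop, using a hypothetical weak SSN-shaped pattern (not one of Presidio's built-in recognizers):

```python
import re

pattern = {"name": "us_ssn_weak", "regex": r"\b\d{3}-\d{2}-\d{4}\b", "score": 0.3}
text = "My SSN is 078-05-1120."
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE  # default global_regex_flags

results = [
    {"entity_type": "US_SSN", "start": m.start(), "end": m.end(), "score": pattern["score"]}
    for m in re.finditer(pattern["regex"], text, flags=flags)
    if m.group()  # skip empty matches, as the source above does
]
print(results)  # [{'entity_type': 'US_SSN', 'start': 10, 'end': 21, 'score': 0.3}]
```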

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated. Only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated. Only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
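A concrete override of `validate_result` typically runs a checksum on the matched text. For example, a credit-card recognizer could apply a Luhn check; the override below is a hypothetical, standalone sketch, not Presidio's built-in implementation:

```python
def luhn_checksum_ok(digits: str) -> bool:
    # Standard Luhn check: double every second digit from the right,
    # subtract 9 from results above 9, and require the total to be a multiple of 10
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def validate_result(pattern_text: str):
    # Mirrors the Optional[bool] contract: None means "no validation applied"
    digits = pattern_text.replace("-", "").replace(" ", "")
    return luhn_checksum_ok(digits) if digits.isdigit() else None

print(validate_result("4532-0151-1283-0366"))  # True
```

When such an override returns True, `__analyze_patterns` raises the result's score to `EntityRecognizer.MAX_SCORE`; when it returns False, the score drops to `MIN_SCORE` and the result is discarded.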

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
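Together, `to_dict` and `from_dict` give a JSON-friendly round trip. The serialized shape looks roughly like this (illustrative values; note that `supported_entities` from the base class is replaced by a singular `supported_entity` key, which is what `__init__` expects):

```python
# Illustrative serialized form of a deny-list PatternRecognizer
serialized = {
    "supported_entity": "TITLE",  # singular, unlike EntityRecognizer.to_dict
    "supported_language": "en",
    "name": "TitleRecognizer",
    "version": "0.0.1",
    "deny_list": ["Mr.", "Mrs."],
    "context": None,
    "patterns": [
        {"name": "deny_list", "score": 1.0,
         "regex": r"(?:^|(?<=\W))(Mr\.|Mrs\.)(?:(?=\W)|$)"}
    ],
}
# from_dict rebuilds each pattern dict via Pattern.from_dict,
# then passes the remaining keys to __init__.
print(sorted(serialized))
```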

presidio_analyzer.remote_recognizer.RemoteRecognizer

Bases: ABC, EntityRecognizer

A configuration for a recognizer that runs on a different process / remote machine.

PARAMETER DESCRIPTION
supported_entities

A list of entities this recognizer can identify

TYPE: List[str]

name

name of recognizer

TYPE: Optional[str]

supported_language

The language this recognizer can detect entities in

TYPE: str

version

Version of this recognizer

TYPE: str

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

analyze

Call an external service for PII detection.

Source code in presidio_analyzer/remote_recognizer.py
class RemoteRecognizer(ABC, EntityRecognizer):
    """
    A configuration for a recognizer that runs on a different process / remote machine.

    :param supported_entities: A list of entities this recognizer can identify
    :param name: name of recognizer
    :param supported_language: The language this recognizer can detect entities in
    :param version: Version of this recognizer
    """

    def __init__(
        self,
        supported_entities: List[str],
        name: Optional[str],
        supported_language: str,
        version: str,
        context: Optional[List[str]] = None,
    ):
        super().__init__(
            supported_entities=supported_entities,
            name=name,
            supported_language=supported_language,
            version=version,
            context=context,
        )

    def load(self):  # noqa D102
        pass

    @abstractmethod
    def analyze(self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts):  # noqa ANN201
        """
        Call an external service for PII detection.

        :param text: text to be analyzed
        :param entities: Entities that should be looked for
        :param nlp_artifacts: Additional metadata from the NLP engine
        :return: List of identified PII entities
        """

        # 1. Call the external service.
        # 2. Translate results into List[RecognizerResult]
        pass

    @abstractmethod
    def get_supported_entities(self) -> List[str]:  # noqa D102
        pass
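A concrete subclass would implement analyze as the two commented steps suggest: call the external service, then translate its response into recognizer results. The following self-contained sketch stubs both sides (fake_pii_service and the tuple result shape are hypothetical stand-ins; a real implementation would issue an HTTP request and build RecognizerResult objects):

```python
from typing import List, Tuple

def fake_pii_service(text: str, entities: List[str]) -> List[dict]:
    """Hypothetical stand-in for an external PII detection API."""
    hits = []
    for token in text.split():
        if "@" in token and "EMAIL_ADDRESS" in entities:
            start = text.index(token)
            hits.append({"entity": "EMAIL_ADDRESS", "start": start,
                         "end": start + len(token), "score": 0.9})
    return hits

def analyze(text: str, entities: List[str]) -> List[Tuple[str, int, int, float]]:
    # 1. Call the external service.
    raw = fake_pii_service(text, entities)
    # 2. Translate results into the analyzer's result shape
    #    (presidio uses RecognizerResult; plain tuples stand in here).
    return [(r["entity"], r["start"], r["end"], r["score"]) for r in raw]

results = analyze("contact me at jane@example.com", ["EMAIL_ADDRESS"])
```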

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
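For intuition, here is a minimal sketch of the kind of logic an override might apply: boost a result's score when one of the context words appears in the text. The dict result shape and the 0.35 boost are illustrative assumptions; presidio's LemmaContextAwareEnhancer works on lemmatized tokens within a window around the entity.

```python
from typing import List

def enhance_using_context(text: str,
                          results: List[dict],
                          context: List[str]) -> List[dict]:
    """Boost each result's score if any context word occurs in the text."""
    words = text.lower().split()
    for result in results:
        if any(word in words for word in context):
            # Cap the boosted score at 1.0 and flag the enhancement,
            # analogous to IS_SCORE_ENHANCED_BY_CONTEXT_KEY metadata.
            result["score"] = min(result["score"] + 0.35, 1.0)
            result["score_enhanced_by_context"] = True
    return results

hits = [{"entity": "PHONE_NUMBER", "score": 0.4}]
enhanced = enhance_using_context("call my phone at 555-1234", hits, ["phone", "call"])
```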

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
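The deduplication logic above can be exercised with a minimal stand-in for RecognizerResult (MiniResult below is hypothetical, with just enough behavior to run): exact duplicates, zero-score results, and results fully contained in a higher-ranked result of the same entity type are dropped.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MiniResult:
    """Stand-in for RecognizerResult: hashable, with contained_in."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "MiniResult") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[MiniResult]) -> List[MiniResult]:
    # Sort by descending score, then position, then descending length,
    # mirroring the sort key in EntityRecognizer.remove_duplicates.
    results = sorted(set(results),
                     key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for result in results:
        if result.score == 0 or result in filtered:
            continue
        if any(result.contained_in(f) and result.entity_type == f.entity_type
               for f in filtered):
            continue  # contained in a higher-ranked result of the same type
        filtered.append(result)
    return filtered

results = [
    MiniResult("PHONE_NUMBER", 10, 22, 0.9),
    MiniResult("PHONE_NUMBER", 12, 22, 0.6),   # contained in the first
    MiniResult("EMAIL_ADDRESS", 30, 45, 0.0),  # zero score
]
deduped = remove_duplicates(results)
```

Only the highest-scoring phone-number result survives.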

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
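Since sanitize_value is a pure string transformation, it is easy to sketch standalone; a typical use is stripping separators before matching a pattern (the credit-card example below is illustrative):

```python
from typing import List, Tuple

def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """Apply each (search, replacement) pair in order, as in the method above."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Strip dashes and spaces before validating a card-number candidate.
cleaned = sanitize_value("4111-1111 1111-1111", [("-", ""), (" ", "")])
```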

analyze abstractmethod

analyze(text: str, entities: List[str], nlp_artifacts: NlpArtifacts)

Call an external service for PII detection.

PARAMETER DESCRIPTION
text

text to be analyzed

TYPE: str

entities

Entities that should be looked for

TYPE: List[str]

nlp_artifacts

Additional metadata from the NLP engine

TYPE: NlpArtifacts

RETURNS DESCRIPTION

List of identified PII entities

Source code in presidio_analyzer/remote_recognizer.py
@abstractmethod
def analyze(self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts):  # noqa ANN201
    """
    Call an external service for PII detection.

    :param text: text to be analyzed
    :param entities: Entities that should be looked for
    :param nlp_artifacts: Additional metadata from the NLP engine
    :return: List of identified PII entities
    """

    # 1. Call the external service.
    # 2. Translate results into List[RecognizerResult]
    pass

Recognizer registry modules

presidio_analyzer.recognizer_registry.RecognizerRegistry

Detect, register and hold all recognizers to be used by the analyzer.

PARAMETER DESCRIPTION
recognizers

An optional list of recognizers that will be available instead of the predefined recognizers

TYPE: Optional[Iterable[EntityRecognizer]] DEFAULT: None

global_regex_flags

regex flags to be used in regex matching, including deny-lists

TYPE: Optional[int] DEFAULT: DOTALL | MULTILINE | IGNORECASE

supported_languages

List of languages supported by this registry.

TYPE: Optional[List[str]] DEFAULT: None

METHOD DESCRIPTION
add_nlp_recognizer

Add an NLP recognizer matching the given NLP engine.

load_predefined_recognizers

Load the existing recognizers into memory.

get_recognizers

Return a list of recognizers that support the specified entities and language.

add_recognizer

Add a new recognizer to the list of recognizers.

remove_recognizer

Remove a recognizer based on its name.

add_pattern_recognizer_from_dict

Load a pattern recognizer from a Dict into the recognizer registry.

add_recognizers_from_yaml

Read YAML file and load recognizers into the recognizer registry.

get_supported_entities

Return the supported entities by the set of recognizers loaded.

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
class RecognizerRegistry:
    """
    Detect, register and hold all recognizers to be used by the analyzer.

    :param recognizers: An optional list of recognizers,
    that will be available instead of the predefined recognizers
    :param global_regex_flags: regex flags to be used in regex matching,
    including deny-lists
    :param supported_languages: List of languages supported by this registry.

    """

    def __init__(
        self,
        recognizers: Optional[Iterable[EntityRecognizer]] = None,
        global_regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
        supported_languages: Optional[List[str]] = None,
    ):
        if recognizers:
            self.recognizers = recognizers
        else:
            self.recognizers = []
        self.global_regex_flags = global_regex_flags
        self.supported_languages = (
            supported_languages if supported_languages else ["en"]
        )

    def _create_nlp_recognizer(
        self,
        nlp_engine: Optional[NlpEngine] = None,
        supported_language: Optional[str] = None
    ) -> SpacyRecognizer:
        nlp_recognizer = self._get_nlp_recognizer(nlp_engine)

        if nlp_engine:
            return nlp_recognizer(
                supported_language=supported_language,
                supported_entities=nlp_engine.get_supported_entities(),
            )

        return nlp_recognizer(supported_language=supported_language)

    def add_nlp_recognizer(self, nlp_engine: NlpEngine) -> None:
        """
        Adding NLP recognizer in accordance with the nlp engine.

        :param nlp_engine: The NLP engine.
        :return: None
        """

        if not nlp_engine:
            supported_languages = self.supported_languages
        else:
            supported_languages = nlp_engine.get_supported_languages()

        self.recognizers.extend(
            [
                self._create_nlp_recognizer(
                    nlp_engine=nlp_engine, supported_language=supported_language
                )
                for supported_language in supported_languages
            ]
        )

    def load_predefined_recognizers(
        self, languages: Optional[List[str]] = None, nlp_engine: NlpEngine = None
    ) -> None:
        """
        Load the existing recognizers into memory.

        :param languages: List of languages for which to load recognizers
        :param nlp_engine: The NLP engine to use.
        :return: None
        """

        registry_configuration = {"global_regex_flags": self.global_regex_flags}
        if languages is not None:
            registry_configuration["supported_languages"] = languages

        configuration = RecognizerConfigurationLoader.get(
            registry_configuration=registry_configuration
        )
        recognizers = RecognizerListLoader.get(**configuration)

        self.recognizers.extend(recognizers)
        self.add_nlp_recognizer(nlp_engine=nlp_engine)

    @staticmethod
    def _get_nlp_recognizer(
        nlp_engine: NlpEngine,
    ) -> Type[SpacyRecognizer]:
        """Return the recognizer leveraging the selected NLP Engine."""

        if isinstance(nlp_engine, StanzaNlpEngine):
            return StanzaRecognizer
        if isinstance(nlp_engine, TransformersNlpEngine):
            return TransformersRecognizer
        if not nlp_engine or isinstance(nlp_engine, SpacyNlpEngine):
            return SpacyRecognizer
        else:
            logger.warning(
                "nlp engine should be either SpacyNlpEngine,"
                "StanzaNlpEngine or TransformersNlpEngine"
            )
            # Returning default
            return SpacyRecognizer

    def get_recognizers(
        self,
        language: str,
        entities: Optional[List[str]] = None,
        all_fields: bool = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    ) -> List[EntityRecognizer]:
        """
        Return a list of recognizers which supports the specified name and language.

        :param entities: the requested entities
        :param language: the requested language
        :param all_fields: a flag to return all fields of a requested language.
        :param ad_hoc_recognizers: Additional recognizers provided by the user
        as part of the request
        :return: A list of the recognizers which supports the supplied entities
        and language
        """
        if language is None:
            raise ValueError("No language provided")

        if entities is None and all_fields is False:
            raise ValueError("No entities provided")

        all_possible_recognizers = copy.copy(self.recognizers)
        if ad_hoc_recognizers:
            all_possible_recognizers.extend(ad_hoc_recognizers)

        # filter out unwanted recognizers
        to_return = set()
        if all_fields:
            to_return = [
                rec
                for rec in all_possible_recognizers
                if language == rec.supported_language
            ]
        else:
            for entity in entities:
                subset = [
                    rec
                    for rec in all_possible_recognizers
                    if entity in rec.supported_entities
                    and language == rec.supported_language
                ]

                if not subset:
                    logger.warning(
                        "Entity %s doesn't have the corresponding"
                        " recognizer in language : %s",
                        entity,
                        language,
                    )
                else:
                    to_return.update(set(subset))

        logger.debug(
            "Returning a total of %s recognizers",
            str(len(to_return)),
        )

        if not to_return:
            raise ValueError("No matching recognizers were found to serve the request.")

        return list(to_return)

    def add_recognizer(self, recognizer: EntityRecognizer) -> None:
        """
        Add a new recognizer to the list of recognizers.

        :param recognizer: Recognizer to add
        """
        if not isinstance(recognizer, EntityRecognizer):
            raise ValueError("Input is not of type EntityRecognizer")

        self.recognizers.append(recognizer)

    def remove_recognizer(
        self, recognizer_name: str, language: Optional[str] = None
    ) -> None:
        """
        Remove a recognizer based on its name.

        :param recognizer_name: Name of recognizer to remove
        :param language: The supported language of the recognizer to be removed,
        in case multiple recognizers with the same name are present,
        and only one should be removed.
        """

        if not language:
            new_recognizers = [
                rec for rec in self.recognizers if rec.name != recognizer_name
            ]

            logger.info(
                "Removed %s recognizers which had the name %s",
                str(len(self.recognizers) - len(new_recognizers)),
                recognizer_name,
            )

        else:
            new_recognizers = [
                rec
                for rec in self.recognizers
                if rec.name != recognizer_name or rec.supported_language != language
            ]

            logger.info(
                "Removed %s recognizers which had the name %s and language %s",
                str(len(self.recognizers) - len(new_recognizers)),
                recognizer_name,
                language,
            )

        self.recognizers = new_recognizers

    def add_pattern_recognizer_from_dict(self, recognizer_dict: Dict) -> None:
        """
        Load a pattern recognizer from a Dict into the recognizer registry.

        :param recognizer_dict: Dict holding a serialization of an PatternRecognizer

        :example:
        >>> registry = RecognizerRegistry()
        >>> recognizer = { "name": "Titles Recognizer", "supported_language": "de","supported_entity": "TITLE", "deny_list": ["Mr.","Mrs."]}
        >>> registry.add_pattern_recognizer_from_dict(recognizer)
        """  # noqa: E501

        recognizer = PatternRecognizer.from_dict(recognizer_dict)
        self.add_recognizer(recognizer)

    def add_recognizers_from_yaml(self, yml_path: Union[str, Path]) -> None:
        r"""
        Read YAML file and load recognizers into the recognizer registry.

        See example yaml file here:
        https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/example_recognizers.yaml

        :example:
        >>> yaml_file = "recognizers.yaml"
        >>> registry = RecognizerRegistry()
        >>> registry.add_recognizers_from_yaml(yaml_file)

        """

        try:
            with open(yml_path) as stream:
                yaml_recognizers = yaml.safe_load(stream)

            for yaml_recognizer in yaml_recognizers["recognizers"]:
                self.add_pattern_recognizer_from_dict(yaml_recognizer)
        except OSError as io_error:
            print(f"Error reading file {yml_path}")
            raise io_error
        except yaml.YAMLError as yaml_error:
            print(f"Failed to parse file {yml_path}")
            raise yaml_error
        except TypeError as yaml_error:
            print(f"Failed to parse file {yml_path}")
            raise yaml_error

    def __instantiate_recognizer(
        self, recognizer_class: Type[EntityRecognizer], supported_language: str
    ):
        """
        Instantiate a recognizer class given type and input.

        :param recognizer_class: Class object of the recognizer
        :param supported_language: Language this recognizer should support
        """

        inst = recognizer_class(supported_language=supported_language)
        if isinstance(inst, PatternRecognizer):
            inst.global_regex_flags = self.global_regex_flags
        return inst

    def _get_supported_languages(self) -> List[str]:
        languages = []
        for rec in self.recognizers:
            languages.append(rec.supported_language)

        return list(set(languages))

    def get_supported_entities(
        self, languages: Optional[List[str]] = None
    ) -> List[str]:
        """
        Return the supported entities by the set of recognizers loaded.

        :param languages: The languages to get the supported entities for.
        If languages=None, returns all entities for all languages.
        """
        if not languages:
            languages = self._get_supported_languages()

        supported_entities = []
        for language in languages:
            recognizers = self.get_recognizers(language=language, all_fields=True)

            for recognizer in recognizers:
                supported_entities.extend(recognizer.get_supported_entities())

        return list(set(supported_entities))

add_nlp_recognizer

add_nlp_recognizer(nlp_engine: NlpEngine) -> None

Add an NLP recognizer matching the given NLP engine.

PARAMETER DESCRIPTION
nlp_engine

The NLP engine.

TYPE: NlpEngine

RETURNS DESCRIPTION
None

None

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def add_nlp_recognizer(self, nlp_engine: NlpEngine) -> None:
    """
    Adding NLP recognizer in accordance with the nlp engine.

    :param nlp_engine: The NLP engine.
    :return: None
    """

    if not nlp_engine:
        supported_languages = self.supported_languages
    else:
        supported_languages = nlp_engine.get_supported_languages()

    self.recognizers.extend(
        [
            self._create_nlp_recognizer(
                nlp_engine=nlp_engine, supported_language=supported_language
            )
            for supported_language in supported_languages
        ]
    )

load_predefined_recognizers

load_predefined_recognizers(
    languages: Optional[List[str]] = None, nlp_engine: NlpEngine = None
) -> None

Load the existing recognizers into memory.

PARAMETER DESCRIPTION
languages

List of languages for which to load recognizers

TYPE: Optional[List[str]] DEFAULT: None

nlp_engine

The NLP engine to use.

TYPE: NlpEngine DEFAULT: None

RETURNS DESCRIPTION
None

None

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def load_predefined_recognizers(
    self, languages: Optional[List[str]] = None, nlp_engine: NlpEngine = None
) -> None:
    """
    Load the existing recognizers into memory.

    :param languages: List of languages for which to load recognizers
    :param nlp_engine: The NLP engine to use.
    :return: None
    """

    registry_configuration = {"global_regex_flags": self.global_regex_flags}
    if languages is not None:
        registry_configuration["supported_languages"] = languages

    configuration = RecognizerConfigurationLoader.get(
        registry_configuration=registry_configuration
    )
    recognizers = RecognizerListLoader.get(**configuration)

    self.recognizers.extend(recognizers)
    self.add_nlp_recognizer(nlp_engine=nlp_engine)

get_recognizers

get_recognizers(
    language: str,
    entities: Optional[List[str]] = None,
    all_fields: bool = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
) -> List[EntityRecognizer]

Return a list of recognizers that support the specified entities and language.

PARAMETER DESCRIPTION
entities

the requested entities

TYPE: Optional[List[str]] DEFAULT: None

language

the requested language

TYPE: str

all_fields

a flag to return all fields of a requested language.

TYPE: bool DEFAULT: False

ad_hoc_recognizers

Additional recognizers provided by the user as part of the request

TYPE: Optional[List[EntityRecognizer]] DEFAULT: None

RETURNS DESCRIPTION
List[EntityRecognizer]

A list of the recognizers that support the supplied entities and language

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def get_recognizers(
    self,
    language: str,
    entities: Optional[List[str]] = None,
    all_fields: bool = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
) -> List[EntityRecognizer]:
    """
    Return a list of recognizers which supports the specified name and language.

    :param entities: the requested entities
    :param language: the requested language
    :param all_fields: a flag to return all fields of a requested language.
    :param ad_hoc_recognizers: Additional recognizers provided by the user
    as part of the request
    :return: A list of the recognizers which supports the supplied entities
    and language
    """
    if language is None:
        raise ValueError("No language provided")

    if entities is None and all_fields is False:
        raise ValueError("No entities provided")

    all_possible_recognizers = copy.copy(self.recognizers)
    if ad_hoc_recognizers:
        all_possible_recognizers.extend(ad_hoc_recognizers)

    # filter out unwanted recognizers
    to_return = set()
    if all_fields:
        to_return = [
            rec
            for rec in all_possible_recognizers
            if language == rec.supported_language
        ]
    else:
        for entity in entities:
            subset = [
                rec
                for rec in all_possible_recognizers
                if entity in rec.supported_entities
                and language == rec.supported_language
            ]

            if not subset:
                logger.warning(
                    "Entity %s doesn't have the corresponding"
                    " recognizer in language : %s",
                    entity,
                    language,
                )
            else:
                to_return.update(set(subset))

    logger.debug(
        "Returning a total of %s recognizers",
        str(len(to_return)),
    )

    if not to_return:
        raise ValueError("No matching recognizers were found to serve the request.")

    return list(to_return)
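The entity/language filtering above can be sketched standalone with a minimal recognizer stand-in (MiniRecognizer is hypothetical): a recognizer is selected when it supports the requested language and, unless all_fields is set, at least one of the requested entities.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class MiniRecognizer:
    """Stand-in for EntityRecognizer with the fields the filter needs."""
    name: str
    supported_language: str
    supported_entities: Tuple[str, ...]

def get_recognizers(recognizers: List[MiniRecognizer],
                    language: str,
                    entities: Optional[List[str]] = None,
                    all_fields: bool = False) -> List[MiniRecognizer]:
    if all_fields:
        return [r for r in recognizers if r.supported_language == language]
    selected = set()  # a set, so one recognizer matching two entities appears once
    for entity in entities:
        selected.update(
            r for r in recognizers
            if entity in r.supported_entities and r.supported_language == language
        )
    return list(selected)

registry = [
    MiniRecognizer("CreditCard", "en", ("CREDIT_CARD",)),
    MiniRecognizer("Email", "en", ("EMAIL_ADDRESS",)),
    MiniRecognizer("Email_de", "de", ("EMAIL_ADDRESS",)),
]
hits = get_recognizers(registry, language="en", entities=["EMAIL_ADDRESS"])
```

Only the English email recognizer is returned; the German one is filtered out by language.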

add_recognizer

add_recognizer(recognizer: EntityRecognizer) -> None

Add a new recognizer to the list of recognizers.

PARAMETER DESCRIPTION
recognizer

Recognizer to add

TYPE: EntityRecognizer

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def add_recognizer(self, recognizer: EntityRecognizer) -> None:
    """
    Add a new recognizer to the list of recognizers.

    :param recognizer: Recognizer to add
    """
    if not isinstance(recognizer, EntityRecognizer):
        raise ValueError("Input is not of type EntityRecognizer")

    self.recognizers.append(recognizer)

remove_recognizer

remove_recognizer(recognizer_name: str, language: Optional[str] = None) -> None

Remove a recognizer based on its name.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer to remove

TYPE: str

language

The supported language of the recognizer to be removed, in case multiple recognizers with the same name are present, and only one should be removed.

TYPE: Optional[str] DEFAULT: None

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def remove_recognizer(
    self, recognizer_name: str, language: Optional[str] = None
) -> None:
    """
    Remove a recognizer based on its name.

    :param recognizer_name: Name of recognizer to remove
    :param language: The supported language of the recognizer to be removed,
    in case multiple recognizers with the same name are present,
    and only one should be removed.
    """

    if not language:
        new_recognizers = [
            rec for rec in self.recognizers if rec.name != recognizer_name
        ]

        logger.info(
            "Removed %s recognizers which had the name %s",
            str(len(self.recognizers) - len(new_recognizers)),
            recognizer_name,
        )

    else:
        new_recognizers = [
            rec
            for rec in self.recognizers
            if rec.name != recognizer_name or rec.supported_language != language
        ]

        logger.info(
            "Removed %s recognizers which had the name %s and language %s",
            str(len(self.recognizers) - len(new_recognizers)),
            recognizer_name,
            language,
        )

    self.recognizers = new_recognizers
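The removal logic reduces to two list comprehensions, sketched here on plain (name, language) pairs standing in for recognizers (hypothetical data): without a language, every recognizer with the name goes; with one, only the matching name-and-language combination goes.

```python
from typing import List, Optional, Tuple

Rec = Tuple[str, str]  # (name, supported_language) stand-in

def remove_recognizer(recognizers: List[Rec],
                      recognizer_name: str,
                      language: Optional[str] = None) -> List[Rec]:
    if not language:
        # Remove every recognizer with this name, in any language.
        return [r for r in recognizers if r[0] != recognizer_name]
    # Keep a recognizer unless both its name and language match.
    return [r for r in recognizers
            if r[0] != recognizer_name or r[1] != language]

recs = [("Titles", "en"), ("Titles", "de"), ("Email", "en")]
only_de_removed = remove_recognizer(recs, "Titles", language="de")
all_titles_removed = remove_recognizer(recs, "Titles")
```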

add_pattern_recognizer_from_dict

add_pattern_recognizer_from_dict(recognizer_dict: Dict) -> None

Load a pattern recognizer from a Dict into the recognizer registry.

:example:

registry = RecognizerRegistry()
recognizer = { "name": "Titles Recognizer", "supported_language": "de","supported_entity": "TITLE", "deny_list": ["Mr.","Mrs."]}
registry.add_pattern_recognizer_from_dict(recognizer)

PARAMETER DESCRIPTION
recognizer_dict

Dict holding a serialization of a PatternRecognizer

TYPE: Dict

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def add_pattern_recognizer_from_dict(self, recognizer_dict: Dict) -> None:
    """
    Load a pattern recognizer from a Dict into the recognizer registry.

    :param recognizer_dict: Dict holding a serialization of a PatternRecognizer

    :example:
    >>> registry = RecognizerRegistry()
    >>> recognizer = { "name": "Titles Recognizer", "supported_language": "de","supported_entity": "TITLE", "deny_list": ["Mr.","Mrs."]}
    >>> registry.add_pattern_recognizer_from_dict(recognizer)
    """  # noqa: E501

    recognizer = PatternRecognizer.from_dict(recognizer_dict)
    self.add_recognizer(recognizer)
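
Deny-list terms such as `["Mr.", "Mrs."]` are ultimately matched via regular expressions; the helper below is a hypothetical simplification of that idea, not `PatternRecognizer`'s actual implementation:

```python
import re

def deny_list_to_regex(deny_list):
    # hypothetical simplification: escape each deny-list term and join
    # them into a single alternation pattern
    return "(?:" + "|".join(re.escape(term) for term in deny_list) + ")"

pattern = deny_list_to_regex(["Mr.", "Mrs."])
match = re.search(pattern, "Dear Mr. Smith")
```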

add_recognizers_from_yaml

add_recognizers_from_yaml(yml_path: Union[str, Path]) -> None

Read YAML file and load recognizers into the recognizer registry.

See example yaml file here: https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/example_recognizers.yaml

:example:

>>> yaml_file = "recognizers.yaml"
>>> registry = RecognizerRegistry()
>>> registry.add_recognizers_from_yaml(yaml_file)

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def add_recognizers_from_yaml(self, yml_path: Union[str, Path]) -> None:
    r"""
    Read YAML file and load recognizers into the recognizer registry.

    See example yaml file here:
    https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/example_recognizers.yaml

    :example:
    >>> yaml_file = "recognizers.yaml"
    >>> registry = RecognizerRegistry()
    >>> registry.add_recognizers_from_yaml(yaml_file)

    """

    try:
        with open(yml_path) as stream:
            yaml_recognizers = yaml.safe_load(stream)

        for yaml_recognizer in yaml_recognizers["recognizers"]:
            self.add_pattern_recognizer_from_dict(yaml_recognizer)
    except OSError as io_error:
        print(f"Error reading file {yml_path}")
        raise io_error
    except yaml.YAMLError as yaml_error:
        print(f"Failed to parse file {yml_path}")
        raise yaml_error
    except TypeError as yaml_error:
        print(f"Failed to parse file {yml_path}")
        raise yaml_error
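
The loading loop expects a top-level `recognizers` key; the stand-in below mirrors that structure with JSON (stdlib) instead of YAML, so the names and shapes are illustrative only:

```python
import io
import json

def load_recognizers(stream):
    # mirrors the loader's structure: parse the stream, then read the
    # top-level "recognizers" key (JSON here to stay stdlib-only)
    data = json.load(stream)
    return data["recognizers"]

doc = '{"recognizers": [{"name": "Titles Recognizer", "supported_entity": "TITLE"}]}'
recognizers = load_recognizers(io.StringIO(doc))
```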

get_supported_entities

get_supported_entities(languages: Optional[List[str]] = None) -> List[str]

Return the entities supported by the set of loaded recognizers.

PARAMETER DESCRIPTION
languages

The languages to get the supported entities for. If languages=None, returns all entities for all languages.

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/recognizer_registry/recognizer_registry.py
def get_supported_entities(
    self, languages: Optional[List[str]] = None
) -> List[str]:
    """
    Return the supported entities by the set of recognizers loaded.

    :param languages: The languages to get the supported entities for.
    If languages=None, returns all entities for all languages.
    """
    if not languages:
        languages = self._get_supported_languages()

    supported_entities = []
    for language in languages:
        recognizers = self.get_recognizers(language=language, all_fields=True)

        for recognizer in recognizers:
            supported_entities.extend(recognizer.get_supported_entities())

    return list(set(supported_entities))
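
The aggregation logic can be sketched without a registry; `entities_by_language` below is a hypothetical mapping standing in for the loaded recognizers:

```python
def get_supported_entities(entities_by_language, languages=None):
    # entities_by_language: hypothetical mapping of language -> list of
    # per-recognizer entity lists, standing in for the registry
    if not languages:
        languages = list(entities_by_language)
    supported = []
    for language in languages:
        for recognizer_entities in entities_by_language[language]:
            supported.extend(recognizer_entities)
    # deduplicate across recognizers and languages
    return list(set(supported))

registry = {"en": [["PERSON"], ["ZIP"]], "de": [["TITLE"]]}
```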

presidio_analyzer.recognizer_registry.RecognizerRegistryProvider

Utility class for loading Recognizer Registry.

Use this class to load a recognizer registry from a YAML file.

:example:

{
    "supported_languages": ["de", "es"],
    "recognizers": [
        {
            "name": "Zip code Recognizer",
            "supported_language": "en",
            "patterns": [
                {
                    "name": "zip code (weak)",
                    "regex": "(\\b\\d{5}(?:-\\d{4})?\\b)",
                    "score": 0.01,
                }
            ],
            "context": ["zip", "code"],
            "supported_entity": "ZIP",
        }
    ]
}

PARAMETER DESCRIPTION
conf_file

Path to yaml file containing registry configuration

TYPE: Optional[Union[Path, str]] DEFAULT: None

registry_configuration

Dict containing registry configuration

TYPE: Optional[Dict] DEFAULT: None

METHOD DESCRIPTION
create_recognizer_registry

Create a recognizer registry according to configuration loaded previously.

Source code in presidio_analyzer/recognizer_registry/recognizer_registry_provider.py
class RecognizerRegistryProvider:
    r"""
    Utility class for loading Recognizer Registry.

    Use this class to load recognizer registry from a yaml file

    :param conf_file: Path to yaml file containing registry configuration
    :param registry_configuration: Dict containing registry configuration
    :example:
        {
            "supported_languages": ["de", "es"],
            "recognizers": [
                {
                    "name": "Zip code Recognizer",
                    "supported_language": "en",
                    "patterns": [
                        {
                            "name": "zip code (weak)",
                            "regex": "(\\b\\d{5}(?:\\-\\d{4})?\\b)",
                            "score": 0.01,
                        }
                    ],
                    "context": ["zip", "code"],
                    "supported_entity": "ZIP",
                }
            ]
        }
    """

    def __init__(
        self,
        conf_file: Optional[Union[Path, str]] = None,
        registry_configuration: Optional[Dict] = None,
    ):
        self.configuration = RecognizerConfigurationLoader.get(
            conf_file=conf_file, registry_configuration=registry_configuration
        )
        return

    def create_recognizer_registry(self) -> RecognizerRegistry:
        """Create a recognizer registry according to configuration loaded previously."""
        supported_languages = self.configuration.get("supported_languages")
        global_regex_flags = self.configuration.get("global_regex_flags")
        recognizers = RecognizerListLoader.get(
            self.configuration.get("recognizers"),
            supported_languages,
            global_regex_flags,
        )

        return RecognizerRegistry(
            recognizers=recognizers,
            supported_languages=supported_languages,
            global_regex_flags=global_regex_flags,
        )

create_recognizer_registry

create_recognizer_registry() -> RecognizerRegistry

Create a recognizer registry according to configuration loaded previously.

Source code in presidio_analyzer/recognizer_registry/recognizer_registry_provider.py
def create_recognizer_registry(self) -> RecognizerRegistry:
    """Create a recognizer registry according to configuration loaded previously."""
    supported_languages = self.configuration.get("supported_languages")
    global_regex_flags = self.configuration.get("global_regex_flags")
    recognizers = RecognizerListLoader.get(
        self.configuration.get("recognizers"),
        supported_languages,
        global_regex_flags,
    )

    return RecognizerRegistry(
        recognizers=recognizers,
        supported_languages=supported_languages,
        global_regex_flags=global_regex_flags,
    )

Context awareness modules

presidio_analyzer.context_aware_enhancers

Context awareness modules.

ContextAwareEnhancer

A class representing an abstract context aware enhancer.

Context words might enhance the confidence score of a recognized entity. ContextAwareEnhancer is an abstract class to be inherited by context-aware enhancer logic.

PARAMETER DESCRIPTION
context_similarity_factor

How much to enhance confidence of match entity

TYPE: float

min_score_with_context_similarity

Minimum confidence score

TYPE: float

context_prefix_count

how many words before the entity to match context

TYPE: int

context_suffix_count

how many words after the entity to match context

TYPE: int

METHOD DESCRIPTION
enhance_using_context

Update results in case surrounding words are relevant to the context words.

Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
class ContextAwareEnhancer:
    """
    A class representing an abstract context aware enhancer.

    Context words might enhance confidence score of a recognized entity.
    ContextAwareEnhancer is an abstract class to be inherited by a context aware
    enhancer logic.

    :param context_similarity_factor: How much to enhance confidence of match entity
    :param min_score_with_context_similarity: Minimum confidence score
    :param context_prefix_count: how many words before the entity to match context
    :param context_suffix_count: how many words after the entity to match context
    """

    MIN_SCORE = 0
    MAX_SCORE = 1.0

    def __init__(
        self,
        context_similarity_factor: float,
        min_score_with_context_similarity: float,
        context_prefix_count: int,
        context_suffix_count: int,
    ):
        self.context_similarity_factor = context_similarity_factor
        self.min_score_with_context_similarity = min_score_with_context_similarity
        self.context_prefix_count = context_prefix_count
        self.context_suffix_count = context_suffix_count

    @abstractmethod
    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Update results in case surrounding words are relevant to the context words.

        Using the surrounding words of the actual word matches, look
        for specific strings that if found contribute to the score
        of the result, improving the confidence that the match is
        indeed of that PII entity type

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The nlp artifacts contains elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """
        return raw_results

enhance_using_context abstractmethod

enhance_using_context(
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Update results in case surrounding words are relevant to the context words.

Using the words surrounding the actual matches, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_results

Recognizer results which didn't take context into consideration

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

recognizers

the list of recognizers

TYPE: List[EntityRecognizer]

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/context_aware_enhancers/context_aware_enhancer.py
@abstractmethod
def enhance_using_context(
    self,
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """
    Update results in case surrounding words are relevant to the context words.

    Using the surrounding words of the actual word matches, look
    for specific strings that if found contribute to the score
    of the result, improving the confidence that the match is
    indeed of that PII entity type

    :param text: The actual text that was analyzed
    :param raw_results: Recognizer results which didn't take
                        context into consideration
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param recognizers: the list of recognizers
    :param context: list of context words
    """
    return raw_results
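
How `context_prefix_count` and `context_suffix_count` bound the collected window can be sketched with a hypothetical helper (names are illustrative, not Presidio's API):

```python
def context_window(lemmas, index, prefix_count=5, suffix_count=0):
    # hypothetical helper: collect up to prefix_count words before and
    # suffix_count words after the matched entity at `index`
    before = lemmas[max(0, index - prefix_count):index]
    after = lemmas[index + 1:index + 1 + suffix_count]
    return before + after

lemmas = ["my", "zip", "code", "be", "90210", "thanks"]
```

With the defaults (five words back, none forward) only the preceding words would be inspected.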

LemmaContextAwareEnhancer

Bases: ContextAwareEnhancer

A class representing a lemma based context aware enhancer logic.

Context words might enhance the confidence score of a recognized entity. LemmaContextAwareEnhancer is a lemma-based implementation of the context-aware logic: it compares the spaCy lemmas of each word in the context of the matched entity to the given context and the recognizer's context words; if a match is found, it enhances the recognized entity's confidence score by a given factor.

PARAMETER DESCRIPTION
context_similarity_factor

How much to enhance confidence of match entity

TYPE: float DEFAULT: 0.35

min_score_with_context_similarity

Minimum confidence score

TYPE: float DEFAULT: 0.4

context_prefix_count

how many words before the entity to match context

TYPE: int DEFAULT: 5

context_suffix_count

how many words after the entity to match context

TYPE: int DEFAULT: 0

METHOD DESCRIPTION
enhance_using_context

Update results in case the lemmas of surrounding words or input context words are identical to the context words.

Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
class LemmaContextAwareEnhancer(ContextAwareEnhancer):
    """
    A class representing a lemma based context aware enhancer logic.

    Context words might enhance confidence score of a recognized entity.
    LemmaContextAwareEnhancer is an implementation of lemma-based context aware
    logic: it compares the spacy lemmas of each word in the context of the matched
    entity to the given context and the recognizer context words;
    if matched, it enhances the recognized entity confidence score by a given factor.

    :param context_similarity_factor: How much to enhance confidence of match entity
    :param min_score_with_context_similarity: Minimum confidence score
    :param context_prefix_count: how many words before the entity to match context
    :param context_suffix_count: how many words after the entity to match context
    """

    def __init__(
        self,
        context_similarity_factor: float = 0.35,
        min_score_with_context_similarity: float = 0.4,
        context_prefix_count: int = 5,
        context_suffix_count: int = 0,
    ):
        super().__init__(
            context_similarity_factor=context_similarity_factor,
            min_score_with_context_similarity=min_score_with_context_similarity,
            context_prefix_count=context_prefix_count,
            context_suffix_count=context_suffix_count,
        )

    def enhance_using_context(
        self,
        text: str,
        raw_results: List[RecognizerResult],
        nlp_artifacts: NlpArtifacts,
        recognizers: List[EntityRecognizer],
        context: Optional[List[str]] = None,
    ) -> List[RecognizerResult]:
        """
        Update results in case the lemmas of surrounding words or input context
        words are identical to the context words.

        Using the surrounding words of the actual word matches, look
        for specific strings that if found contribute to the score
        of the result, improving the confidence that the match is
        indeed of that PII entity type

        :param text: The actual text that was analyzed
        :param raw_results: Recognizer results which didn't take
                            context into consideration
        :param nlp_artifacts: The nlp artifacts contains elements
                              such as lemmatized tokens for better
                              accuracy of the context enhancement process
        :param recognizers: the list of recognizers
        :param context: list of context words
        """  # noqa D205 D400

        # create a deep copy of the results object, so we can manipulate it
        results = copy.deepcopy(raw_results)

        # create recognizer context dictionary
        recognizers_dict = {recognizer.id: recognizer for recognizer in recognizers}

        # Create an empty list if None, or lowercase all context words in the list
        if not context:
            context = []
        else:
            context = [word.lower() for word in context]

        # Sanity
        if nlp_artifacts is None:
            logger.warning("NLP artifacts were not provided")
            return results

        for result in results:
            recognizer = None
            # get recognizer matching the result, if found.
            if (
                result.recognition_metadata
                and RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                in result.recognition_metadata.keys()
            ):
                recognizer = recognizers_dict.get(
                    result.recognition_metadata[
                        RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                    ]
                )

            if not recognizer:
                logger.debug(
                    "Recognizer name not found as part of the "
                    "recognition_metadata dict in the RecognizerResult. "
                )
                continue

            # skip recognizer result if the recognizer doesn't support
            # context enhancement
            if not recognizer.context:
                logger.debug(
                    "recognizer '%s' does not support context enhancement",
                    recognizer.name,
                )
                continue

            # skip context enhancement if already boosted by recognizer level
            if result.recognition_metadata.get(
                RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY
            ):
                logger.debug("result score already boosted, skipping")
                continue

            # extract lemmatized context from the surrounding of the match
            word = text[result.start : result.end]

            surrounding_words = self._extract_surrounding_words(
                nlp_artifacts=nlp_artifacts, word=word, start=result.start
            )

            # combine other sources of context with surrounding words
            surrounding_words.extend(context)

            supportive_context_word = self._find_supportive_word_in_context(
                surrounding_words, recognizer.context
            )
            if supportive_context_word != "":
                result.score += self.context_similarity_factor
                result.score = max(result.score, self.min_score_with_context_similarity)
                result.score = min(result.score, ContextAwareEnhancer.MAX_SCORE)

                # Update the explainability object with context information
                # helped to improve the score
                result.analysis_explanation.set_supportive_context_word(
                    supportive_context_word
                )
                result.analysis_explanation.set_improved_score(result.score)
        return results

    @staticmethod
    def _find_supportive_word_in_context(
        context_list: List[str], recognizer_context_list: List[str]
    ) -> str:
        """
        Find words in the text which are relevant for context evaluation.

        A word is considered a supportive context word if there's exact match
        between a keyword in context_text and any keyword in context_list.

        :param context_list words before and after the matched entity within
               a specified window size
        :param recognizer_context_list a list of words considered as
                context keywords manually specified by the recognizer's author
        """
        word = ""
        # If the context list is empty, no need to continue
        if context_list is None or recognizer_context_list is None:
            return word

        for predefined_context_word in recognizer_context_list:
            # result == true only if any of the predefined context words
            # is found exactly or as a substring in any of the collected
            # context words
            result = next(
                (
                    True
                    for keyword in context_list
                    if predefined_context_word in keyword
                ),
                False,
            )
            if result:
                logger.debug("Found context keyword '%s'", predefined_context_word)
                word = predefined_context_word
                break

        return word

    def _extract_surrounding_words(
        self, nlp_artifacts: NlpArtifacts, word: str, start: int
    ) -> List[str]:
        """Extract words surrounding another given word.

        The text from which the context is extracted is given in the nlp
        doc.

        :param nlp_artifacts: An abstraction layer which holds different
                              items which are the result of a NLP pipeline
                              execution on a given text
        :param word: The word to look for context around
        :param start: The start index of the word in the original text
        """
        if not nlp_artifacts.tokens:
            logger.info("Skipping context extraction due to lack of NLP artifacts")
        # if there are no nlp artifacts, this is ok: we cannot
        # extract context, so we return a valid, yet empty
        # context
            return [""]

        # Get the already prepared words in the given text, in their
        # LEMMATIZED version
        lemmatized_keywords = nlp_artifacts.keywords

        # since the list of tokens is not necessarily aligned
        # with the actual index of the match, we look for the
        # token index which corresponds to the match
        token_index = self._find_index_of_match_token(
            word, start, nlp_artifacts.tokens, nlp_artifacts.tokens_indices
        )

        # index i belongs to the PII entity, take the preceding n words
        # and the succeeding m words into a context list

        backward_context = self._add_n_words_backward(
            token_index,
            self.context_prefix_count,
            nlp_artifacts.lemmas,
            lemmatized_keywords,
        )
        forward_context = self._add_n_words_forward(
            token_index,
            self.context_suffix_count,
            nlp_artifacts.lemmas,
            lemmatized_keywords,
        )

        context_list = []
        context_list.extend(backward_context)
        context_list.extend(forward_context)
        context_list = list(set(context_list))
        logger.debug("Context list is: %s", " ".join(context_list))
        return context_list

    @staticmethod
    def _find_index_of_match_token(
        word: str,
        start: int,
        tokens,
        tokens_indices: List[int],  # noqa ANN001
    ) -> int:
        found = False
        # we use the known start index of the original word to find the actual
        # token at that index, we are not checking for equivalence since the
        # token might be just a substring of that word (e.g. for phone number
        # 555-124564 the first token might be just '555' or for a match like '
        # rocket' the actual token will just be 'rocket' hence the misalignment
        # of indices)
        # Note: we are iterating over the original tokens (not the lemmatized)
        i = -1
        for i, token in enumerate(tokens, 0):
            # Either we found a token with the exact location, or
            # we take a token which its characters indices covers
            # the index we are looking for.
            if (tokens_indices[i] == start) or (start < tokens_indices[i] + len(token)):
                # found the interesting token, the one that around it
                # we take n words, we save the matching lemma
                found = True
                break

        if not found:
            raise ValueError(
                "Did not find word '" + word + "' "
                "in the list of tokens although it "
                "is expected to be found"
            )
        return i

    @staticmethod
    def _add_n_words(
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
        is_backward: bool,
    ) -> List[str]:
        """
        Prepare a string of context words.

        Return a list of words which surrounds a lemma at a given index.
        The words will be collected only if exist in the filtered array

        :param index: index of the lemma that its surrounding words we want
        :param n_words: number of words to take
        :param lemmas: array of lemmas
        :param lemmatized_filtered_keywords: the array of filtered
               lemmas from the original sentence,
        :param is_backward: if true take the preceding words, if false,
                            take the succeeding words
        """
        i = index
        context_words = []
        # The entity itself is no interest to us...however we want to
        # consider it anyway for cases where it is attached with no spaces
        # to an interesting context word, so we allow it and add 1 to
        # the number of collected words

        # collect at most n words (in lower case)
        remaining = n_words + 1
        while 0 <= i < len(lemmas) and remaining > 0:
            lower_lemma = lemmas[i].lower()
            if lower_lemma in lemmatized_filtered_keywords:
                context_words.append(lower_lemma)
                remaining -= 1
            i = i - 1 if is_backward else i + 1
        return context_words

    def _add_n_words_forward(
        self,
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
    ) -> List[str]:
        return self._add_n_words(
            index, n_words, lemmas, lemmatized_filtered_keywords, False
        )

    def _add_n_words_backward(
        self,
        index: int,
        n_words: int,
        lemmas: List[str],
        lemmatized_filtered_keywords: List[str],
    ) -> List[str]:
        return self._add_n_words(
            index, n_words, lemmas, lemmatized_filtered_keywords, True
        )
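
The supportive-word lookup can be sketched standalone; this is a simplified reading of `_find_supportive_word_in_context`, keeping only the substring check:

```python
def find_supportive_word(context_words, recognizer_context):
    # a recognizer context word counts as supportive if it appears
    # exactly or as a substring inside any collected context word
    if not context_words or not recognizer_context:
        return ""
    for predefined in recognizer_context:
        if any(predefined in word for word in context_words):
            return predefined
    return ""
```

Note the substring semantics: "zip" matches the collected word "zipcode".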

enhance_using_context

enhance_using_context(
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Update results in case the lemmas of surrounding words or input context words are identical to the context words.

Using the words surrounding the actual matches, look for specific strings that, if found, contribute to the score of the result, improving the confidence that the match is indeed of that PII entity type.

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_results

Recognizer results which didn't take context into consideration

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

recognizers

the list of recognizers

TYPE: List[EntityRecognizer]

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py
def enhance_using_context(
    self,
    text: str,
    raw_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    recognizers: List[EntityRecognizer],
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """
    Update results in case the lemmas of surrounding words or input context
    words are identical to the context words.

    Using the surrounding words of the actual word matches, look
    for specific strings that if found contribute to the score
    of the result, improving the confidence that the match is
    indeed of that PII entity type

    :param text: The actual text that was analyzed
    :param raw_results: Recognizer results which didn't take
                        context into consideration
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param recognizers: the list of recognizers
    :param context: list of context words
    """  # noqa D205 D400

    # create a deep copy of the results object, so we can manipulate it
    results = copy.deepcopy(raw_results)

    # create recognizer context dictionary
    recognizers_dict = {recognizer.id: recognizer for recognizer in recognizers}

    # Create an empty list if None, or lowercase all context words in the list
    if not context:
        context = []
    else:
        context = [word.lower() for word in context]

    # Sanity
    if nlp_artifacts is None:
        logger.warning("NLP artifacts were not provided")
        return results

    for result in results:
        recognizer = None
        # get recognizer matching the result, if found.
        if (
            result.recognition_metadata
            and RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
            in result.recognition_metadata.keys()
        ):
            recognizer = recognizers_dict.get(
                result.recognition_metadata[
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY
                ]
            )

        if not recognizer:
            logger.debug(
                "Recognizer name not found as part of the "
                "recognition_metadata dict in the RecognizerResult. "
            )
            continue

        # skip recognizer result if the recognizer doesn't support
        # context enhancement
        if not recognizer.context:
            logger.debug(
                "recognizer '%s' does not support context enhancement",
                recognizer.name,
            )
            continue

        # skip context enhancement if already boosted by recognizer level
        if result.recognition_metadata.get(
            RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY
        ):
            logger.debug("result score already boosted, skipping")
            continue

        # extract lemmatized context from the surrounding of the match
        word = text[result.start : result.end]

        surrounding_words = self._extract_surrounding_words(
            nlp_artifacts=nlp_artifacts, word=word, start=result.start
        )

        # combine other sources of context with surrounding words
        surrounding_words.extend(context)

        supportive_context_word = self._find_supportive_word_in_context(
            surrounding_words, recognizer.context
        )
        if supportive_context_word != "":
            result.score += self.context_similarity_factor
            result.score = max(result.score, self.min_score_with_context_similarity)
            result.score = min(result.score, ContextAwareEnhancer.MAX_SCORE)

            # Update the explainability object with context information
            # helped to improve the score
            result.analysis_explanation.set_supportive_context_word(
                supportive_context_word
            )
            result.analysis_explanation.set_improved_score(result.score)
    return results
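
The score arithmetic above (add a factor, enforce a floor, cap at the maximum) can be sketched standalone. The default values below mirror the ones commonly documented for `LemmaContextAwareEnhancer` (factor 0.35, minimum 0.4), but treat them as illustrative assumptions rather than the authoritative defaults:

```python
def boost_score_with_context(score: float,
                             context_similarity_factor: float = 0.35,
                             min_score_with_context_similarity: float = 0.4,
                             max_score: float = 1.0) -> float:
    """Raise a result score when a supportive context word was found."""
    score += context_similarity_factor
    # enforce a floor so context-supported matches never score too low
    score = max(score, min_score_with_context_similarity)
    # cap at the maximum allowed score
    return min(score, max_score)
```

Note that the floor only matters for very low raw scores; for most matches the additive factor dominates, and the cap prevents scores above 1.0.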

NLP Engine modules

presidio_analyzer.nlp_engine

NLP engine package. Performs text pre-processing.

NerModelConfiguration dataclass

NER model configuration.

PARAMETER DESCRIPTION
labels_to_ignore

List of labels to not return predictions for.

TYPE: Optional[Collection[str]] DEFAULT: None

aggregation_strategy

See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy

TYPE: Optional[str] DEFAULT: 'max'

stride

See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.stride

TYPE: Optional[int] DEFAULT: 14

alignment_mode

See https://spacy.io/api/doc#char_span

TYPE: Optional[str] DEFAULT: 'expand'

default_score

Default confidence score if the model does not provide one.

TYPE: Optional[float] DEFAULT: 0.85

model_to_presidio_entity_mapping

Mapping between the NER model entities and Presidio entities.

TYPE: Optional[Dict[str, str]] DEFAULT: None

low_score_entity_names

Set of entity names that are likely to have low detection accuracy and whose scores should be adjusted.

TYPE: Optional[Collection] DEFAULT: None

low_confidence_score_multiplier

Multiplier applied to the score of entities in low_score_entity_names.

TYPE: Optional[float] DEFAULT: 0.4

METHOD DESCRIPTION
from_dict

Load NLP engine configuration from dict.

to_dict

Return the configuration as a dict.

Source code in presidio_analyzer/nlp_engine/ner_model_configuration.py
@dataclass
class NerModelConfiguration:
    """NER model configuration.

    :param labels_to_ignore: List of labels to not return predictions for.
    :param aggregation_strategy:
    See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.aggregation_strategy
    :param stride:
    See https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TokenClassificationPipeline.stride
    :param alignment_mode: See https://spacy.io/api/doc#char_span
    :param default_score: Default confidence score if the model does not provide one.
    :param model_to_presidio_entity_mapping:
    Mapping between the NER model entities and Presidio entities.
    :param low_score_entity_names:
    Set of entity names that are likely to have low detection accuracy and whose scores should be adjusted.
    :param low_confidence_score_multiplier:
    Multiplier applied to the score of entities in low_score_entity_names.
    """  # noqa E501

    labels_to_ignore: Optional[Collection[str]] = None
    aggregation_strategy: Optional[str] = "max"
    stride: Optional[int] = 14
    alignment_mode: Optional[str] = "expand"
    default_score: Optional[float] = 0.85
    model_to_presidio_entity_mapping: Optional[Dict[str, str]] = None
    low_score_entity_names: Optional[Collection] = None
    low_confidence_score_multiplier: Optional[float] = 0.4

    def __post_init__(self):
        """Validate the configuration and set defaults."""
        if self.model_to_presidio_entity_mapping is None:
            logger.warning(
                "model_to_presidio_entity_mapping is missing from configuration, "
                "using default"
            )
            self.model_to_presidio_entity_mapping = MODEL_TO_PRESIDIO_ENTITY_MAPPING
        if self.low_score_entity_names is None:
            logger.warning(
                "low_score_entity_names is missing from configuration, " "using default"
            )
            self.low_score_entity_names = LOW_SCORE_ENTITY_NAMES
        if self.labels_to_ignore is None:
            logger.warning(
                "labels_to_ignore is missing from configuration, " "using default"
            )
            self.labels_to_ignore = LABELS_TO_IGNORE

    @classmethod
    def _validate_input(cls, ner_model_configuration_dict: Dict) -> None:
        key_to_type = {
            "labels_to_ignore": Collection,
            "aggregation_strategy": str,
            "alignment_mode": str,
            "model_to_presidio_entity_mapping": dict,
            "low_confidence_score_multiplier": float,
            "low_score_entity_names": Collection,
            "stride": int,
        }

        for key, field_type in key_to_type.items():
            cls.__validate_type(
                config_dict=ner_model_configuration_dict, key=key, field_type=field_type
            )

    @staticmethod
    def __validate_type(config_dict: Dict, key: str, field_type: Type) -> None:
        if key in config_dict:
            if not isinstance(config_dict[key], field_type):
                raise ValueError(f"{key} must be of type {field_type}")

    @classmethod
    def from_dict(cls, nlp_engine_configuration: Dict) -> "NerModelConfiguration":
        """Load NLP engine configuration from dict.

        :param nlp_engine_configuration: Dict with the configuration to load.
        """
        cls._validate_input(nlp_engine_configuration)

        return cls(**nlp_engine_configuration)

    def to_dict(self) -> Dict:
        """Return the configuration as a dict."""
        return self.__dict__

    def __str__(self) -> str:  # noqa D105
        return str(self.to_dict())

    def __repr__(self) -> str:  # noqa D105
        return str(self)
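
The private `_validate_input`/`__validate_type` helpers boil down to a key-to-type schema check over whichever keys are present. A minimal standalone sketch (the `validate_types` name and schema shape are illustrative, not part of the presidio API):

```python
from collections.abc import Collection
from typing import Dict, Type


def validate_types(config: Dict, schema: Dict[str, Type]) -> None:
    """Raise ValueError if a present key has a value of the wrong type."""
    # Only keys that appear in the config are checked;
    # missing keys simply fall back to the dataclass defaults.
    for key, expected in schema.items():
        if key in config and not isinstance(config[key], expected):
            raise ValueError(f"{key} must be of type {expected}")
```

Because `list` and `set` both satisfy `Collection`, a value like `"labels_to_ignore": ["O"]` passes the check without the caller committing to a concrete container type.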

from_dict classmethod

from_dict(nlp_engine_configuration: Dict) -> NerModelConfiguration

Load NLP engine configuration from dict.

PARAMETER DESCRIPTION
nlp_engine_configuration

Dict with the configuration to load.

TYPE: Dict

Source code in presidio_analyzer/nlp_engine/ner_model_configuration.py
@classmethod
def from_dict(cls, nlp_engine_configuration: Dict) -> "NerModelConfiguration":
    """Load NLP engine configuration from dict.

    :param nlp_engine_configuration: Dict with the configuration to load.
    """
    cls._validate_input(nlp_engine_configuration)

    return cls(**nlp_engine_configuration)

to_dict

to_dict() -> Dict

Return the configuration as a dict.

Source code in presidio_analyzer/nlp_engine/ner_model_configuration.py
def to_dict(self) -> Dict:
    """Return the configuration as a dict."""
    return self.__dict__
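
Together, `from_dict` and `to_dict` give a simple dict round-trip: any configuration produced by `to_dict` can be fed back into `from_dict`. Sketched here with a reduced stand-in dataclass (`MiniNerConfig` is hypothetical, not a presidio class):

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MiniNerConfig:
    stride: int = 14
    default_score: float = 0.85

    @classmethod
    def from_dict(cls, d: Dict) -> "MiniNerConfig":
        # unspecified keys fall back to the dataclass defaults
        return cls(**d)

    def to_dict(self) -> Dict:
        return dict(self.__dict__)


cfg = MiniNerConfig.from_dict({"stride": 7})
```

Since dataclasses generate `__eq__`, the round-trip `from_dict(cfg.to_dict()) == cfg` holds by construction.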

NlpArtifacts

NlpArtifacts is an abstraction layer over the results of an NLP pipeline.

Given a processed text, it holds attributes such as entities, tokens, and lemmas, which can be used by any recognizer.

PARAMETER DESCRIPTION
entities

Identified entities

TYPE: List[Span]

tokens

Tokenized text

TYPE: Doc

tokens_indices

Indices of tokens

TYPE: List[int]

lemmas

List of lemmas in text

TYPE: List[str]

nlp_engine

NlpEngine object

TYPE: NlpEngine

language

Text language

TYPE: str

scores

Entity confidence scores

TYPE: Optional[List[float]] DEFAULT: None

METHOD DESCRIPTION
set_keywords

Return keywords for text.

to_json

Convert nlp artifacts to json.

Source code in presidio_analyzer/nlp_engine/nlp_artifacts.py
class NlpArtifacts:
    """
    NlpArtifacts is an abstraction layer over the results of an NLP pipeline.

    Given a processed text, it holds attributes such as entities,
    tokens, and lemmas, which can be used by any recognizer.

    :param entities: Identified entities
    :param tokens: Tokenized text
    :param tokens_indices: Indices of tokens
    :param lemmas: List of lemmas in text
    :param nlp_engine: NlpEngine object
    :param language: Text language
    :param scores: Entity confidence scores
    """

    def __init__(
        self,
        entities: List[Span],
        tokens: Doc,
        tokens_indices: List[int],
        lemmas: List[str],
        nlp_engine: "NlpEngine",  # noqa F821
        language: str,
        scores: Optional[List[float]] = None,
    ):
        self.entities = entities
        self.tokens = tokens
        self.lemmas = lemmas
        self.tokens_indices = tokens_indices
        self.keywords = self.set_keywords(nlp_engine, lemmas, language)
        self.nlp_engine = nlp_engine
        self.scores = scores if scores else [0.85] * len(entities)

    @staticmethod
    def set_keywords(
        nlp_engine,
        lemmas: List[str],
        language: str,  # noqa ANN001
    ) -> List[str]:
        """
        Return keywords for text.

        Extracts lemmas with certain conditions as keywords.
        """
        if not nlp_engine:
            return []
        keywords = [
            k.lower()
            for k in lemmas
            if not nlp_engine.is_stopword(k, language)
            and not nlp_engine.is_punct(k, language)
            and k != "-PRON-"
            and k != "be"
        ]

        # best effort: try even further to break tokens into sub-tokens,
        # which can reduce false negatives
        keywords = [i.split(":") for i in keywords]

        # splitting can produce a list of lists,
        # so we flatten it
        keywords = [item for sublist in keywords for item in sublist]
        return keywords

    def to_json(self) -> str:
        """Convert nlp artifacts to json."""

        return_dict = self.__dict__.copy()

        # Ignore NLP engine as it's not serializable currently
        del return_dict["nlp_engine"]

        # Converting spaCy tokens and spans to string as they are not serializable
        if "tokens" in return_dict:
            return_dict["tokens"] = [token.text for token in self.tokens]
        if "entities" in return_dict:
            return_dict["entities"] = [entity.text for entity in self.entities]
        if "scores" in return_dict:
            return_dict["scores"] = [float(score) for score in self.scores]

        return json.dumps(return_dict)

set_keywords staticmethod

set_keywords(nlp_engine, lemmas: List[str], language: str) -> List[str]

Return keywords for text.

Extracts lemmas with certain conditions as keywords.

Source code in presidio_analyzer/nlp_engine/nlp_artifacts.py
@staticmethod
def set_keywords(
    nlp_engine,
    lemmas: List[str],
    language: str,  # noqa ANN001
) -> List[str]:
    """
    Return keywords for text.

    Extracts lemmas with certain conditions as keywords.
    """
    if not nlp_engine:
        return []
    keywords = [
        k.lower()
        for k in lemmas
        if not nlp_engine.is_stopword(k, language)
        and not nlp_engine.is_punct(k, language)
        and k != "-PRON-"
        and k != "be"
    ]

    # best effort: try even further to break tokens into sub-tokens,
    # which can reduce false negatives
    keywords = [i.split(":") for i in keywords]

    # splitting can produce a list of lists,
    # so we flatten it
    keywords = [item for sublist in keywords for item in sublist]
    return keywords
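
The sub-token split-and-flatten step above is easy to miss; in isolation it behaves like this (a standalone sketch, not the presidio function itself):

```python
from typing import List


def split_and_flatten(keywords: List[str], sep: str = ":") -> List[str]:
    # str.split always returns a list, so this produces a list of lists...
    nested = [k.split(sep) for k in keywords]
    # ...which is flattened back into a single list of sub-tokens.
    return [item for sublist in nested for item in sublist]
```

A keyword with no separator passes through unchanged, since `"alice".split(":")` is just `["alice"]`.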

to_json

to_json() -> str

Convert nlp artifacts to json.

Source code in presidio_analyzer/nlp_engine/nlp_artifacts.py
def to_json(self) -> str:
    """Convert nlp artifacts to json."""

    return_dict = self.__dict__.copy()

    # Ignore NLP engine as it's not serializable currently
    del return_dict["nlp_engine"]

    # Converting spaCy tokens and spans to string as they are not serializable
    if "tokens" in return_dict:
        return_dict["tokens"] = [token.text for token in self.tokens]
    if "entities" in return_dict:
        return_dict["entities"] = [entity.text for entity in self.entities]
    if "scores" in return_dict:
        return_dict["scores"] = [float(score) for score in self.scores]

    return json.dumps(return_dict)
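
The same copy-drop-convert pattern works for any object whose attribute dict mixes serializable and non-serializable members (a sketch with illustrative names, not presidio's own helper):

```python
import json
from typing import Dict, Iterable


def to_json_safe(attrs: Dict, drop: Iterable[str] = ("nlp_engine",)) -> str:
    """Serialize an attribute dict, skipping keys json.dumps cannot handle."""
    # Build a filtered copy so the original attribute dict is untouched.
    safe = {k: v for k, v in attrs.items() if k not in drop}
    return json.dumps(safe)
```

Dropping the engine reference up front is simpler than writing a custom `JSONEncoder` for a member that has no meaningful JSON form.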

NlpEngine

Bases: ABC

NlpEngine is an abstraction layer over the nlp module.

It provides NLP preprocessing functionality as well as other queries on tokens.

METHOD DESCRIPTION
load

Load the NLP model.

is_loaded

Return True if the model is already loaded.

process_text

Execute the NLP pipeline on the given text and language.

process_batch

Execute the NLP pipeline on a batch of texts.

is_stopword

Return true if the given word is a stop word.

is_punct

Return true if the given word is a punctuation word.

get_supported_entities

Return the supported entities for this NLP engine.

get_supported_languages

Return the supported languages for this NLP engine.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
class NlpEngine(ABC):
    """
    NlpEngine is an abstraction layer over the nlp module.

    It provides NLP preprocessing functionality as well as other queries
    on tokens.
    """

    @abstractmethod
    def load(self) -> None:
        """Load the NLP model."""

    @abstractmethod
    def is_loaded(self) -> bool:
        """Return True if the model is already loaded."""

    @abstractmethod
    def process_text(self, text: str, language: str) -> NlpArtifacts:
        """Execute the NLP pipeline on the given text and language."""

    @abstractmethod
    def process_batch(
        self,
        texts: Iterable[str],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        **kwargs,  # noqa ANN003
    ) -> Iterator[Tuple[str, NlpArtifacts]]:
        """Execute the NLP pipeline on a batch of texts.

        Returns an iterator of (text, NlpArtifacts) tuples.
        """

    @abstractmethod
    def is_stopword(self, word: str, language: str) -> bool:
        """
        Return true if the given word is a stop word.

        (within the given language)
        """

    @abstractmethod
    def is_punct(self, word: str, language: str) -> bool:
        """
        Return true if the given word is a punctuation word.

        (within the given language)
        """

    @abstractmethod
    def get_supported_entities(self) -> List[str]:
        """Return the supported entities for this NLP engine."""
        pass

    @abstractmethod
    def get_supported_languages(self) -> List[str]:
        """Return the supported languages for this NLP engine."""
        pass
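
Implementing this interface means overriding every abstract method; instantiating a subclass that leaves one out raises `TypeError`. A minimal toy subclass showing the pattern for just `is_stopword` (`MiniNlpEngine`, `TinyEngine`, and the word list are made up for illustration):

```python
from abc import ABC, abstractmethod


class MiniNlpEngine(ABC):
    @abstractmethod
    def is_stopword(self, word: str, language: str) -> bool:
        """Return true if the given word is a stop word."""


class TinyEngine(MiniNlpEngine):
    # hard-coded stop word list per language, for demonstration only
    STOPWORDS = {"en": {"the", "a", "of"}}

    def is_stopword(self, word: str, language: str) -> bool:
        return word.lower() in self.STOPWORDS.get(language, set())
```

The ABC gives callers a single contract to program against, regardless of whether the concrete engine wraps spaCy, transformers, or something else.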

load abstractmethod

load() -> None

Load the NLP model.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def load(self) -> None:
    """Load the NLP model."""

is_loaded abstractmethod

is_loaded() -> bool

Return True if the model is already loaded.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def is_loaded(self) -> bool:
    """Return True if the model is already loaded."""

process_text abstractmethod

process_text(text: str, language: str) -> NlpArtifacts

Execute the NLP pipeline on the given text and language.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def process_text(self, text: str, language: str) -> NlpArtifacts:
    """Execute the NLP pipeline on the given text and language."""

process_batch abstractmethod

process_batch(
    texts: Iterable[str],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs
) -> Iterator[Tuple[str, NlpArtifacts]]

Execute the NLP pipeline on a batch of texts.

Returns an iterator of (text, NlpArtifacts) tuples.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def process_batch(
    self,
    texts: Iterable[str],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    **kwargs,  # noqa ANN003
) -> Iterator[Tuple[str, NlpArtifacts]]:
    """Execute the NLP pipeline on a batch of texts.

    Returns an iterator of (text, NlpArtifacts) tuples.
    """

is_stopword abstractmethod

is_stopword(word: str, language: str) -> bool

Return true if the given word is a stop word.

(within the given language)

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def is_stopword(self, word: str, language: str) -> bool:
    """
    Return true if the given word is a stop word.

    (within the given language)
    """

is_punct abstractmethod

is_punct(word: str, language: str) -> bool

Return true if the given word is a punctuation word.

(within the given language)

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def is_punct(self, word: str, language: str) -> bool:
    """
    Return true if the given word is a punctuation word.

    (within the given language)
    """

get_supported_entities abstractmethod

get_supported_entities() -> List[str]

Return the supported entities for this NLP engine.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def get_supported_entities(self) -> List[str]:
    """Return the supported entities for this NLP engine."""
    pass

get_supported_languages abstractmethod

get_supported_languages() -> List[str]

Return the supported languages for this NLP engine.

Source code in presidio_analyzer/nlp_engine/nlp_engine.py
@abstractmethod
def get_supported_languages(self) -> List[str]:
    """Return the supported languages for this NLP engine."""
    pass

SpacyNlpEngine

Bases: NlpEngine

SpacyNlpEngine is an abstraction layer over the nlp module.

It provides processing functionality as well as other queries on tokens. The SpacyNlpEngine uses spaCy as its NLP module.

METHOD DESCRIPTION
load

Load the spaCy NLP model.

get_supported_entities

Return the supported entities for this NLP engine.

get_supported_languages

Return the supported languages for this NLP engine.

is_loaded

Return True if the model is already loaded.

process_text

Execute the SpaCy NLP pipeline on the given text and language.

process_batch

Execute the NLP pipeline on a batch of texts using spacy pipe.

is_stopword

Return true if the given word is a stop word.

is_punct

Return true if the given word is a punctuation word.

get_nlp

Return the language model loaded for a language.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
class SpacyNlpEngine(NlpEngine):
    """
    SpacyNlpEngine is an abstraction layer over the nlp module.

    It provides processing functionality as well as other queries
    on tokens.
    The SpacyNlpEngine uses SpaCy as its NLP module
    """

    engine_name = "spacy"
    is_available = bool(spacy)

    def __init__(
        self,
        models: Optional[List[Dict[str, str]]] = None,
        ner_model_configuration: Optional[NerModelConfiguration] = None,
    ):
        """
        Initialize a wrapper on spaCy functionality.

        :param models: Dictionary with the name of the spaCy model per language.
        For example: models = [{"lang_code": "en", "model_name": "en_core_web_lg"}]
        :param ner_model_configuration: Parameters for the NER model.
        See conf/spacy.yaml for an example
        """
        if not models:
            models = [{"lang_code": "en", "model_name": "en_core_web_lg"}]
        self.models = models

        if not ner_model_configuration:
            ner_model_configuration = NerModelConfiguration()
        self.ner_model_configuration = ner_model_configuration

        self.nlp = None

    def load(self) -> None:
        """Load the spaCy NLP model."""
        logger.debug(f"Loading SpaCy models: {self.models}")

        self.nlp = {}
        # Download spaCy model if missing
        for model in self.models:
            self._validate_model_params(model)
            self._download_spacy_model_if_needed(model["model_name"])
            self.nlp[model["lang_code"]] = spacy.load(model["model_name"])

    @staticmethod
    def _download_spacy_model_if_needed(model_name: str) -> None:
        if not (spacy.util.is_package(model_name) or Path(model_name).exists()):
            logger.warning(f"Model {model_name} is not installed. Downloading...")
            spacy.cli.download(model_name)
            logger.info(f"Finished downloading model {model_name}")

    @staticmethod
    def _validate_model_params(model: Dict) -> None:
        if "lang_code" not in model:
            raise ValueError("lang_code is missing from model configuration")
        if "model_name" not in model:
            raise ValueError("model_name is missing from model configuration")
        if not isinstance(model["model_name"], str):
            raise ValueError("model_name must be a string")

    def get_supported_entities(self) -> List[str]:
        """Return the supported entities for this NLP engine."""
        if not self.ner_model_configuration.model_to_presidio_entity_mapping:
            raise ValueError(
                "model_to_presidio_entity_mapping is missing from model configuration"
            )
        entities_from_mapping = list(
            set(self.ner_model_configuration.model_to_presidio_entity_mapping.values())
        )
        entities = [
            ent
            for ent in entities_from_mapping
            if ent not in self.ner_model_configuration.labels_to_ignore
        ]
        return entities

    def get_supported_languages(self) -> List[str]:
        """Return the supported languages for this NLP engine."""
        if not self.nlp:
            raise ValueError("NLP engine is not loaded. Consider calling .load()")
        return list(self.nlp.keys())

    def is_loaded(self) -> bool:
        """Return True if the model is already loaded."""
        return self.nlp is not None

    def process_text(self, text: str, language: str) -> NlpArtifacts:
        """Execute the SpaCy NLP pipeline on the given text and language."""
        if not self.nlp:
            raise ValueError("NLP engine is not loaded. Consider calling .load()")

        doc = self.nlp[language](text)
        return self._doc_to_nlp_artifact(doc, language)

    def process_batch(
        self,
        texts: Union[List[str], List[Tuple[str, object]]],
        language: str,
        batch_size: int = 1,
        n_process: int = 1,
        as_tuples: bool = False,
    ) -> Iterator[Optional[NlpArtifacts]]:
        """Execute the NLP pipeline on a batch of texts using spacy pipe.

        :param texts: A list of texts to process.
        :param language: The language of the texts.
        :param batch_size: Default batch size for pipe and evaluate.
        :param n_process: Number of processors to process texts.
        :param as_tuples: If set to True, inputs should be a sequence of
            (text, context) tuples. Output will then be a sequence of
            (doc, context) tuples. Defaults to False.
        """

        if not self.nlp:
            raise ValueError("NLP engine is not loaded. Consider calling .load()")

        texts = (str(text) for text in texts)
        docs = self.nlp[language].pipe(
            texts, as_tuples=as_tuples, batch_size=batch_size, n_process=n_process
        )
        for doc in docs:
            yield doc.text, self._doc_to_nlp_artifact(doc, language)

    def is_stopword(self, word: str, language: str) -> bool:
        """
        Return true if the given word is a stop word.

        (within the given language)
        """
        return self.nlp[language].vocab[word].is_stop

    def is_punct(self, word: str, language: str) -> bool:
        """
        Return true if the given word is a punctuation word.

        (within the given language).
        """
        return self.nlp[language].vocab[word].is_punct

    def get_nlp(self, language: str) -> Language:
        """
        Return the language model loaded for a language.

        :param language: Language
        :return: Model from spaCy
        """
        return self.nlp[language]

    def _doc_to_nlp_artifact(self, doc: Doc, language: str) -> NlpArtifacts:
        lemmas = [token.lemma_ for token in doc]
        tokens_indices = [token.idx for token in doc]

        entities = self._get_entities(doc)
        scores = self._get_scores_for_entities(doc)

        entities, scores = self._get_updated_entities(entities, scores)

        return NlpArtifacts(
            entities=entities,
            tokens=doc,
            tokens_indices=tokens_indices,
            lemmas=lemmas,
            nlp_engine=self,
            language=language,
            scores=scores,
        )

    def _get_entities(self, doc: Doc) -> List[Span]:
        """
        Extract entities out of a spaCy pipeline, depending on the type of pipeline.

        For normal spaCy, this would be doc.ents
        :param doc: the output spaCy doc.
        :return: List of entities
        """

        return doc.ents

    def _get_scores_for_entities(self, doc: Doc) -> List[float]:
        """Extract scores for entities from the doc.

        Since spaCy does not provide confidence scores for entities by default,
        we use the default score from the ner model configuration.
        :param doc: SpaCy doc
        """

        entities = doc.ents
        scores = [self.ner_model_configuration.default_score] * len(entities)
        return scores

    def _get_updated_entities(
        self, entities: List[Span], scores: List[float]
    ) -> Tuple[List[Span], List[float]]:
        """
        Get an updated list of entities based on the ner model configuration.

        Remove entities that are in labels_to_ignore,
        update entity names based on model_to_presidio_entity_mapping

        :param entities: Entities that were extracted from a spaCy pipeline
        :param scores: Original confidence scores for the entities extracted
        :return: Tuple holding the entities and confidence scores
        """
        if len(entities) != len(scores):
            raise ValueError("Entities and scores must be the same length")

        new_entities = []
        new_scores = []

        mapping = self.ner_model_configuration.model_to_presidio_entity_mapping
        to_ignore = self.ner_model_configuration.labels_to_ignore
        for ent, score in zip(entities, scores):
            # Remove model labels in the ignore list
            if ent.label_ in to_ignore:
                continue

            # Update entity label based on mapping
            if ent.label_ in mapping:
                ent.label_ = mapping[ent.label_]
            else:
                logger.warning(
                    f"Entity {ent.label_} is not mapped to a Presidio entity, "
                    f"but keeping anyway. "
                    f"Add to `NerModelConfiguration.labels_to_ignore` to remove."
                )

            # Remove presidio entities in the ignore list
            if ent.label_ in to_ignore:
                continue

            new_entities.append(ent)

            # Update score if entity is in low score entity names
            if ent.label_ in self.ner_model_configuration.low_score_entity_names:
                score *= self.ner_model_configuration.low_confidence_score_multiplier

            new_scores.append(score)

        return new_entities, new_scores
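
The filtering, relabeling, and score-dampening in `_get_updated_entities` can be followed with plain label strings instead of spaCy `Span` objects (a standalone sketch under that simplification; the function name and defaults are illustrative):

```python
from typing import Dict, List, Set, Tuple


def update_entities(labels: List[str], scores: List[float],
                    mapping: Dict[str, str], to_ignore: Set[str],
                    low_score_names: Set[str],
                    multiplier: float = 0.4) -> Tuple[List[str], List[float]]:
    new_labels, new_scores = [], []
    for label, score in zip(labels, scores):
        if label in to_ignore:              # drop model labels on the ignore list
            continue
        label = mapping.get(label, label)   # map model label -> Presidio entity
        if label in to_ignore:              # the mapped label may also be ignored
            continue
        if label in low_score_names:        # dampen known low-accuracy entities
            score *= multiplier
        new_labels.append(label)
        new_scores.append(score)
    return new_labels, new_scores
```

Note the ignore list is consulted twice: once against the raw model label and once against the mapped Presidio entity, mirroring the two `continue` checks in the source.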

load

load() -> None

Load the spaCy NLP model.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def load(self) -> None:
    """Load the spaCy NLP model."""
    logger.debug(f"Loading SpaCy models: {self.models}")

    self.nlp = {}
    # Download spaCy model if missing
    for model in self.models:
        self._validate_model_params(model)
        self._download_spacy_model_if_needed(model["model_name"])
        self.nlp[model["lang_code"]] = spacy.load(model["model_name"])

get_supported_entities

get_supported_entities() -> List[str]

Return the supported entities for this NLP engine.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def get_supported_entities(self) -> List[str]:
    """Return the supported entities for this NLP engine."""
    if not self.ner_model_configuration.model_to_presidio_entity_mapping:
        raise ValueError(
            "model_to_presidio_entity_mapping is missing from model configuration"
        )
    entities_from_mapping = list(
        set(self.ner_model_configuration.model_to_presidio_entity_mapping.values())
    )
    entities = [
        ent
        for ent in entities_from_mapping
        if ent not in self.ner_model_configuration.labels_to_ignore
    ]
    return entities

get_supported_languages

get_supported_languages() -> List[str]

Return the supported languages for this NLP engine.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def get_supported_languages(self) -> List[str]:
    """Return the supported languages for this NLP engine."""
    if not self.nlp:
        raise ValueError("NLP engine is not loaded. Consider calling .load()")
    return list(self.nlp.keys())

is_loaded

is_loaded() -> bool

Return True if the model is already loaded.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def is_loaded(self) -> bool:
    """Return True if the model is already loaded."""
    return self.nlp is not None

process_text

process_text(text: str, language: str) -> NlpArtifacts

Execute the SpaCy NLP pipeline on the given text and language.

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def process_text(self, text: str, language: str) -> NlpArtifacts:
    """Execute the SpaCy NLP pipeline on the given text and language."""
    if not self.nlp:
        raise ValueError("NLP engine is not loaded. Consider calling .load()")

    doc = self.nlp[language](text)
    return self._doc_to_nlp_artifact(doc, language)

process_batch

process_batch(
    texts: Union[List[str], List[Tuple[str, object]]],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    as_tuples: bool = False,
) -> Iterator[Optional[NlpArtifacts]]

Execute the NLP pipeline on a batch of texts using spacy pipe.

PARAMETER DESCRIPTION
texts

A list of texts to process.

TYPE: Union[List[str], List[Tuple[str, object]]]

language

The language of the texts.

TYPE: str

batch_size

Default batch size for pipe and evaluate.

TYPE: int DEFAULT: 1

n_process

Number of processors to process texts.

TYPE: int DEFAULT: 1

as_tuples

If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False.

TYPE: bool DEFAULT: False

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def process_batch(
    self,
    texts: Union[List[str], List[Tuple[str, object]]],
    language: str,
    batch_size: int = 1,
    n_process: int = 1,
    as_tuples: bool = False,
) -> Iterator[Optional[NlpArtifacts]]:
    """Execute the NLP pipeline on a batch of texts using spacy pipe.

    :param texts: A list of texts to process.
    :param language: The language of the texts.
    :param batch_size: Default batch size for pipe and evaluate.
    :param n_process: Number of processors to process texts.
    :param as_tuples: If set to True, inputs should be a sequence of
        (text, context) tuples. Output will then be a sequence of
        (doc, context) tuples. Defaults to False.
    """

    if not self.nlp:
        raise ValueError("NLP engine is not loaded. Consider calling .load()")

    texts = (str(text) for text in texts)
    docs = self.nlp[language].pipe(
        texts, as_tuples=as_tuples, batch_size=batch_size, n_process=n_process
    )
    for doc in docs:
        yield doc.text, self._doc_to_nlp_artifact(doc, language)
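Note that, despite the `Iterator[Optional[NlpArtifacts]]` annotation, the implementation above yields `(text, NlpArtifacts)` pairs, so callers should unpack tuples. A consumption sketch using a stand-in generator (the real call would be `engine.process_batch(texts, language="en")`):

```python
# Stand-in for process_batch: mirrors its (text, artifacts) yield shape
# without requiring a loaded NLP engine. The dict stands in for NlpArtifacts.
def fake_process_batch(texts, language="en"):
    for text in texts:
        yield str(text), {"language": language}

pairs = list(fake_process_batch(["My name is Dan", "Call 555-0100"]))
for text, artifacts in pairs:
    print(text, artifacts["language"])
```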

is_stopword

is_stopword(word: str, language: str) -> bool

Return true if the given word is a stop word (within the given language).

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def is_stopword(self, word: str, language: str) -> bool:
    """
    Return true if the given word is a stop word.

    (within the given language)
    """
    return self.nlp[language].vocab[word].is_stop

is_punct

is_punct(word: str, language: str) -> bool

Return true if the given word is a punctuation token (within the given language).

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def is_punct(self, word: str, language: str) -> bool:
    """
    Return true if the given word is a punctuation word.

    (within the given language).
    """
    return self.nlp[language].vocab[word].is_punct

get_nlp

get_nlp(language: str) -> Language

Return the language model loaded for a language.

PARAMETER DESCRIPTION
language

Language

TYPE: str

RETURNS DESCRIPTION
Language

Model from spaCy

Source code in presidio_analyzer/nlp_engine/spacy_nlp_engine.py
def get_nlp(self, language: str) -> Language:
    """
    Return the language model loaded for a language.

    :param language: Language
    :return: Model from spaCy
    """
    return self.nlp[language]

StanzaNlpEngine

Bases: SpacyNlpEngine

StanzaNlpEngine is an abstraction layer over the nlp module.

It provides processing functionality as well as other queries on tokens. The StanzaNlpEngine uses spacy-stanza and stanza as its NLP module.

PARAMETER DESCRIPTION
models

Dictionary with the name of the Stanza model per language. For example: models = [{"lang_code": "en", "model_name": "en"}]

TYPE: Optional[List[Dict[str, str]]] DEFAULT: None

ner_model_configuration

Parameters for the NER model. See conf/stanza.yaml for an example

TYPE: Optional[NerModelConfiguration] DEFAULT: None

METHOD DESCRIPTION
is_loaded

Return True if the model is already loaded.

process_text

Execute the SpaCy NLP pipeline on the given text and language.

process_batch

Execute the NLP pipeline on a batch of texts using spacy pipe.

is_stopword

Return true if the given word is a stop word.

is_punct

Return true if the given word is a punctuation word.

get_supported_entities

Return the supported entities for this NLP engine.

get_supported_languages

Return the supported languages for this NLP engine.

get_nlp

Return the language model loaded for a language.

load

Load the NLP model.

Source code in presidio_analyzer/nlp_engine/stanza_nlp_engine.py
class StanzaNlpEngine(SpacyNlpEngine):
    """
    StanzaNlpEngine is an abstraction layer over the nlp module.

    It provides processing functionality as well as other queries
    on tokens.
    The StanzaNlpEngine uses spacy-stanza and stanza as its NLP module

    :param models: Dictionary with the name of the spaCy model per language.
    For example: models = [{"lang_code": "en", "model_name": "en"}]
    :param ner_model_configuration: Parameters for the NER model.
    See conf/stanza.yaml for an example

    """

    engine_name = "stanza"
    is_available = bool(stanza)

    def __init__(
        self,
        models: Optional[List[Dict[str, str]]] = None,
        ner_model_configuration: Optional[NerModelConfiguration] = None,
        download_if_missing: bool = True,
    ):
        super().__init__(models, ner_model_configuration)
        self.download_if_missing = download_if_missing

    def load(self) -> None:
        """Load the NLP model."""

        logger.debug(f"Loading Stanza models: {self.models}")

        self.nlp = {}
        for model in self.models:
            self._validate_model_params(model)
            self.nlp[model["lang_code"]] = load_pipeline(
                model["model_name"],
                processors="tokenize,pos,lemma,ner",
                download_method="DOWNLOAD_RESOURCES"
                if self.download_if_missing
                else None,
            )
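A configuration sketch for selecting the Stanza engine through NlpEngineProvider (the `"en"` model name follows the docstring example above; the stanza and spacy-stanza packages must be installed for the engine to be available):

```python
# Configuration dict selecting the Stanza engine. create_engine() is
# skipped here because it downloads Stanza resources.
stanza_configuration = {
    "nlp_engine_name": "stanza",
    "models": [{"lang_code": "en", "model_name": "en"}],
}

try:
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    provider = NlpEngineProvider(nlp_configuration=stanza_configuration)
except Exception:
    provider = None  # presidio-analyzer not installed in this environment
```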

load

load() -> None

Load the NLP model.

Source code in presidio_analyzer/nlp_engine/stanza_nlp_engine.py
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
def load(self) -> None:
    """Load the NLP model."""

    logger.debug(f"Loading Stanza models: {self.models}")

    self.nlp = {}
    for model in self.models:
        self._validate_model_params(model)
        self.nlp[model["lang_code"]] = load_pipeline(
            model["model_name"],
            processors="tokenize,pos,lemma,ner",
            download_method="DOWNLOAD_RESOURCES"
            if self.download_if_missing
            else None,
        )

TransformersNlpEngine

Bases: SpacyNlpEngine

TransformersNlpEngine is a transformers based NlpEngine.

It comprises a spaCy pipeline used for tokenization, lemmatization and POS tagging, and a transformers component for NER.

Both the underlying spaCy pipeline and the transformers model can be configured by the user. Example: [{"lang_code": "en", "model_name": { "spacy": "en_core_web_sm", "transformers": "dslim/bert-base-NER" } }]

PARAMETER DESCRIPTION
models

A dict holding the model's configuration.

TYPE: Optional[List[Dict]] DEFAULT: None

ner_model_configuration

Parameters for the NER model. See conf/transformers.yaml for an example. Note that since the spaCy model is not used for NER, we recommend using a simple model, such as en_core_web_sm for English. For potential Transformers models, see the list of models here: https://huggingface.co/models?pipeline_tag=token-classification. It is further recommended to fine-tune these models to the specific scenario at hand.

TYPE: Optional[NerModelConfiguration] DEFAULT: None

METHOD DESCRIPTION
is_loaded

Return True if the model is already loaded.

process_text

Execute the SpaCy NLP pipeline on the given text and language.

process_batch

Execute the NLP pipeline on a batch of texts using spacy pipe.

is_stopword

Return true if the given word is a stop word.

is_punct

Return true if the given word is a punctuation word.

get_supported_entities

Return the supported entities for this NLP engine.

get_supported_languages

Return the supported languages for this NLP engine.

get_nlp

Return the language model loaded for a language.

load

Load the spaCy and transformers models.

Source code in presidio_analyzer/nlp_engine/transformers_nlp_engine.py
class TransformersNlpEngine(SpacyNlpEngine):
    """

    TransformersNlpEngine is a transformers based NlpEngine.

    It comprises a spacy pipeline used for tokenization,
    lemmatization, pos, and a transformers component for NER.

    Both the underlying spacy pipeline and the transformers engine could be
    configured by the user.
    :param models: A dict holding the model's configuration.
    :example:
    [{"lang_code": "en", "model_name": {
            "spacy": "en_core_web_sm",
            "transformers": "dslim/bert-base-NER"
            }
    }]
    :param ner_model_configuration: Parameters for the NER model.
    See conf/transformers.yaml for an example


    Note that since the spaCy model is not used for NER,
    we recommend using a simple model, such as en_core_web_sm for English.
    For potential Transformers models, see a list of models here:
    https://huggingface.co/models?pipeline_tag=token-classification
    It is further recommended to fine-tune these models
    to the specific scenario in hand.

    """

    engine_name = "transformers"
    is_available = bool(spacy_huggingface_pipelines)

    def __init__(
        self,
        models: Optional[List[Dict]] = None,
        ner_model_configuration: Optional[NerModelConfiguration] = None,
    ):
        if not models:
            models = [
                {
                    "lang_code": "en",
                    "model_name": {
                        "spacy": "en_core_web_sm",
                        "transformers": "obi/deid_roberta_i2b2",
                    },
                }
            ]
        super().__init__(models=models, ner_model_configuration=ner_model_configuration)
        self.entity_key = "bert-base-ner"

    def load(self) -> None:
        """Load the spaCy and transformers models."""

        logger.debug(f"Loading SpaCy and transformers models: {self.models}")
        self.nlp = {}

        for model in self.models:
            self._validate_model_params(model)
            spacy_model = model["model_name"]["spacy"]
            transformers_model = model["model_name"]["transformers"]
            self._download_spacy_model_if_needed(spacy_model)

            nlp = spacy.load(spacy_model, disable=["parser", "ner"])
            nlp.add_pipe(
                "hf_token_pipe",
                config={
                    "model": transformers_model,
                    "annotate": "spans",
                    "stride": self.ner_model_configuration.stride,
                    "alignment_mode": self.ner_model_configuration.alignment_mode,
                    "aggregation_strategy": self.ner_model_configuration.aggregation_strategy,  # noqa E501
                    "annotate_spans_key": self.entity_key,
                },
            )
            self.nlp[model["lang_code"]] = nlp

    @staticmethod
    def _validate_model_params(model: Dict) -> None:
        if "lang_code" not in model:
            raise ValueError("lang_code is missing from model configuration")
        if "model_name" not in model:
            raise ValueError("model_name is missing from model configuration")
        if not isinstance(model["model_name"], dict):
            raise ValueError("model_name must be a dictionary")
        if "spacy" not in model["model_name"]:
            raise ValueError("spacy model name is missing from model configuration")
        if "transformers" not in model["model_name"]:
            raise ValueError(
                "transformers model name is missing from model configuration"
            )

    def _get_entities(self, doc: Doc) -> List[Span]:
        """
        Extract entities out of a spaCy pipeline, depending on the type of pipeline.

        For spacy-huggingface-pipeline, this would be doc.spans[key]
        :param doc: the output spaCy doc.
        :return: List of entities
        """

        return doc.spans[self.entity_key]

    def _get_scores_for_entities(self, doc: Doc) -> List[float]:
        """Extract scores for entities from the doc.

        While spaCy does not provide confidence scores,
        the spacy-huggingface-pipeline flow adds confidence scores
        as SpanGroup attributes.
        :param doc: SpaCy doc
        """

        return doc.spans[self.entity_key].attrs["scores"]
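For this engine, `model_name` is a dict naming both a spaCy and a transformers model, matching the checks in `_validate_model_params` above (the model names below mirror the class defaults shown in the source):

```python
# Nested model configuration for TransformersNlpEngine: spaCy handles
# tokenization/lemmas/POS, the transformers model handles NER.
transformers_configuration = {
    "nlp_engine_name": "transformers",
    "models": [
        {
            "lang_code": "en",
            "model_name": {
                "spacy": "en_core_web_sm",
                "transformers": "obi/deid_roberta_i2b2",
            },
        }
    ],
}

model = transformers_configuration["models"][0]
assert isinstance(model["model_name"], dict)  # required by _validate_model_params
```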

load

load() -> None

Load the spaCy and transformers models.

Source code in presidio_analyzer/nlp_engine/transformers_nlp_engine.py
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
def load(self) -> None:
    """Load the spaCy and transformers models."""

    logger.debug(f"Loading SpaCy and transformers models: {self.models}")
    self.nlp = {}

    for model in self.models:
        self._validate_model_params(model)
        spacy_model = model["model_name"]["spacy"]
        transformers_model = model["model_name"]["transformers"]
        self._download_spacy_model_if_needed(spacy_model)

        nlp = spacy.load(spacy_model, disable=["parser", "ner"])
        nlp.add_pipe(
            "hf_token_pipe",
            config={
                "model": transformers_model,
                "annotate": "spans",
                "stride": self.ner_model_configuration.stride,
                "alignment_mode": self.ner_model_configuration.alignment_mode,
                "aggregation_strategy": self.ner_model_configuration.aggregation_strategy,  # noqa E501
                "annotate_spans_key": self.entity_key,
            },
        )
        self.nlp[model["lang_code"]] = nlp

NlpEngineProvider

Create different NLP engines from configuration.

Example configuration: { "nlp_engine_name": "spacy", "models": [{"lang_code": "en", "model_name": "en_core_web_lg" }] }. NLP engine names available by default: spacy, stanza, transformers.

PARAMETER DESCRIPTION
nlp_engines

List of available NLP engines. Default: (SpacyNlpEngine, StanzaNlpEngine, TransformersNlpEngine)

TYPE: Optional[Tuple] DEFAULT: None

nlp_configuration

Dict containing nlp configuration

TYPE: Optional[Dict] DEFAULT: None

conf_file

Path to yaml file containing nlp engine configuration.

TYPE: Optional[Union[Path, str]] DEFAULT: None

METHOD DESCRIPTION
create_engine

Create an NLP engine instance.

Source code in presidio_analyzer/nlp_engine/nlp_engine_provider.py
class NlpEngineProvider:
    """Create different NLP engines from configuration.

    :param nlp_engines: List of available NLP engines.
    Default: (SpacyNlpEngine, StanzaNlpEngine)
    :param nlp_configuration: Dict containing nlp configuration
    :example: configuration:
            {
                "nlp_engine_name": "spacy",
                "models": [{"lang_code": "en",
                            "model_name": "en_core_web_lg"
                          }]
            }
    Nlp engine names available by default: spacy, stanza.
    :param conf_file: Path to yaml file containing nlp engine configuration.
    """

    def __init__(
        self,
        nlp_engines: Optional[Tuple] = None,
        conf_file: Optional[Union[Path, str]] = None,
        nlp_configuration: Optional[Dict] = None,
    ):
        if not nlp_engines:
            nlp_engines = (SpacyNlpEngine, StanzaNlpEngine, TransformersNlpEngine)

        self.nlp_engines = {
            engine.engine_name: engine for engine in nlp_engines if engine.is_available
        }
        logger.debug(
            f"Loaded these available nlp engines: {list(self.nlp_engines.keys())}"
        )

        if conf_file and nlp_configuration:
            raise ValueError(
                "Either conf_file or nlp_configuration should be provided, not both."
            )

        if nlp_configuration:
            self.nlp_configuration = nlp_configuration

        if conf_file:
            self.nlp_configuration = self._read_nlp_conf(conf_file)

        if conf_file is None and nlp_configuration is None:
            conf_file = self._get_full_conf_path()
            logger.debug(f"Reading default conf file from {conf_file}")
            self.nlp_configuration = self._read_nlp_conf(conf_file)

    def create_engine(self) -> NlpEngine:
        """Create an NLP engine instance."""
        if (
            not self.nlp_configuration
            or not self.nlp_configuration.get("models")
            or not self.nlp_configuration.get("nlp_engine_name")
        ):
            raise ValueError(
                "Illegal nlp configuration. "
                "Configuration should include nlp_engine_name and models "
                "(list of model_name for each lang_code)."
            )
        nlp_engine_name = self.nlp_configuration["nlp_engine_name"]
        if nlp_engine_name not in self.nlp_engines:
            raise ValueError(
                f"NLP engine '{nlp_engine_name}' is not available. "
                "Make sure you have all required packages installed"
            )
        try:
            nlp_engine_class = self.nlp_engines[nlp_engine_name]
            nlp_models = self.nlp_configuration["models"]

            ner_model_configuration = self.nlp_configuration.get(
                "ner_model_configuration"
            )
            if ner_model_configuration:
                ner_model_configuration = NerModelConfiguration.from_dict(
                    ner_model_configuration
                )

            engine = nlp_engine_class(
                models=nlp_models, ner_model_configuration=ner_model_configuration
            )
            engine.load()
            logger.info(
                f"Created NLP engine: {engine.engine_name}. "
                f"Loaded models: {list(engine.nlp.keys())}"
            )
            return engine
        except KeyError:
            raise ValueError("Wrong NLP engine configuration")

    @staticmethod
    def _read_nlp_conf(conf_file: Union[Path, str]) -> dict:
        """Read the nlp configuration from a provided yaml file."""

        if not Path(conf_file).exists():
            nlp_configuration = {
                "nlp_engine_name": "spacy",
                "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
            }
            logger.warning(
                f"configuration file {conf_file} not found.  "
                f"Using default config: {nlp_configuration}."
            )

        else:
            with open(conf_file) as file:
                nlp_configuration = yaml.safe_load(file)

        if "ner_model_configuration" not in nlp_configuration:
            logger.warning(
                "configuration file is missing 'ner_model_configuration'. Using default"
            )

        return nlp_configuration

    @staticmethod
    def _get_full_conf_path(
        default_conf_file: Union[Path, str] = "default.yaml",
    ) -> Path:
        """Return a Path to the default conf file."""
        return Path(Path(__file__).parent.parent, "conf", default_conf_file)

create_engine

create_engine() -> NlpEngine

Create an NLP engine instance.

Source code in presidio_analyzer/nlp_engine/nlp_engine_provider.py
def create_engine(self) -> NlpEngine:
    """Create an NLP engine instance."""
    if (
        not self.nlp_configuration
        or not self.nlp_configuration.get("models")
        or not self.nlp_configuration.get("nlp_engine_name")
    ):
        raise ValueError(
            "Illegal nlp configuration. "
            "Configuration should include nlp_engine_name and models "
            "(list of model_name for each lang_code)."
        )
    nlp_engine_name = self.nlp_configuration["nlp_engine_name"]
    if nlp_engine_name not in self.nlp_engines:
        raise ValueError(
            f"NLP engine '{nlp_engine_name}' is not available. "
            "Make sure you have all required packages installed"
        )
    try:
        nlp_engine_class = self.nlp_engines[nlp_engine_name]
        nlp_models = self.nlp_configuration["models"]

        ner_model_configuration = self.nlp_configuration.get(
            "ner_model_configuration"
        )
        if ner_model_configuration:
            ner_model_configuration = NerModelConfiguration.from_dict(
                ner_model_configuration
            )

        engine = nlp_engine_class(
            models=nlp_models, ner_model_configuration=ner_model_configuration
        )
        engine.load()
        logger.info(
            f"Created NLP engine: {engine.engine_name}. "
            f"Loaded models: {list(engine.nlp.keys())}"
        )
        return engine
    except KeyError:
        raise ValueError("Wrong NLP engine configuration")

Predefined Recognizers

presidio_analyzer.predefined_recognizers

Predefined recognizers package. Holds all the default recognizers.

TransformersRecognizer

Bases: SpacyRecognizer

Recognize entities using the spacy-huggingface-pipeline package.

The recognizer doesn't run transformers models, but loads the output from the NlpArtifacts. See https://huggingface.co/docs/transformers/main/en/index for transformer models, and https://github.com/explosion/spacy-huggingface-pipelines for the spaCy wrapper to transformers.

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as argument.

build_explanation

Create explanation for why this result was detected.

Source code in presidio_analyzer/predefined_recognizers/transformers_recognizer.py
class TransformersRecognizer(SpacyRecognizer):
    """
    Recognize entities using the spacy-huggingface-pipeline package.

    The recognizer doesn't run transformers models,
    but loads the output from the NlpArtifacts
    See:
     - https://huggingface.co/docs/transformers/main/en/index for transformer models
     - https://github.com/explosion/spacy-huggingface-pipelines on the spaCy wrapper to transformers
    """  # noqa E501

    ENTITIES = [
        "PERSON",
        "LOCATION",
        "ORGANIZATION",
        "AGE",
        "ID",
        "EMAIL",
        "DATE_TIME",
        "PHONE_NUMBER",
    ]

    def __init__(self, **kwargs):  # noqa ANN003
        self.DEFAULT_EXPLANATION = self.DEFAULT_EXPLANATION.replace(
            "Spacy", "Transformers"
        )
        super().__init__(**kwargs)

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, used for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
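
The behaviour can be illustrated with a minimal stand-in for RecognizerResult (the `Result` class below is hypothetical, defined only for this sketch): results are sorted by score descending, and a result contained in an already-kept result of the same entity type is dropped.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for RecognizerResult, for illustration only."""

    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results: List[Result]) -> List[Result]:
    # Sort by descending score, then position, then descending length.
    results = sorted(
        set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start))
    )
    kept: List[Result] = []
    for result in results:
        if result.score == 0:
            continue
        # Drop results contained in an already-kept result of the same type.
        if not any(
            result.contained_in(k) and result.entity_type == k.entity_type
            for k in kept
        ):
            kept.append(result)
    return kept


kept = remove_duplicates(
    [
        Result("PERSON", 0, 10, 0.85),
        Result("PERSON", 0, 5, 0.4),  # contained in the higher-scoring PERSON span
        Result("PHONE_NUMBER", 20, 30, 0.6),
    ]
)
# kept holds the (0, 10) PERSON result and the PHONE_NUMBER result
```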

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to be replaced and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
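
For example, the replacement pairs that recognizers typically pass in strip dashes and spaces before validation. A standalone sketch of the same logic:

```python
from typing import List, Tuple


def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """Replace each search string with its replacement value."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


sanitize_value("12-345 678", [("-", ""), (" ", "")])  # "12345678"
```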

build_explanation

build_explanation(
    original_score: float, explanation: str
) -> AnalysisExplanation

Create explanation for why this result was detected.

PARAMETER DESCRIPTION
original_score

Score given by this recognizer

TYPE: float

explanation

Explanation string

TYPE: str

RETURNS DESCRIPTION
AnalysisExplanation
Source code in presidio_analyzer/predefined_recognizers/spacy_recognizer.py
def build_explanation(
    self, original_score: float, explanation: str
) -> AnalysisExplanation:
    """
    Create explanation for why this result was detected.

    :param original_score: Score given by this recognizer
    :param explanation: Explanation string
    :return:
    """
    explanation = AnalysisExplanation(
        recognizer=self.name,
        original_score=original_score,
        textual_explanation=explanation,
    )
    return explanation

AbaRoutingRecognizer

Bases: PatternRecognizer

Recognize American Banking Association (ABA) routing number.

Also known as a routing transit number (RTN), it is used to identify financial institutions and to process transactions.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'ABA_ROUTING_NUMBER'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/aba_routing_recognizer.py
class AbaRoutingRecognizer(PatternRecognizer):
    """
    Recognize American Banking Association (ABA) routing number.

    Also known as routing transit number (RTN) and used to identify financial
    institutions and process transactions.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "ABA routing number (weak)",
            r"\b[0123678]\d{8}\b",
            0.05,
        ),
        Pattern(
            "ABA routing number",
            r"\b[0123678]\d{3}-\d{4}-\d\b",
            0.3,
        ),
    ]

    CONTEXT = [
        "aba",
        "routing",
        "abarouting",
        "association",
        "bankrouting",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "ABA_ROUTING_NUMBER",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = replacement_pairs or [("-", "")]
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:  # noqa D102
        sanitized_value = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        return self.__checksum(sanitized_value)

    @staticmethod
    def __checksum(sanitized_value: str) -> bool:
        s = 0
        for idx, m in enumerate([3, 7, 1, 3, 7, 1, 3, 7, 1]):
            s += int(sanitized_value[idx]) * m
        return s % 10 == 0
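
As a quick illustration, the 3-7-1 weighted checksum used by `__checksum` above can be sketched standalone, assuming a 9-digit, already-sanitized routing number:

```python
def aba_checksum(routing_number: str) -> bool:
    """Return True when a 9-digit routing number passes the ABA checksum."""
    weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
    total = sum(int(digit) * weight for digit, weight in zip(routing_number, weights))
    return total % 10 == 0


aba_checksum("021000021")  # True: the weighted sum is 30, divisible by 10
aba_checksum("021000022")  # False: the weighted sum is 31
```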

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, used for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to be replaced and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
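
As a hedged illustration of the SSN example mentioned in the docstring (the helper below is hypothetical, not the actual recognizer code), an override could invalidate a match when any digit group consists of a single repeated digit:

```python
import re


def invalidate_result(pattern_text: str) -> bool:
    """Return True (invalidated) when any digit group repeats a single digit."""
    groups = re.split(r"[-. ]", pattern_text)
    return any(len(set(group)) == 1 for group in groups)


invalidate_result("111-11-1111")  # True: groups like "111" repeat one digit
invalidate_result("123-45-6789")  # False: the match is kept
```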

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

AuAbnRecognizer

Bases: PatternRecognizer

Recognizes Australian Business Number ("ABN").

The Australian Business Number (ABN) is a unique 11 digit identifier issued to all entities registered in the Australian Business Register (ABR). The 11 digit ABN is structured as a 9 digit identifier with two leading check digits. The leading check digits are derived using a modulus 89 calculation. This recognizer identifies ABN using regex, context words and checksum. Reference: https://abr.business.gov.au/Help/AbnFormat
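
The modulus 89 calculation mentioned above can be sketched as follows; this mirrors the published ABR algorithm (subtract 1 from the first digit, apply the weights, check divisibility by 89) under the assumption of an 11-digit, already-sanitized input, rather than quoting the recognizer's exact implementation:

```python
def abn_checksum(abn: str) -> bool:
    """Return True when an 11-digit ABN passes the modulus 89 check."""
    weights = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
    digits = [int(d) for d in abn]
    digits[0] -= 1  # subtract 1 from the first digit before weighting
    return sum(d * w for d, w in zip(digits, weights)) % 89 == 0


abn_checksum("51824753556")  # True: the weighted sum is 534 = 6 * 89
```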

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'AU_ABN'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/au_abn_recognizer.py
class AuAbnRecognizer(PatternRecognizer):
    """
    Recognizes Australian Business Number ("ABN").

    The Australian Business Number (ABN) is a unique 11
    digit identifier issued to all entities registered in
    the Australian Business Register (ABR).
    The 11 digit ABN is structured as a 9 digit identifier
    with two leading check digits.
    The leading check digits are derived using a modulus 89 calculation.
    This recognizer identifies ABN using regex, context words and checksum.
    Reference: https://abr.business.gov.au/Help/AbnFormat

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "ABN (Medium)",
            r"\b\d{2}\s\d{3}\s\d{3}\s\d{3}\b",
            0.1,
        ),
        Pattern(
            "ABN (Low)",
            r"\b\d{11}\b",
            0.01,
        ),
    ]

    CONTEXT = [
        "australian business number",
        "abn",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "AU_ABN",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        # Pre-processing before validation checks
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
        abn_list = [int(digit) for digit in text if not digit.isspace()]

        # Set weights based on digit position
        weight = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

        # Perform checksums
        abn_list[0] = 9 if abn_list[0] == 0 else abn_list[0] - 1
        sum_product = 0
        for i in range(11):
            sum_product += abn_list[i] * weight[i]
        remainder = sum_product % 89
        return remainder == 0
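The modulus-89 check used by `validate_result` above can be exercised on its own. Below is a minimal standalone sketch of the same calculation (the helper name `is_valid_abn` and the sample values are illustrative, not part of the Presidio API; the first value is one that passes the checksum):

```python
def is_valid_abn(abn: str) -> bool:
    """Check an Australian Business Number via the modulus-89 algorithm."""
    digits = [int(d) for d in abn if d.isdigit()]
    if len(digits) != 11:
        return False
    # Subtract 1 from the leading digit (0 wraps to 9), as in validate_result
    digits[0] = 9 if digits[0] == 0 else digits[0] - 1
    weights = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
    return sum(d * w for d, w in zip(digits, weights)) % 89 == 0

print(is_valid_abn("51 824 753 556"))  # True: weighted sum is divisible by 89
print(is_valid_abn("51 824 753 557"))  # False: changing one digit breaks it
```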

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
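The deduplication above sorts by descending score, then earlier start, then longer span, and drops any result contained in an already-kept result of the same entity type. The behaviour can be sketched with a minimal stand-in class (`Span` and `dedupe` are illustrative names, not part of the Presidio API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Minimal stand-in for RecognizerResult, for illustration only."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end


def dedupe(results):
    # Highest score first; ties broken by earlier start, then longer span
    results = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept = []
    for r in results:
        if r.score == 0 or r in kept:
            continue
        # Drop results contained in a kept result of the same entity type
        if any(r.contained_in(k) and r.entity_type == k.entity_type for k in kept):
            continue
        kept.append(r)
    return kept


spans = [
    Span("AU_ABN", 0, 14, 0.3),   # full match
    Span("AU_ABN", 3, 14, 0.1),   # contained in the full match -> dropped
    Span("AU_ACN", 20, 31, 0.0),  # zero score -> dropped
]
print(dedupe(spans))  # only the full AU_ABN match survives
```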

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string by applying the replacement pairs specified as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string) specifying what to replace and with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
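`sanitize_value` is a plain sequential `str.replace` over the pairs. A quick illustration with the default pairs the AU recognizers fall back to (the helper name `sanitize` is illustrative):

```python
def sanitize(text, replacement_pairs):
    # Apply each (search, replacement) pair in order, as sanitize_value does
    for search, repl in replacement_pairs:
        text = text.replace(search, repl)
    return text


# Default pairs used by the AU recognizers: strip dashes and spaces
print(sanitize("51-824 753 556", [("-", ""), (" ", "")]))  # 51824753556
```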

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/au_abn_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    # Pre-processing before validation checks
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
    abn_list = [int(digit) for digit in text if not digit.isspace()]

    # Set weights based on digit position
    weight = [10, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

    # Perform checksums
    abn_list[0] = 9 if abn_list[0] == 0 else abn_list[0] - 1
    sum_product = 0
    for i in range(11):
        sum_product += abn_list[i] * weight[i]
    remainder = sum_product % 89
    return remainder == 0

AuAcnRecognizer

Bases: PatternRecognizer

Recognizes Australian Company Number ("ACN").

The Australian Company Number (ACN) is a nine digit number with the last digit being a check digit calculated using a modified modulus 10 calculation. This recognizer identifies ACN using regex, context words, and checksum. Reference: https://asic.gov.au/

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'AU_ACN'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string by applying the replacement pairs specified as an argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/au_acn_recognizer.py
class AuAcnRecognizer(PatternRecognizer):
    """
    Recognizes Australian Company Number ("ACN").

    The Australian Company Number (ACN) is a nine digit number
    with the last digit being a check digit calculated using a
    modified modulus 10 calculation.
    This recognizer identifies ACN using regex, context words, and checksum.
    Reference: https://asic.gov.au/

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "ACN (Medium)",
            r"\b\d{3}\s\d{3}\s\d{3}\b",
            0.1,
        ),
        Pattern(
            "ACN (Low)",
            r"\b\d{9}\b",
            0.01,
        ),
    ]

    CONTEXT = [
        "australian company number",
        "acn",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "AU_ACN",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        # Pre-processing before validation checks
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
        acn_list = [int(digit) for digit in text if not digit.isspace()]

        # Set weights based on digit position
        weight = [8, 7, 6, 5, 4, 3, 2, 1]

        # Perform checksums
        sum_product = 0
        for i in range(8):
            sum_product += acn_list[i] * weight[i]
        remainder = sum_product % 10
        complement = 10 - remainder
        return complement == acn_list[-1]
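The ACN check above weights the first eight digits by 8 down to 1 and compares the complement of the modulus-10 remainder against the ninth digit. A minimal standalone sketch follows (the helper name `is_valid_acn` and the sample values are illustrative; note that the sketch maps a complement of 10 to a check digit of 0 via `% 10`, following the published ASIC procedure — an edge case the recognizer's own `complement == acn_list[-1]` comparison does not hit for these inputs):

```python
def is_valid_acn(acn: str) -> bool:
    """Check an Australian Company Number via the modified modulus-10 algorithm."""
    digits = [int(d) for d in acn if d.isdigit()]
    if len(digits) != 9:
        return False
    weights = [8, 7, 6, 5, 4, 3, 2, 1]
    remainder = sum(d * w for d, w in zip(digits, weights)) % 10
    # A complement of 10 is treated as a check digit of 0, hence the % 10
    return (10 - remainder) % 10 == digits[-1]


print(is_valid_acn("000 000 019"))  # True: complement 9 matches the check digit
print(is_valid_acn("000 000 018"))  # False: check digit mismatch
```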

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string by applying the replacement pairs specified as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string) specifying what to replace and with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/au_acn_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    # Pre-processing before validation checks
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
    acn_list = [int(digit) for digit in text if not digit.isspace()]

    # Set weights based on digit position
    weight = [8, 7, 6, 5, 4, 3, 2, 1]

    # Perform checksums
    sum_product = 0
    for i in range(8):
        sum_product += acn_list[i] * weight[i]
    remainder = sum_product % 10
    complement = 10 - remainder
    return complement == acn_list[-1]
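This weighted mod-10 check can be exercised without instantiating the recognizer. A standalone sketch mirroring the logic above (the helper name and sample values are illustrative; note that, as in the code above, a weighted remainder of 0 yields a complement of 10, which no single check digit can match):

```python
def acn_checksum_valid(acn: str) -> bool:
    """Weighted mod-10 check over the first 8 digits of a 9-digit ACN."""
    digits = [int(ch) for ch in acn if ch.isdigit()]
    if len(digits) != 9:
        return False
    weights = [8, 7, 6, 5, 4, 3, 2, 1]
    remainder = sum(d * w for d, w in zip(digits[:8], weights)) % 10
    # The last digit must equal the complement of the remainder.
    return 10 - remainder == digits[8]

print(acn_checksum_valid("000 000 019"))  # True
print(acn_checksum_valid("000 000 018"))  # False
```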

AuMedicareRecognizer

Bases: PatternRecognizer

Recognizes Australian Medicare number using regex, context words, and checksum.

A Medicare number is a unique identifier issued by the Australian Government that enables the cardholder to receive rebates for medical expenses under Australia's Medicare system. It uses a modulus 10 checksum scheme to validate the number. Reference: https://en.wikipedia.org/wiki/Medicare_card_(Australia)

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'AU_MEDICARE'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/au_medicare_recognizer.py
class AuMedicareRecognizer(PatternRecognizer):
    """
    Recognizes Australian Medicare number using regex, context words, and checksum.

    Medicare number is a unique identifier issued by the Australian Government
    that enables the cardholder to receive rebates for medical expenses
    under Australia's Medicare system.
    It uses a modulus 10 checksum scheme to validate the number.
    Reference: https://en.wikipedia.org/wiki/Medicare_card_(Australia)


    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "Australian Medicare Number (Medium)",
            r"\b[2-6]\d{3}\s\d{5}\s\d\b",
            0.1,
        ),
        Pattern(
            "Australian Medicare Number (Low)",
            r"\b[2-6]\d{9}\b",
            0.01,
        ),
    ]

    CONTEXT = [
        "medicare",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "AU_MEDICARE",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to be validated.
        Only the part of the text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        # Pre-processing before validation checks
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
        medicare_list = [int(digit) for digit in text if not digit.isspace()]

        # Set weights based on digit position
        weight = [1, 3, 7, 9, 1, 3, 7, 9]

        # Perform checksums
        sum_product = 0
        for i in range(8):
            sum_product += medicare_list[i] * weight[i]
        remainder = sum_product % 10
        return remainder == medicare_list[8]
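To see the modulus 10 scheme in action without installing presidio, here is a standalone sketch of the same weighted check. The sample value is synthetic, constructed only to satisfy the checksum, and is not a real Medicare number; its trailing digit plays the role of the issue number:

```python
def medicare_checksum_valid(value: str) -> bool:
    """Weighted mod-10 check: the 9th digit must equal the weighted sum mod 10."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    if len(digits) < 9:
        return False
    weights = [1, 3, 7, 9, 1, 3, 7, 9]
    remainder = sum(d * w for d, w in zip(digits[:8], weights)) % 10
    return remainder == digits[8]

print(medicare_checksum_valid("2123 45670 1"))  # True
print(medicare_checksum_valid("2123 45671 1"))  # False
```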

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

NLP artifacts containing elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: NLP artifacts containing elements
                          such as lemmatized tokens, used to improve
                          the accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
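In outline: results are sorted best-first, zero-score results are dropped, and a result fully contained in an already-kept result of the same entity type is discarded. A simplified standalone sketch with a minimal stand-in result type (hypothetical, for illustration only):

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[Span]) -> List[Span]:
    # Best first: highest score, then earlier start, then longer span.
    ordered = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept = []
    for r in ordered:
        if r.score == 0:
            continue  # drop zero-confidence results
        if any(r.contained_in(k) and r.entity_type == k.entity_type for k in kept):
            continue  # contained in a better result of the same type
        kept.append(r)
    return kept

spans = [
    Span("PHONE", 0, 12, 0.9),
    Span("PHONE", 0, 12, 0.9),  # exact duplicate, removed by set()
    Span("PHONE", 4, 12, 0.4),  # contained in the kept PHONE span
    Span("EMAIL", 4, 12, 0.4),  # contained, but a different type: kept
]
print(remove_duplicates(spans))
```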

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/au_medicare_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    # Pre-processing before validation checks
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
    medicare_list = [int(digit) for digit in text if not digit.isspace()]

    # Set weights based on digit position
    weight = [1, 3, 7, 9, 1, 3, 7, 9]

    # Perform checksums
    sum_product = 0
    for i in range(8):
        sum_product += medicare_list[i] * weight[i]
    remainder = sum_product % 10
    return remainder == medicare_list[8]

AuTfnRecognizer

Bases: PatternRecognizer

Recognizes Australian Tax File Numbers ("TFN").

The tax file number (TFN) is a unique identifier issued by the Australian Taxation Office to each taxpaying entity: an individual, company, superannuation fund, partnership, or trust. The TFN is a nine-digit number, usually presented in the format NNN NNN NNN, and includes a check digit for detecting erroneous numbers based on a simple modulo 11 scheme. This recognizer uses regex, context words, and a checksum to identify TFNs. Reference: https://www.ato.gov.au/individuals/tax-file-number/

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'AU_TFN'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/au_tfn_recognizer.py
class AuTfnRecognizer(PatternRecognizer):
    """
    Recognizes Australian Tax File Numbers ("TFN").

    The tax file number (TFN) is a unique identifier
    issued by the Australian Taxation Office
    to each taxpaying entity: an individual, company,
    superannuation fund, partnership, or trust.
    The TFN is a nine-digit number, usually
    presented in the format NNN NNN NNN,
    and includes a check digit for detecting
    erroneous numbers based on a simple modulo 11 scheme.
    This recognizer uses regex, context words,
    and a checksum to identify TFNs.
    Reference: https://www.ato.gov.au/individuals/tax-file-number/

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "TFN (Medium)",
            r"\b\d{3}\s\d{3}\s\d{3}\b",
            0.1,
        ),
        Pattern(
            "TFN (Low)",
            r"\b\d{9}\b",
            0.01,
        ),
    ]

    CONTEXT = [
        "tax file number",
        "tfn",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "AU_TFN",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to be validated.
        Only the part of the text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        # Pre-processing before validation checks
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
        tfn_list = [int(digit) for digit in text if not digit.isspace()]

        # Set weights based on digit position
        weight = [1, 4, 3, 7, 5, 8, 6, 9, 10]

        # Perform checksums
        sum_product = 0
        for i in range(9):
            sum_product += tfn_list[i] * weight[i]
        remainder = sum_product % 11
        return remainder == 0
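The TFN check can likewise be tried in isolation. A standalone sketch of the modulo 11 scheme (the helper name is illustrative; the sample value 123 456 782 is used only because it satisfies the checksum):

```python
def tfn_checksum_valid(tfn: str) -> bool:
    """A 9-digit TFN passes when the weighted digit sum is divisible by 11."""
    digits = [int(ch) for ch in tfn if ch.isdigit()]
    if len(digits) != 9:
        return False
    weights = [1, 4, 3, 7, 5, 8, 6, 9, 10]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0

print(tfn_checksum_valid("123 456 782"))  # True
print(tfn_checksum_valid("123 456 789"))  # False
```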

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

NLP artifacts containing elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: NLP artifacts containing elements
                          such as lemmatized tokens, used to improve
                          the accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
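Together, `to_dict` and `from_dict` round-trip a recognizer, with one asymmetry worth noting: serialization collapses the single-item `supported_entities` list into a `supported_entity` key, which is what the constructor expects back. A minimal stdlib sketch of that rename (the helper name here is illustrative, not part of the API):

```python
def collapse_entity_key(recognizer_dict: dict) -> dict:
    # Mirrors the rename done in to_dict(): single-item list -> scalar key
    out = dict(recognizer_dict)
    out["supported_entity"] = out.pop("supported_entities")[0]
    return out

serialized = collapse_entity_key(
    {"supported_entities": ["US_SSN"], "name": "Demo", "patterns": []}
)
```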

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
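The deduplication above can be exercised without the rest of Presidio. The sketch below uses a simplified stand-in for `RecognizerResult` (the `Result` class here is illustrative, not the real one) to show the sort order and the containment rule:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Result:
    """Simplified stand-in for RecognizerResult (illustrative only)."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[Result]) -> List[Result]:
    # Highest score first, then earliest start, then longest span
    ordered = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    filtered: List[Result] = []
    for r in ordered:
        if r.score == 0:
            continue  # zero-score results are dropped outright
        if any(r.contained_in(f) and r.entity_type == f.entity_type for f in filtered):
            continue  # contained in an already-kept result of the same type
        filtered.append(r)
    return filtered
```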

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string by applying the replacement pairs specified as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (value to replace, replacement value)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
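For instance, the default replacement pairs used by the credit-card recognizer strip dashes and spaces before checksum validation. A standalone re-creation of the call:

```python
def sanitize_value(text: str, replacement_pairs) -> str:
    # Same replacement loop as EntityRecognizer.sanitize_value above
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

cleaned = sanitize_value("4012-8888 8888-1881", [("-", ""), (" ", "")])
# cleaned == "4012888888881881"
```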

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a detected result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/au_tfn_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    # Pre-processing before validation checks
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
    tfn_list = [int(digit) for digit in text if not digit.isspace()]

    # Set weights based on digit position
    weight = [1, 4, 3, 7, 5, 8, 6, 9, 10]

    # Perform checksums
    sum_product = 0
    for i in range(9):
        sum_product += tfn_list[i] * weight[i]
    remainder = sum_product % 11
    return remainder == 0
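The TFN checksum can be reproduced standalone. The sketch below mirrors the weighted-sum logic above; `123 456 782` is a commonly used valid test value, not a real TFN:

```python
def tfn_checksum_valid(tfn: str) -> bool:
    # Weighted sum of the 9 digits must be divisible by 11
    digits = [int(ch) for ch in tfn if ch.isdigit()]
    weights = [1, 4, 3, 7, 5, 8, 6, 9, 10]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0

print(tfn_checksum_valid("123 456 782"))  # valid test value
print(tfn_checksum_valid("123 456 789"))  # fails the checksum
```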

AzureAILanguageRecognizer

Bases: RemoteRecognizer

Wrapper for PII detection using Azure AI Language.

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string by applying the replacement pairs specified as an argument.

get_supported_entities

Return the list of entities this recognizer can identify.

analyze

Analyze text using Azure AI Language.

Source code in presidio_analyzer/predefined_recognizers/azure_ai_language.py
class AzureAILanguageRecognizer(RemoteRecognizer):
    """Wrapper for PII detection using Azure AI Language."""

    def __init__(
        self,
        supported_entities: Optional[List[str]] = None,
        supported_language: str = "en",
        ta_client: Optional["TextAnalyticsClient"] = None,
        azure_ai_key: Optional[str] = None,
        azure_ai_endpoint: Optional[str] = None,
        **kwargs
    ):
        """
        Wrap the PII detection in Azure AI Language.

        :param supported_entities: List of supported entities for this recognizer.
        If None, all supported entities will be used.
        :param supported_language: Language code to use for the recognizer.
        :param ta_client: object of type TextAnalyticsClient. If missing,
        the client will be created using the key and endpoint.
        :param azure_ai_key: Azure AI for language key
        :param azure_ai_endpoint: Azure AI for language endpoint
        :param kwargs: Additional arguments required by the parent class

        For more info, see https://learn.microsoft.com/en-us/azure/ai-services/language-service/personally-identifiable-information/overview
        """  # noqa E501

        super().__init__(
            supported_entities=supported_entities,
            supported_language=supported_language,
            name="Azure AI Language PII",
            version="5.2.0",
            **kwargs
        )

        is_available = bool(TextAnalyticsClient)
        if not ta_client and not is_available:
            raise ValueError(
                "Azure AI Language is not available. "
                "Please install the required dependencies:"
                "1. azure-ai-textanalytics"
                "2. azure-core"
            )

        if not supported_entities:
            self.supported_entities = self.__get_azure_ai_supported_entities()

        if not ta_client:
            ta_client = self.__authenticate_client(azure_ai_key, azure_ai_endpoint)
        self.ta_client = ta_client

    def get_supported_entities(self) -> List[str]:
        """
        Return the list of entities this recognizer can identify.

        :return: A list of the supported entities by this recognizer
        """
        return self.supported_entities

    @staticmethod
    def __get_azure_ai_supported_entities() -> List[str]:
        """Return the list of all supported entities for Azure AI Language."""
        from azure.ai.textanalytics._models import PiiEntityCategory  # noqa

        return [r.value.upper() for r in PiiEntityCategory]

    @staticmethod
    def __authenticate_client(key: str, endpoint: str) -> TextAnalyticsClient:
        """Authenticate the client using the key and endpoint.

        :param key: Azure AI Language key
        :param endpoint: Azure AI Language endpoint
        """
        key = key if key else os.getenv("AZURE_AI_KEY", None)
        endpoint = endpoint if endpoint else os.getenv("AZURE_AI_ENDPOINT", None)
        if key is None:
            raise ValueError(
                "Azure AI Language key is required. "
                "Please provide a key or set the AZURE_AI_KEY environment variable."
            )
        if endpoint is None:
            raise ValueError(
                "Azure AI Language endpoint is required. "
                "Please provide an endpoint "
                "or set the AZURE_AI_ENDPOINT environment variable."
            )

        ta_credential = AzureKeyCredential(key)
        text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, credential=ta_credential
        )
        return text_analytics_client

    def analyze(
        self, text: str, entities: List[str] = None, nlp_artifacts: NlpArtifacts = None
    ) -> List[RecognizerResult]:
        """
        Analyze text using Azure AI Language.

        :param text: Text to analyze
        :param entities: List of entities to return
        :param nlp_artifacts: Object of type NlpArtifacts, not used in this recognizer.
        :return: A list of RecognizerResult, one per each entity found in the text.
        """
        if not entities:
            entities = self.supported_entities
        response = self.ta_client.recognize_pii_entities(
            [text], language=self.supported_language
        )
        results = [doc for doc in response if not doc.is_error]
        recognizer_results = []
        for res in results:
            for entity in res.entities:
                entity.category = entity.category.upper()
                if entity.category.lower() not in [
                    ent.lower() for ent in self.supported_entities
                ]:
                    continue
                if entity.category.lower() not in [ent.lower() for ent in entities]:
                    continue
                analysis_explanation = AzureAILanguageRecognizer._build_explanation(
                    original_score=entity.confidence_score,
                    entity_type=entity.category,
                )
                recognizer_results.append(
                    RecognizerResult(
                        entity_type=entity.category,
                        start=entity.offset,
                        end=entity.offset + entity.length,
                        score=entity.confidence_score,
                        analysis_explanation=analysis_explanation,
                    )
                )

        return recognizer_results

    @staticmethod
    def _build_explanation(
        original_score: float, entity_type: str
    ) -> AnalysisExplanation:
        explanation = AnalysisExplanation(
            recognizer=AzureAILanguageRecognizer.__class__.__name__,
            original_score=original_score,
            textual_explanation=f"Identified as {entity_type} by Azure AI Language",
        )
        return explanation

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_recognizer_results.

If a result score is boosted, the derived class must update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts, containing elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string by applying the replacement pairs specified as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (value to replace, replacement value)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/predefined_recognizers/azure_ai_language.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

analyze

analyze(
    text: str, entities: List[str] = None, nlp_artifacts: NlpArtifacts = None
) -> List[RecognizerResult]

Analyze text using Azure AI Language.

PARAMETER DESCRIPTION
text

Text to analyze

TYPE: str

entities

List of entities to return

TYPE: List[str] DEFAULT: None

nlp_artifacts

Object of type NlpArtifacts, not used in this recognizer.

TYPE: NlpArtifacts DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of RecognizerResult objects, one for each entity found in the text.

Source code in presidio_analyzer/predefined_recognizers/azure_ai_language.py
def analyze(
    self, text: str, entities: List[str] = None, nlp_artifacts: NlpArtifacts = None
) -> List[RecognizerResult]:
    """
    Analyze text using Azure AI Language.

    :param text: Text to analyze
    :param entities: List of entities to return
    :param nlp_artifacts: Object of type NlpArtifacts, not used in this recognizer.
    :return: A list of RecognizerResult, one per each entity found in the text.
    """
    if not entities:
        entities = self.supported_entities
    response = self.ta_client.recognize_pii_entities(
        [text], language=self.supported_language
    )
    results = [doc for doc in response if not doc.is_error]
    recognizer_results = []
    for res in results:
        for entity in res.entities:
            entity.category = entity.category.upper()
            if entity.category.lower() not in [
                ent.lower() for ent in self.supported_entities
            ]:
                continue
            if entity.category.lower() not in [ent.lower() for ent in entities]:
                continue
            analysis_explanation = AzureAILanguageRecognizer._build_explanation(
                original_score=entity.confidence_score,
                entity_type=entity.category,
            )
            recognizer_results.append(
                RecognizerResult(
                    entity_type=entity.category,
                    start=entity.offset,
                    end=entity.offset + entity.length,
                    score=entity.confidence_score,
                    analysis_explanation=analysis_explanation,
                )
            )

    return recognizer_results
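Note the double filter in the loop above: an entity's category must appear both in the recognizer's supported entities and in the requested entities, compared case-insensitively. That filtering step in isolation (the helper name is illustrative, not part of the API):

```python
def keep_category(category: str, supported, requested) -> bool:
    # Case-insensitive membership test, as in the analyze() loop above
    c = category.lower()
    return c in (s.lower() for s in supported) and c in (r.lower() for r in requested)
```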

CreditCardRecognizer

Bases: PatternRecognizer

Recognize common credit card numbers using regex + checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'CREDIT_CARD'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Check whether a detected result should be invalidated by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/credit_card_recognizer.py
class CreditCardRecognizer(PatternRecognizer):
    """
    Recognize common credit card numbers using regex + checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "All Credit Cards (weak)",
            r"\b((4\d{3})|(5[0-5]\d{2})|(6\d{3})|(1\d{3})|(3\d{3}))[- ]?(\d{3,4})[- ]?(\d{3,4})[- ]?(\d{3,5})\b",  # noqa: E501
            0.3,
        ),
    ]

    CONTEXT = [
        "credit",
        "card",
        "visa",
        "mastercard",
        "cc ",
        "amex",
        "discover",
        "jcb",
        "diners",
        "maestro",
        "instapayment",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "CREDIT_CARD",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:  # noqa D102
        sanitized_value = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        checksum = self.__luhn_checksum(sanitized_value)

        return checksum

    @staticmethod
    def __luhn_checksum(sanitized_value: str) -> bool:
        def digits_of(n: str) -> List[int]:
            return [int(dig) for dig in str(n)]

        digits = digits_of(sanitized_value)
        odd_digits = digits[-1::-2]
        even_digits = digits[-2::-2]
        checksum = sum(odd_digits)
        for d in even_digits:
            checksum += sum(digits_of(str(d * 2)))
        return checksum % 10 == 0
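The private `__luhn_checksum` above is the standard Luhn algorithm. A standalone version, with the doubling step written out via `divmod` (an illustrative re-implementation, not the private method itself):

```python
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number]
    odd = digits[-1::-2]   # rightmost digit, then every second digit leftwards
    even = digits[-2::-2]  # the digits that get doubled
    total = sum(odd)
    for d in even:
        total += sum(divmod(d * 2, 10))  # digit sum of the doubled value
    return total % 10 == 0

print(luhn_valid("4012888888881881"))  # well-known test card number
```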

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
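Conceptually, each Pattern contributes regex matches that become candidate results. A simplified sketch with the standard `re` module (the pattern here is a trimmed-down Visa-style regex for illustration, not the exact one shipped with the recognizer):

```python
import re

pattern = r"\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b"
text = "Card: 4111-1111-1111-1111 please"

# Each match yields (start, end, matched_text) — the raw material for a RecognizerResult
candidates = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]
```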

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_recognizer_results.

If a result score is boosted, the derived class must update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
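The base implementation is a no-op; in practice, Presidio's default LemmaContextAwareEnhancer boosts a result's score when context words appear near the entity. A much-simplified, stdlib-only sketch of that idea — the `Result` stand-in class, window size, and boost value are all illustrative assumptions, not Presidio's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Result:
    # Minimal stand-in for RecognizerResult (illustrative only)
    entity_type: str
    start: int
    end: int
    score: float

def enhance_using_context(text, results, context_words,
                          window=5, boost=0.35, max_score=1.0):
    """Boost each result's score if a context word appears within
    `window` tokens before the entity."""
    tokens = text.lower().split()
    # Record each token's character offset so we can compare it
    # against the entity's start position.
    offsets, pos = [], 0
    for tok in tokens:
        pos = text.lower().find(tok, pos)
        offsets.append(pos)
        pos += len(tok)
    for res in results:
        preceding = [t for t, off in zip(tokens, offsets) if off < res.start]
        if any(w in preceding[-window:] for w in context_words):
            res.score = min(res.score + boost, max_score)
    return results
```

For a text like `"my btc wallet is <address>"`, a result starting at the address with score 0.5 would be boosted because "wallet" and "btc" appear in the preceding token window.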

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
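The deduplication above can be exercised without presidio by mirroring the logic against a minimal stand-in for RecognizerResult — the `Result` class below is illustrative, and `contained_in` follows the start/end semantics implied by the comment in the source:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    # Minimal hashable stand-in for RecognizerResult (illustrative only)
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    # Highest score first; ties broken by earlier start, then longer span.
    results = sorted(set(results),
                     key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for result in results:
        if result.score == 0:
            continue
        # Drop results contained in an already-kept result of the same type.
        if any(result.contained_in(f) and result.entity_type == f.entity_type
               for f in filtered):
            continue
        filtered.append(result)
    return filtered
```

A contained result of the same entity type is dropped, while a contained result of a different entity type survives.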

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

Pairs of substrings to search for and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
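For example, a recognizer might use this to strip separators from a credit-card number before checksum validation — the function body below simply reproduces the logic shown above:

```python
def sanitize_value(text, replacement_pairs):
    """Apply each (search, replacement) pair to the input string in order."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Strip dashes and spaces from a formatted card number
cleaned = sanitize_value("4111-1111 1111-1111", [("-", ""), (" ", "")])
```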

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

CryptoRecognizer

Bases: PatternRecognizer

Recognize common crypto account numbers using regex + checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'CRYPTO'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the Bitcoin address using checksum.

bech32_polymod

Compute the Bech32 checksum.

bech32_hrp_expand

Expand the HRP into values for checksum computation.

bech32_verify_checksum

Verify a checksum given HRP and converted data characters.

bech32_decode

Validate a Bech32/Bech32m string, and determine HRP and data.

validate_bech32_address

Validate a Bech32 or Bech32m address.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
class CryptoRecognizer(PatternRecognizer):
    """Recognize common crypto account numbers using regex + checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern("Crypto (Medium)", r"(bc1|[13])[a-zA-HJ-NP-Z0-9]{25,59}", 0.5),
    ]

    CONTEXT = ["wallet", "btc", "bitcoin", "crypto"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "CRYPTO",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """Validate the Bitcoin address using checksum.

        :param pattern_text: The cryptocurrency address to validate.
        :return: True if the address is valid according to its respective
        format, False otherwise.
        """
        if pattern_text.startswith("1") or pattern_text.startswith("3"):
            # P2PKH or P2SH address validation
            try:
                bcbytes = self.__decode_base58(str.encode(pattern_text))
                checksum = sha256(sha256(bcbytes[:-4]).digest()).digest()[:4]
                return bcbytes[-4:] == checksum
            except ValueError:
                return False
        elif pattern_text.startswith("bc1"):
            # Bech32 or Bech32m address validation
            if CryptoRecognizer.validate_bech32_address(pattern_text)[0]:
                return True
        return False

    @staticmethod
    def __decode_base58(bc: bytes) -> bytes:
        digits58 = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
        origlen = len(bc)
        bc = bc.lstrip(digits58[0:1])

        n = 0
        for char in bc:
            n = n * 58 + digits58.index(char)
        return n.to_bytes(origlen - len(bc) + (n.bit_length() + 7) // 8, "big")

    @staticmethod
    def bech32_polymod(values):
        """Compute the Bech32 checksum."""
        generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
        chk = 1
        for value in values:
            top = chk >> 25
            chk = (chk & 0x1FFFFFF) << 5 ^ value
            for i in range(5):
                chk ^= generator[i] if ((top >> i) & 1) else 0
        return chk

    @staticmethod
    def bech32_hrp_expand(hrp):
        """Expand the HRP into values for checksum computation."""
        return [ord(x) >> 5 for x in hrp] + [0] + [ord(x) & 31 for x in hrp]

    @staticmethod
    def bech32_verify_checksum(hrp, data):
        """Verify a checksum given HRP and converted data characters."""
        const = CryptoRecognizer.bech32_polymod(
            CryptoRecognizer.bech32_hrp_expand(hrp) + data
        )
        if const == 1:
            return BECH32
        if const == BECH32M_CONST:
            return BECH32M
        return None

    @staticmethod
    def bech32_decode(bech):
        """Validate a Bech32/Bech32m string, and determine HRP and data."""
        if (any(ord(x) < 33 or ord(x) > 126 for x in bech)) or (
            bech.lower() != bech and bech.upper() != bech
        ):
            return (None, None, None)
        bech = bech.lower()
        pos = bech.rfind("1")
        if pos < 1 or pos + 7 > len(bech) or len(bech) > 90:
            return (None, None, None)
        if not all(x in CHARSET for x in bech[pos + 1 :]):
            return (None, None, None)
        hrp = bech[:pos]
        data = [CHARSET.find(x) for x in bech[pos + 1 :]]
        spec = CryptoRecognizer.bech32_verify_checksum(hrp, data)
        if spec is None:
            return (None, None, None)
        return (hrp, data[:-6], spec)

    @staticmethod
    def validate_bech32_address(address):
        """Validate a Bech32 or Bech32m address."""
        hrp, data, spec = CryptoRecognizer.bech32_decode(address)
        if hrp is not None and data is not None:
            return True, spec
        return False, None
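The private `__decode_base58` helper plus the P2PKH/P2SH branch of `validate_result` implement standard Base58Check validation: the last four bytes of the decoded address must equal the first four bytes of a double SHA-256 of the rest. A self-contained version mirroring that logic (function names are illustrative):

```python
from hashlib import sha256

DIGITS58 = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def decode_base58(address: str) -> bytes:
    raw = address.encode()
    origlen = len(raw)
    raw = raw.lstrip(DIGITS58[0:1])  # leading '1's encode leading zero bytes
    n = 0
    for char in raw:
        n = n * 58 + DIGITS58.index(char)  # raises ValueError on bad chars
    return n.to_bytes(origlen - len(raw) + (n.bit_length() + 7) // 8, "big")

def is_valid_p2pkh_or_p2sh(address: str) -> bool:
    """Base58Check: last 4 bytes are the start of a double SHA-256 checksum."""
    try:
        bcbytes = decode_base58(address)
    except ValueError:
        return False
    return bcbytes[-4:] == sha256(sha256(bcbytes[:-4]).digest()).digest()[:4]
```

The well-known genesis-block address `1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa` validates, and altering any character breaks the checksum.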

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

Pairs of substrings to search for and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the Bitcoin address using checksum.

PARAMETER DESCRIPTION
pattern_text

The cryptocurrency address to validate.

TYPE: str

RETURNS DESCRIPTION
bool

True if the address is valid according to its respective format, False otherwise.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """Validate the Bitcoin address using checksum.

    :param pattern_text: The cryptocurrency address to validate.
    :return: True if the address is valid according to its respective
    format, False otherwise.
    """
    if pattern_text.startswith("1") or pattern_text.startswith("3"):
        # P2PKH or P2SH address validation
        try:
            bcbytes = self.__decode_base58(str.encode(pattern_text))
            checksum = sha256(sha256(bcbytes[:-4]).digest()).digest()[:4]
            return bcbytes[-4:] == checksum
        except ValueError:
            return False
    elif pattern_text.startswith("bc1"):
        # Bech32 or Bech32m address validation
        if CryptoRecognizer.validate_bech32_address(pattern_text)[0]:
            return True
    return False

bech32_polymod staticmethod

bech32_polymod(values)

Compute the Bech32 checksum.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
@staticmethod
def bech32_polymod(values):
    """Compute the Bech32 checksum."""
    generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for value in values:
        top = chk >> 25
        chk = (chk & 0x1FFFFFF) << 5 ^ value
        for i in range(5):
            chk ^= generator[i] if ((top >> i) & 1) else 0
    return chk

bech32_hrp_expand staticmethod

bech32_hrp_expand(hrp)

Expand the HRP into values for checksum computation.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
@staticmethod
def bech32_hrp_expand(hrp):
    """Expand the HRP into values for checksum computation."""
    return [ord(x) >> 5 for x in hrp] + [0] + [ord(x) & 31 for x in hrp]

bech32_verify_checksum staticmethod

bech32_verify_checksum(hrp, data)

Verify a checksum given HRP and converted data characters.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
@staticmethod
def bech32_verify_checksum(hrp, data):
    """Verify a checksum given HRP and converted data characters."""
    const = CryptoRecognizer.bech32_polymod(
        CryptoRecognizer.bech32_hrp_expand(hrp) + data
    )
    if const == 1:
        return BECH32
    if const == BECH32M_CONST:
        return BECH32M
    return None
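Taken together, `bech32_polymod` and `bech32_hrp_expand` verify a checksum as sketched below. The `CHARSET` string and `BECH32M_CONST` value are assumptions taken from BIP-173/BIP-350 (they are module-level constants in the recognizer, not shown above), and the address is the standard BIP-173 test vector:

```python
# Self-contained sketch of the Bech32 checksum chain above. CHARSET and
# BECH32M_CONST are assumed from BIP-173/BIP-350.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"
BECH32M_CONST = 0x2BC830A3


def bech32_polymod(values):
    generator = [0x3B6A57B2, 0x26508E6D, 0x1EA119FA, 0x3D4233DD, 0x2A1462B3]
    chk = 1
    for value in values:
        top = chk >> 25
        chk = (chk & 0x1FFFFFF) << 5 ^ value
        for i in range(5):
            chk ^= generator[i] if ((top >> i) & 1) else 0
    return chk


def bech32_hrp_expand(hrp):
    return [ord(x) >> 5 for x in hrp] + [0] + [ord(x) & 31 for x in hrp]


# BIP-173 test vector: a valid Bech32 (SegWit v0) address with HRP "bc".
addr = "bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t4"
hrp, rest = addr[:2], addr[3:]
data = [CHARSET.find(c) for c in rest]
const = bech32_polymod(bech32_hrp_expand(hrp) + data)
# const == 1 marks Bech32; const == BECH32M_CONST would mark Bech32m.
```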

bech32_decode staticmethod

bech32_decode(bech)

Validate a Bech32/Bech32m string, and determine HRP and data.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
@staticmethod
def bech32_decode(bech):
    """Validate a Bech32/Bech32m string, and determine HRP and data."""
    if (any(ord(x) < 33 or ord(x) > 126 for x in bech)) or (
        bech.lower() != bech and bech.upper() != bech
    ):
        return (None, None, None)
    bech = bech.lower()
    pos = bech.rfind("1")
    if pos < 1 or pos + 7 > len(bech) or len(bech) > 90:
        return (None, None, None)
    if not all(x in CHARSET for x in bech[pos + 1 :]):
        return (None, None, None)
    hrp = bech[:pos]
    data = [CHARSET.find(x) for x in bech[pos + 1 :]]
    spec = CryptoRecognizer.bech32_verify_checksum(hrp, data)
    if spec is None:
        return (None, None, None)
    return (hrp, data[:-6], spec)
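The HRP/data split that `bech32_decode` performs can be sketched in isolation. `CHARSET` below is the standard Bech32 character set (an assumption here; it is a module constant in the recognizer):

```python
# Minimal sketch of the HRP/data split performed by bech32_decode above.
CHARSET = "qpzry9x8gf2tvdw0s3jn54khce6mua7l"

addr = "bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t4"
pos = addr.rfind("1")                              # last "1" separates HRP from data
hrp = addr[:pos]                                   # human-readable part, e.g. "bc"
data = [CHARSET.find(c) for c in addr[pos + 1:]]   # 5-bit values per character
# The final six data values are the checksum; the payload is data[:-6].
payload = data[:-6]
```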

validate_bech32_address staticmethod

validate_bech32_address(address)

Validate a Bech32 or Bech32m address.

Source code in presidio_analyzer/predefined_recognizers/crypto_recognizer.py
133
134
135
136
137
138
139
@staticmethod
def validate_bech32_address(address):
    """Validate a Bech32 or Bech32m address."""
    hrp, data, spec = CryptoRecognizer.bech32_decode(address)
    if hrp is not None and data is not None:
        return True, spec
    return False, None

DateRecognizer

Bases: PatternRecognizer

Recognize date using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'DATE_TIME'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/date_recognizer.py
class DateRecognizer(PatternRecognizer):
    """
    Recognize date using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "mm/dd/yyyy or mm/dd/yy",
            r"\b(([1-9]|0[1-9]|1[0-2])/([1-9]|0[1-9]|[1-2][0-9]|3[0-1])/(\d{4}|\d{2}))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "dd/mm/yyyy or dd/mm/yy",
            r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])/([1-9]|0[1-9]|1[0-2])/(\d{4}|\d{2}))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "yyyy/mm/dd",
            r"\b(\d{4}/([1-9]|0[1-9]|1[0-2])/([1-9]|0[1-9]|[1-2][0-9]|3[0-1]))\b",
            0.6,
        ),
        Pattern(
            "mm-dd-yyyy",
            r"\b(([1-9]|0[1-9]|1[0-2])-([1-9]|0[1-9]|[1-2][0-9]|3[0-1])-\d{4})\b",
            0.6,
        ),
        Pattern(
            "dd-mm-yyyy",
            r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])-([1-9]|0[1-9]|1[0-2])-\d{4})\b",
            0.6,
        ),
        Pattern(
            "yyyy-mm-dd",
            r"\b(\d{4}-([1-9]|0[1-9]|1[0-2])-([1-9]|0[1-9]|[1-2][0-9]|3[0-1]))\b",
            0.6,
        ),
        Pattern(
            "dd.mm.yyyy or dd.mm.yy",
            r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])\.([1-9]|0[1-9]|1[0-2])\.(\d{4}|\d{2}))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "dd-MMM-yyyy or dd-MMM-yy",
            r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-(\d{4}|\d{2}))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "MMM-yyyy or MMM-yy",
            r"\b((JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-(\d{4}|\d{2}))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "dd-MMM or dd-MMM",
            r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "mm/yyyy or m/yyyy",
            r"\b(([1-9]|0[1-9]|1[0-2])/\d{4})\b",
            0.2,
        ),
        Pattern(
            "mm/yy or m/yy",
            r"\b(([1-9]|0[1-9]|1[0-2])/\d{2})\b",
            0.1,
        ),
    ]

    CONTEXT = ["date", "birthday"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "DATE_TIME",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
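Each `Pattern` above pairs a name, a regex, and a base confidence score. As an illustrative sketch outside Presidio, the "mm/dd/yyyy or mm/dd/yy" pattern can be exercised directly with `re`:

```python
import re

# The "mm/dd/yyyy or mm/dd/yy" regex from the PATTERNS list above,
# applied directly; in Presidio this runs inside PatternRecognizer.analyze.
MM_DD_YYYY = r"\b(([1-9]|0[1-9]|1[0-2])/([1-9]|0[1-9]|[1-2][0-9]|3[0-1])/(\d{4}|\d{2}))\b"

match = re.search(MM_DD_YYYY, "The invoice is due 12/01/2023, please pay on time.")
found = match.group(1) if match else None
```

In the full pipeline, such a match would be emitted as a `RecognizerResult` with score 0.6, subject to context enhancement.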

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
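The serialization round trip above hinges on one key renaming: `to_dict` exposes a single `supported_entity`, while the in-memory representation stores a `supported_entities` list. A hypothetical sketch of just that transform (plain dicts here, not Presidio objects):

```python
# Hypothetical sketch of the key renaming done by to_dict/from_dict.
# Serialization exposes "supported_entity"; in memory the recognizer
# keeps a "supported_entities" list.
def serialize(conf: dict) -> dict:
    out = dict(conf)
    out["supported_entity"] = out.pop("supported_entities")[0]
    return out


def deserialize(conf: dict) -> dict:
    out = dict(conf)
    out["supported_entities"] = [out.pop("supported_entity")]
    return out


conf = {"supported_entities": ["DATE_TIME"], "context": ["date"]}
roundtrip = deserialize(serialize(conf))
```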

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
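The filtering logic above can be demonstrated with a minimal stand-in for `RecognizerResult` (hypothetical class; Presidio's real result object has more fields). Exact duplicates are collapsed by the `set`, and lower-scoring results contained inside a same-type result are dropped:

```python
from dataclasses import dataclass


# Minimal stand-in for RecognizerResult, just enough to show the
# containment-based duplicate filter above.
@dataclass(frozen=True)
class Result:
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results):
    results = sorted(set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    kept = []
    for r in results:
        if r.score == 0:
            continue  # zero-score results are dropped outright
        if not any(r.contained_in(k) and r.entity_type == k.entity_type for k in kept):
            kept.append(r)
    return kept


hits = [
    Result("DATE_TIME", 10, 20, 0.6),    # full match
    Result("DATE_TIME", 10, 20, 0.6),    # exact duplicate, removed by set()
    Result("DATE_TIME", 13, 18, 0.4),    # contained in the first, dropped
    Result("PHONE_NUMBER", 13, 18, 0.4), # different entity type, kept
]
deduped = remove_duplicates(hits)
```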

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

Pairs of (search string, replacement string) specifying what to replace with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
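A typical use of this helper is stripping separators from a detected value before running a checksum. The pairs below are illustrative:

```python
# Same logic as sanitize_value above: apply each (search, replacement)
# pair in order. The credit-card-style input is illustrative only.
def sanitize_value(text, replacement_pairs):
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


cleaned = sanitize_value("4111-1111 1111 1111", [("-", ""), (" ", "")])
```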

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

EmailRecognizer

Bases: PatternRecognizer

Recognize email addresses using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'EMAIL_ADDRESS'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/email_recognizer.py
class EmailRecognizer(PatternRecognizer):
    """
    Recognize email addresses using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Email (Medium)",
            r"\b((([!#$%&'*+\-/=?^_`{|}~\w])|([!#$%&'*+\-/=?^_`{|}~\w][!#$%&'*+\-/=?^_`{|}~\.\w]{0,}[!#$%&'*+\-/=?^_`{|}~\w]))[@]\w+([-.]\w+)*\.\w+([-.]\w+)*)\b",  # noqa: E501
            0.5,
        ),
    ]

    CONTEXT = ["email"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "EMAIL_ADDRESS",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str):  # noqa D102
        result = tldextract.extract(pattern_text)
        return result.fqdn != ""
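As an illustrative sketch outside Presidio, the "Email (Medium)" pattern above can be exercised directly with `re`; the `tldextract`-based `validate_result` step (which confirms the domain has a known TLD) is skipped here:

```python
import re

# The "Email (Medium)" regex from the PATTERNS list above, applied directly.
# In Presidio, matches would additionally pass through validate_result.
EMAIL_PATTERN = r"\b((([!#$%&'*+\-/=?^_`{|}~\w])|([!#$%&'*+\-/=?^_`{|}~\w][!#$%&'*+\-/=?^_`{|}~\.\w]{0,}[!#$%&'*+\-/=?^_`{|}~\w]))[@]\w+([-.]\w+)*\.\w+([-.]\w+)*)\b"

match = re.search(EMAIL_PATTERN, "Contact john.doe@example.com for details.")
candidate = match.group(1) if match else None
```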

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

Pairs of (search string, replacement string) specifying what to replace with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

EsNieRecognizer

Bases: PatternRecognizer

Recognize NIE number using regex and checksum.

Reference(s): https://es.wikipedia.org/wiki/N%C3%BAmero_de_identidad_de_extranjero https://www.interior.gob.es/opencms/ca/servicios-al-ciudadano/tramites-y-gestiones/dni/calculo-del-digito-de-control-del-nif-nie/

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'es'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'ES_NIE'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern by using the control character.

Source code in presidio_analyzer/predefined_recognizers/es_nie_recognizer.py
class EsNieRecognizer(PatternRecognizer):
    """
    Recognize NIE number using regex and checksum.

    Reference(s):
    https://es.wikipedia.org/wiki/N%C3%BAmero_de_identidad_de_extranjero
    https://www.interior.gob.es/opencms/ca/servicios-al-ciudadano/tramites-y-gestiones/dni/calculo-del-digito-de-control-del-nif-nie/

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes
    or spaces.
    """

    PATTERNS = [
        Pattern(
            "NIE",
            r"\b[X-Z]?[0-9]?[0-9]{7}[-]?[A-Z]\b",
            0.5,
        ),
    ]

    CONTEXT = ["número de identificación de extranjero", "NIE"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "es",
        supported_entity: str = "ES_NIE",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """Validate the pattern by using the control character."""

        pattern_text = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )

        letters = "TRWAGMYFPDXBNJZSQVHLCKE"
        letter = pattern_text[-1]

        # check last is a letter, and first is in X,Y,Z
        if not pattern_text[1:-1].isdigit() or pattern_text[:1] not in "XYZ":
            return False
        # check size is 8 or 9
        if len(pattern_text) < 8 or len(pattern_text) > 9:
            return False

        # replace XYZ with 012, and check the mod 23
        number = int(str("XYZ".index(pattern_text[0])) + pattern_text[1:-1])
        return letter == letters[number % 23]
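The mod-23 checksum in `validate_result` can be exercised on its own. This sketch (the `nie_control_letter` helper is illustrative, not part of the package) reproduces the lookup table from the source above: the leading X/Y/Z maps to 0/1/2, the combined number mod 23 indexes the letter table.

```python
def nie_control_letter(nie: str) -> str:
    """Compute the NIE control letter: map X/Y/Z to 0/1/2, then take mod 23."""
    letters = "TRWAGMYFPDXBNJZSQVHLCKE"
    number = int(str("XYZ".index(nie[0])) + nie[1:8])
    return letters[number % 23]

# "X1234567L" validates: 01234567 % 23 == 19, and letters[19] == "L"
print(nie_control_letter("X1234567L"))  # L
```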

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the regex patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by the regex patterns
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class when custom logic
    is needed; otherwise the return value will be equal to
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer-specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
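The containment rule above keeps the highest-scoring, longest span when a result of the same entity type is nested inside another. A self-contained sketch of that rule (the `Span` tuple is an illustrative stand-in for `RecognizerResult`, not part of the package):

```python
from typing import List, NamedTuple

class Span(NamedTuple):
    # Minimal stand-in for RecognizerResult: just the fields the rule needs
    entity_type: str
    start: int
    end: int
    score: float

def dedupe(results: List[Span]) -> List[Span]:
    # Highest score first, then earlier start, then longer span
    ordered = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept: List[Span] = []
    for r in ordered:
        if r.score == 0:
            continue  # zero-score results are dropped outright
        contained = any(
            r.start >= k.start and r.end <= k.end and r.entity_type == k.entity_type
            for k in kept
        )
        if not contained:
            kept.append(r)
    return kept

spans = [
    Span("ES_NIF", 0, 9, 0.5),
    Span("ES_NIF", 0, 9, 0.5),  # exact duplicate, removed by set()
    Span("ES_NIF", 2, 5, 0.3),  # contained in the first span, same type
]
print(dedupe(spans))  # [Span(entity_type='ES_NIF', start=0, end=9, score=0.5)]
```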

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string using the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern by using the control character.

Source code in presidio_analyzer/predefined_recognizers/es_nie_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """Validate the pattern by using the control character."""

    pattern_text = EntityRecognizer.sanitize_value(
        pattern_text, self.replacement_pairs
    )

    letters = "TRWAGMYFPDXBNJZSQVHLCKE"
    letter = pattern_text[-1]

    # check last is a letter, and first is in X,Y,Z
    if not pattern_text[1:-1].isdigit() or pattern_text[:1] not in "XYZ":
        return False
    # check size is 8 or 9
    if len(pattern_text) < 8 or len(pattern_text) > 9:
        return False

    # replace XYZ with 012, and check the mod 23
    number = int(str("XYZ".index(pattern_text[0])) + pattern_text[1:-1])
    return letter == letters[number % 23]

EsNifRecognizer

Bases: PatternRecognizer

Recognize NIF number using regex and checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'es'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'ES_NIF'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/es_nif_recognizer.py
class EsNifRecognizer(PatternRecognizer):
    """
    Recognize NIF number using regex and checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "NIF",
            r"\b[0-9]?[0-9]{7}[-]?[A-Z]\b",
            0.5,
        ),
    ]

    CONTEXT = ["documento nacional de identidad", "DNI", "NIF", "identificación"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "es",
        supported_entity: str = "ES_NIF",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:  # noqa D102
        pattern_text = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        letter = pattern_text[-1]
        number = int("".join(filter(str.isdigit, pattern_text)))
        letters = "TRWAGMYFPDXBNJZSQVHLCKE"
        return letter == letters[number % 23]
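For the NIF the same mod-23 table is applied directly to the 8 digits. A minimal sketch (the `nif_control_letter` helper is illustrative, not part of the package) using the widely cited sample DNI 12345678, whose control letter is Z:

```python
def nif_control_letter(digits: str) -> str:
    """Control letter for a Spanish NIF/DNI: the 8 digits mod 23 index the table."""
    letters = "TRWAGMYFPDXBNJZSQVHLCKE"
    return letters[int(digits) % 23]

# 12345678 % 23 == 14, and letters[14] == "Z"
print(nif_control_letter("12345678"))  # Z
```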

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the regex patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by the regex patterns
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class when custom logic
    is needed; otherwise the return value will be equal to
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer-specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
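The sort order above does the real work: results are ordered by descending score, then position, then descending length, and each later result is dropped if it is contained in an already-kept result of the same entity type. A self-contained sketch of the same filtering, with a hypothetical `Span` standing in for `RecognizerResult`:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for RecognizerResult, for illustration only."""

    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results: List[Span]) -> List[Span]:
    # Highest score first; ties broken by position and descending length
    ordered = sorted(
        set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start))
    )
    filtered: List[Span] = []
    for result in ordered:
        if result.score == 0:
            continue  # zero-confidence results are dropped outright
        if any(
            result.contained_in(kept) and result.entity_type == kept.entity_type
            for kept in filtered
        ):
            continue  # covered by an already-kept span of the same type
        filtered.append(result)
    return filtered
```

A span of a *different* entity type survives even when fully contained, which is why overlapping PERSON and PHONE_NUMBER results can both be returned.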

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to search for and the values to replace them with

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
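Recognizers typically call this before checksum validation so that separators do not break the arithmetic. A standalone copy of the method's sequential-replacement logic plus a usage example (the replacement pairs below are illustrative; pairs are applied in order):

```python
from typing import List, Tuple


def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    # Apply each (search string, replacement string) pair in order
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Strip dashes and spaces before validating a digit checksum:
sanitize_value("55-66 777", [("-", ""), (" ", "")])  # -> "5566777"
```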

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

FiPersonalIdentityCodeRecognizer

Bases: PatternRecognizer

Recognizes and validates Finnish Personal Identity Codes (Henkilötunnus).

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'fi'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'FI_PERSONAL_IDENTITY_CODE'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern by using the control character.

Source code in presidio_analyzer/predefined_recognizers/fi_personal_identity_code_recognizer.py
class FiPersonalIdentityCodeRecognizer(PatternRecognizer):
    """
    Recognizes and validates Finnish Personal Identity Codes (Henkilötunnus).

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Finnish Personal Identity Code (Medium)",
            r"\b(\d{6})([+-ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])\b",
            0.5,
        ),
        Pattern(
            "Finnish Personal Identity Code (Very Weak)",
            r"(\d{6})([+-ABCDEFYXWVU])(\d{3})([0123456789ABCDEFHJKLMNPRSTUVWXY])",
            0.1,
        ),
    ]
    CONTEXT = ["hetu", "henkilötunnus", "personbeteckningen", "personal identity code"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "fi",
        supported_entity: str = "FI_PERSONAL_IDENTITY_CODE",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """Validate the pattern by using the control character."""

        # More information on the validation logic from:
        # https://dvv.fi/en/personal-identity-code
        # Under "How is the control character for a personal identity code calculated?".
        if len(pattern_text) != 11:
            return False

        date_part = pattern_text[0:6]
        try:
            # Checking if we do not have invalid dates e.g. 310211.
            datetime.strptime(date_part, "%d%m%y")
        except ValueError:
            return False
        individual_number = pattern_text[7:10]
        control_character = pattern_text[-1]
        valid_control_characters = "0123456789ABCDEFHJKLMNPRSTUVWXY"
        number_to_check = int(date_part + individual_number)
        return valid_control_characters[number_to_check % 31] == control_character
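The checksum can be reproduced outside the recognizer: the date and individual-number digits are read as one nine-digit integer, and its remainder modulo 31 indexes the 31-character control alphabet. A standalone sketch of the same arithmetic (the identity code below is constructed for illustration, not taken from a real person):

```python
from datetime import datetime

VALID_CONTROL_CHARACTERS = "0123456789ABCDEFHJKLMNPRSTUVWXY"


def is_valid_hetu(code: str) -> bool:
    """Standalone sketch of the same checksum arithmetic as validate_result."""
    if len(code) != 11:
        return False
    date_part, individual_number, control_character = code[0:6], code[7:10], code[-1]
    try:
        datetime.strptime(date_part, "%d%m%y")  # reject impossible dates, e.g. 300290
    except ValueError:
        return False
    # ddmmyy + individual number, read as one integer, indexes the alphabet
    return VALID_CONTROL_CHARACTERS[int(date_part + individual_number) % 31] == control_character


# Constructed example: int("010190123") % 31 == 20, and index 20 of the
# control alphabet is "M", so "010190-123M" validates
print(is_valid_hetu("010190-123M"))  # True
print(is_valid_hetu("010190-123A"))  # False: wrong control character
```

Note that, like the original, the sketch never inspects the century separator at index 6; the regex pattern constrains that character before validation runs.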

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to search for and the values to replace them with

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern by using the control character.

Source code in presidio_analyzer/predefined_recognizers/fi_personal_identity_code_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """Validate the pattern by using the control character."""

    # More information on the validation logic from:
    # https://dvv.fi/en/personal-identity-code
    # Under "How is the control character for a personal identity code calculated?".
    if len(pattern_text) != 11:
        return False

    date_part = pattern_text[0:6]
    try:
        # Checking if we do not have invalid dates e.g. 310211.
        datetime.strptime(date_part, "%d%m%y")
    except ValueError:
        return False
    individual_number = pattern_text[7:10]
    control_character = pattern_text[-1]
    valid_control_characters = "0123456789ABCDEFHJKLMNPRSTUVWXY"
    number_to_check = int(date_part + individual_number)
    return valid_control_characters[number_to_check % 31] == control_character

GLiNERRecognizer

Bases: LocalRecognizer

GLiNER model based entity recognizer.

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

load

Load the GLiNER model.

analyze

Analyze text to identify entities using a GLiNER model.

Source code in presidio_analyzer/predefined_recognizers/gliner_recognizer.py
class GLiNERRecognizer(LocalRecognizer):
    """GLiNER model based entity recognizer."""

    def __init__(
        self,
        supported_entities: Optional[List[str]] = None,
        name: str = "GLiNERRecognizer",
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
        entity_mapping: Optional[Dict[str, str]] = None,
        model_name: str = "urchade/gliner_multi_pii-v1",
        flat_ner: bool = True,
        multi_label: bool = False,
        threshold: float = 0.30,
        map_location: str = "cpu",
    ):
        """GLiNER model based entity recognizer.

        The model is based on the GLiNER library.

        :param supported_entities: List of supported entities for this recognizer.
        If None, all entities in Presidio's default configuration will be used.
        see `NerModelConfiguration`
        :param name: Name of the recognizer
        :param supported_language: Language code to use for the recognizer
        :param version: Version of the recognizer
        :param context: N/A for this recognizer
        :param model_name: The name of the GLiNER model to load
        :param flat_ner: Whether to use flat NER or not (see GLiNER's documentation)
        :param multi_label: Whether to use multi-label classification or not
        (see GLiNER's documentation)
        :param threshold: The threshold for the model's output
        (see GLiNER's documentation)
        :param map_location: The device to use for the model


        """

        if entity_mapping:
            if supported_entities:
                raise ValueError(
                    "entity_mapping and supported_entities cannot be used together"
                )

            self.model_to_presidio_entity_mapping = entity_mapping
        else:
            if not supported_entities:
                logger.info(
                    "No supported entities provided, "
                    "using default entities from NerModelConfiguration"
                )
                self.model_to_presidio_entity_mapping = (
                    NerModelConfiguration().model_to_presidio_entity_mapping
                )
            else:
                self.model_to_presidio_entity_mapping = {
                    entity: entity for entity in supported_entities
                }

        logger.info("Using entity mapping %s",
                    json.dumps(entity_mapping, indent=2))
        supported_entities = list(set(self.model_to_presidio_entity_mapping.values()))
        self.model_name = model_name
        self.map_location = map_location
        self.flat_ner = flat_ner
        self.multi_label = multi_label
        self.threshold = threshold

        self.gliner = None

        super().__init__(
            supported_entities=supported_entities,
            name=name,
            supported_language=supported_language,
            version=version,
            context=context,
        )

        self.gliner_labels = list(self.model_to_presidio_entity_mapping.keys())

    def load(self) -> None:
        """Load the GLiNER model."""
        if not GLiNER:
            raise ImportError("GLiNER is not installed. Please install it.")
        self.gliner = GLiNER.from_pretrained(self.model_name)

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: Optional[NlpArtifacts] = None,
    ) -> List[RecognizerResult]:
        """Analyze text to identify entities using a GLiNER model.

        :param text: The text to be analyzed
        :param entities: The list of entities this recognizer is requested to return
        :param nlp_artifacts: N/A for this recognizer
        """

        # combine the input labels as this model allows for ad-hoc labels
        labels = self.__create_input_labels(entities)

        predictions = self.gliner.predict_entities(
            text=text,
            labels=labels,
            flat_ner=self.flat_ner,
            threshold=self.threshold,
            multi_label=self.multi_label,
        )
        recognizer_results = []
        for prediction in predictions:
            presidio_entity = self.model_to_presidio_entity_mapping.get(
                prediction["label"], prediction["label"]
            )
            if entities and presidio_entity not in entities:
                continue

            analysis_explanation = AnalysisExplanation(
                recognizer=self.name,
                original_score=prediction["score"],
                textual_explanation=f"Identified as {presidio_entity} by GLiNER",
            )

            recognizer_results.append(
                RecognizerResult(
                    entity_type=presidio_entity,
                    start=prediction["start"],
                    end=prediction["end"],
                    score=prediction["score"],
                    analysis_explanation=analysis_explanation,
                )
            )

        return recognizer_results

    def __create_input_labels(self, entities):
        """Append the entities requested by the user to the list of labels if it's not there.""" # noqa: E501
        labels = self.gliner_labels
        for entity in entities:
            if (
                entity not in self.model_to_presidio_entity_mapping.values()
                and entity not in self.gliner_labels
            ):
                labels.append(entity)
        return labels
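The constructor treats entity_mapping, supported_entities, and the library defaults as three mutually exclusive sources for the model-label-to-Presidio-entity mapping. A standalone sketch of that resolution logic, with DEFAULT_MAPPING standing in for NerModelConfiguration's defaults (the values in it are hypothetical):

```python
from typing import Dict, List, Optional

# Placeholder for NerModelConfiguration's default model-to-Presidio mapping
DEFAULT_MAPPING: Dict[str, str] = {"person": "PERSON", "location": "LOCATION"}


def resolve_entity_mapping(
    supported_entities: Optional[List[str]] = None,
    entity_mapping: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    # Same precedence as GLiNERRecognizer.__init__:
    # explicit mapping > explicit entity list > library defaults
    if entity_mapping:
        if supported_entities:
            raise ValueError(
                "entity_mapping and supported_entities cannot be used together"
            )
        return entity_mapping
    if not supported_entities:
        return DEFAULT_MAPPING  # fall back to the default configuration
    # An explicit entity list maps each entity to itself
    return {entity: entity for entity in supported_entities}
```

The keys of the resolved mapping become the GLiNER labels passed to predict_entities, while the values are the Presidio entity types reported in the results.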

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)
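The to_dict/from_dict pair is designed to round-trip: the keys that to_dict emits match the constructor parameters that from_dict consumes. A minimal stand-in class (hypothetical, not the real EntityRecognizer) demonstrating that contract:

```python
class TinyRecognizer:
    """Hypothetical stand-in with the same serialization contract as EntityRecognizer."""

    def __init__(self, supported_entities, supported_language="en",
                 name="TinyRecognizer", version="0.0.1"):
        self.supported_entities = supported_entities
        self.supported_language = supported_language
        self.name = name
        self.version = version

    def to_dict(self):
        # Keys mirror the constructor parameters so from_dict can rebuild the instance
        return {
            "supported_entities": self.supported_entities,
            "supported_language": self.supported_language,
            "name": self.name,
            "version": self.version,
        }

    @classmethod
    def from_dict(cls, d):
        return cls(**d)
```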

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
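The keep/drop logic above can be exercised with a lightweight stand-in for RecognizerResult. This sketch (assumed types, not the Presidio classes) reproduces the same filtering: sort by descending score, drop zero-score results, and drop spans contained in an already-kept span of the same entity type:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Hypothetical stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other):
        return self.start >= other.start and self.end <= other.end


def dedupe(results):
    """Keep highest-scoring spans; drop zero scores and contained duplicates."""
    ordered = sorted(set(results),
                     key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    kept = []
    for res in ordered:
        if res.score == 0:
            continue
        if any(res.contained_in(k) and res.entity_type == k.entity_type
               for k in kept):
            continue
        kept.append(res)
    return kept
```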

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
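Since sanitize_value is pure string logic, its behavior is easy to preview standalone; the helper below mirrors the method body:

```python
def sanitize_value(text, replacement_pairs):
    # Apply each (search, replacement) pair in order
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Normalizing an IBAN-like string before checksum validation
cleaned = sanitize_value("DE89 3704-0044 0532-0130 00", [("-", ""), (" ", "")])
```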

load

load() -> None

Load the GLiNER model.

Source code in presidio_analyzer/predefined_recognizers/gliner_recognizer.py
def load(self) -> None:
    """Load the GLiNER model."""
    if not GLiNER:
        raise ImportError("GLiNER is not installed. Please install it.")
    self.gliner = GLiNER.from_pretrained(self.model_name)

analyze

analyze(
    text: str, entities: List[str], nlp_artifacts: Optional[NlpArtifacts] = None
) -> List[RecognizerResult]

Analyze text to identify entities using a GLiNER model.

PARAMETER DESCRIPTION
text

The text to be analyzed

TYPE: str

entities

The list of entities this recognizer is requested to return

TYPE: List[str]

nlp_artifacts

N/A for this recognizer

TYPE: Optional[NlpArtifacts] DEFAULT: None

Source code in presidio_analyzer/predefined_recognizers/gliner_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
) -> List[RecognizerResult]:
    """Analyze text to identify entities using a GLiNER model.

    :param text: The text to be analyzed
    :param entities: The list of entities this recognizer is requested to return
    :param nlp_artifacts: N/A for this recognizer
    """

    # combine the input labels as this model allows for ad-hoc labels
    labels = self.__create_input_labels(entities)

    predictions = self.gliner.predict_entities(
        text=text,
        labels=labels,
        flat_ner=self.flat_ner,
        threshold=self.threshold,
        multi_label=self.multi_label,
    )
    recognizer_results = []
    for prediction in predictions:
        presidio_entity = self.model_to_presidio_entity_mapping.get(
            prediction["label"], prediction["label"]
        )
        if entities and presidio_entity not in entities:
            continue

        analysis_explanation = AnalysisExplanation(
            recognizer=self.name,
            original_score=prediction["score"],
            textual_explanation=f"Identified as {presidio_entity} by GLiNER",
        )

        recognizer_results.append(
            RecognizerResult(
                entity_type=presidio_entity,
                start=prediction["start"],
                end=prediction["end"],
                score=prediction["score"],
                analysis_explanation=analysis_explanation,
            )
        )

    return recognizer_results
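The prediction-to-result mapping can be previewed without loading a model. This sketch feeds hand-written predictions (the dict shape shown above: label/start/end/score) through the same map-and-filter logic; the mapping dict here is an illustrative assumption, not the recognizer's default:

```python
def map_predictions(predictions, entities, mapping):
    """Map model labels to Presidio entity names and keep only requested entities."""
    results = []
    for pred in predictions:
        # Fall back to the raw label when no mapping entry exists
        entity = mapping.get(pred["label"], pred["label"])
        if entities and entity not in entities:
            continue
        results.append({"entity_type": entity, "start": pred["start"],
                        "end": pred["end"], "score": pred["score"]})
    return results


preds = [
    {"label": "person", "start": 0, "end": 8, "score": 0.97},
    {"label": "organization", "start": 20, "end": 29, "score": 0.91},
]
mapped = map_predictions(preds, ["PERSON"],
                         {"person": "PERSON", "organization": "ORGANIZATION"})
```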

IbanRecognizer

Bases: PatternRecognizer

Recognize IBAN code using regex and checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: List[str] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: List[str] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IBAN_CODE'

exact_match

Whether patterns should be exactly matched or not

TYPE: bool DEFAULT: False

bos_eos

Tuple of strings for beginning of string (BOS) and end of string (EOS)

TYPE: Tuple[str, str] DEFAULT: (BOS, EOS)

regex_flags

Regex flags options

TYPE: int DEFAULT: DOTALL | MULTILINE

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

analyze

Analyze IBAN.

Source code in presidio_analyzer/predefined_recognizers/iban_recognizer.py
class IbanRecognizer(PatternRecognizer):
    """
    Recognize IBAN code using regex and checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param exact_match: Whether patterns should be exactly matched or not
    :param bos_eos: Tuple of strings for beginning of string (BOS)
    and end of string (EOS)
    :param regex_flags: Regex flags options
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "IBAN Generic",
            r"\b([A-Z]{2}[ \-]?[0-9]{2})(?=(?:[ \-]?[A-Z0-9]){9,30})((?:[ \-]?[A-Z0-9]{3,5}){2})"  # noqa
            r"([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?([ \-]?[A-Z0-9]{3,5})?"  # noqa
            r"([ \-]?[A-Z0-9]{1,3})?\b",  # noqa
            0.5,
        ),
    ]

    CONTEXT = ["iban", "bank", "transaction"]

    LETTERS: Dict[int, str] = {
        ord(d): str(i) for i, d in enumerate(string.digits + string.ascii_uppercase)
    }

    def __init__(
        self,
        patterns: List[str] = None,
        context: List[str] = None,
        supported_language: str = "en",
        supported_entity: str = "IBAN_CODE",
        exact_match: bool = False,
        bos_eos: Tuple[str, str] = (BOS, EOS),
        regex_flags: int = re.DOTALL | re.MULTILINE,
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = replacement_pairs or [("-", ""), (" ", "")]
        self.exact_match = exact_match
        self.BOSEOS = bos_eos if exact_match else ()
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
            global_regex_flags=regex_flags,
        )

    def validate_result(self, pattern_text: str):  # noqa D102
        try:
            pattern_text = EntityRecognizer.sanitize_value(
                pattern_text, self.replacement_pairs
            )
            is_valid_checksum = (
                self.__generate_iban_check_digits(pattern_text, self.LETTERS)
                == pattern_text[2:4]
            )
            # score = EntityRecognizer.MIN_SCORE
            result = False
            if is_valid_checksum:
                if self.__is_valid_format(pattern_text, self.BOSEOS):
                    result = True
                elif self.__is_valid_format(pattern_text.upper(), self.BOSEOS):
                    result = None
            return result
        except ValueError:
            logger.error("Failed to validate text %s", pattern_text)
            return False

    def analyze(
        self,
        text: str,
        entities: List[str],
        nlp_artifacts: NlpArtifacts = None,
        regex_flags: int = None,
    ) -> List[RecognizerResult]:
        """Analyze IBAN."""
        results = []

        if self.patterns:
            pattern_result = self.__analyze_patterns(text)
            results.extend(pattern_result)

        return results

    def __analyze_patterns(self, text: str, flags: int = None):
        """
        Evaluate all patterns in the provided text.

        Logic includes detecting words in the provided deny list.
        In a sentence we could get a false positive at the end of our regex, where we
        want to find the IBAN but not the false positive at the end of the match.

        i.e. "I want my deposit in DE89370400440532013000 2 days from today."

        :param text: text to analyze
        :param flags: regex flags
        :return: A list of RecognizerResult
        """
        flags = flags if flags else self.global_regex_flags
        results = []
        for pattern in self.patterns:
            matches = re.finditer(pattern.regex, text, flags=flags)

            for match in matches:
                for grp_num in reversed(range(1, len(match.groups()) + 1)):
                    start = match.span(0)[0]
                    end = (
                        match.span(grp_num)[1]
                        if match.span(grp_num)[1] > 0
                        else match.span(0)[1]
                    )
                    current_match = text[start:end]

                    # Skip empty results
                    if current_match == "":
                        continue

                    score = pattern.score

                    validation_result = self.validate_result(current_match)
                    description = PatternRecognizer.build_regex_explanation(
                        self.name,
                        pattern.name,
                        pattern.regex,
                        score,
                        validation_result,
                        flags,
                    )
                    pattern_result = RecognizerResult(
                        entity_type=self.supported_entities[0],
                        start=start,
                        end=end,
                        score=score,
                        analysis_explanation=description,
                        recognition_metadata={
                            RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                            RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
                        },
                    )

                    if validation_result is not None:
                        if validation_result:
                            pattern_result.score = EntityRecognizer.MAX_SCORE
                        else:
                            pattern_result.score = EntityRecognizer.MIN_SCORE

                    if pattern_result.score > EntityRecognizer.MIN_SCORE:
                        results.append(pattern_result)
                        break

        return results

    @staticmethod
    def __number_iban(iban: str, letters: Dict[int, str]) -> str:
        return (iban[4:] + iban[:4]).translate(letters)

    @staticmethod
    def __generate_iban_check_digits(iban: str, letters: Dict[int, str]) -> str:
        transformed_iban = (iban[:2] + "00" + iban[4:]).upper()
        number_iban = IbanRecognizer.__number_iban(transformed_iban, letters)
        return f"{98 - (int(number_iban) % 97):0>2}"

    @staticmethod
    def __is_valid_format(
        iban: str,
        bos_eos: Tuple[str, str] = (BOS, EOS),
        flags: int = re.DOTALL | re.MULTILINE,
    ) -> bool:
        country_code = iban[:2]
        if country_code in regex_per_country:
            country_regex = regex_per_country.get(country_code, "")
            if bos_eos and country_regex:
                country_regex = bos_eos[0] + country_regex + bos_eos[1]
            return country_regex and re.match(country_regex, iban, flags=flags)

        return False
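The checksum logic above follows the standard ISO 13616 mod-97 scheme: move the first four characters to the end, map letters to numbers (A=10 … Z=35), and check that the resulting integer is congruent to 1 modulo 97. A standalone sketch of that check (not the class's private helpers), using the same example IBAN as the docstring:

```python
import string

# Map '0'-'9' and 'A'-'Z' to their numeric IBAN values, as strings
LETTER_MAP = {ch: str(i) for i, ch in enumerate(string.digits + string.ascii_uppercase)}


def iban_checksum_ok(iban: str) -> bool:
    """Return True if the IBAN passes the ISO 13616 mod-97 check."""
    iban = iban.replace(" ", "").replace("-", "").upper()
    rearranged = iban[4:] + iban[:4]          # move country code + check digits to the end
    numeric = "".join(LETTER_MAP[ch] for ch in rearranged)
    return int(numeric) % 97 == 1
```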

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer-specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text, to allow related-entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens, for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer supports.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer supports.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: NlpArtifacts = None,
    regex_flags: int = None,
) -> List[RecognizerResult]

Analyze IBAN.

Source code in presidio_analyzer/predefined_recognizers/iban_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: NlpArtifacts = None,
    regex_flags: int = None,
) -> List[RecognizerResult]:
    """Analyze IBAN."""
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text)
        results.extend(pattern_result)

    return results

InAadhaarRecognizer

Bases: PatternRecognizer

Recognizes Indian UIDAI Person Identification Number ("AADHAAR").

Reference: https://en.wikipedia.org/wiki/Aadhaar. A 12-digit unique number issued to each individual by the Government of India.
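This recognizer validates candidates with the Verhoeff checksum (the last digit is a check digit computed from the rest). A compact standalone sketch of Verhoeff validation, using the algorithm's standard multiplication and permutation tables; "2363" is an illustrative valid value (236 with check digit 3), not a real Aadhaar number:

```python
# Verhoeff multiplication table (dihedral group D5)
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
# Verhoeff permutation table (applied cyclically by digit position)
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number: str) -> bool:
    """Validate a digit string whose last digit is a Verhoeff check digit."""
    c = 0
    for i, digit in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(digit)]]
    return c == 0
```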

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IN_AADHAAR'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the detected number using checksum calculation.

Source code in presidio_analyzer/predefined_recognizers/in_aadhaar_recognizer.py
class InAadhaarRecognizer(PatternRecognizer):
    """
    Recognizes Indian UIDAI Person Identification Number ("AADHAAR").

    Reference: https://en.wikipedia.org/wiki/Aadhaar
    A 12-digit unique number issued to each individual by the Government of India
    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "AADHAAR (Very Weak)",
            r"\b[0-9]{12}\b",
            0.01,
        ),
    ]

    CONTEXT = [
        "aadhaar",
        "uidai",
    ]

    utils = None

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IN_AADHAAR",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs
            if replacement_pairs
            else [("-", ""), (" ", ""), (":", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """Determine absolute value based on calculation."""
        sanitized_value = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        return self.__check_aadhaar(sanitized_value)

    def __check_aadhaar(self, sanitized_value: str) -> bool:
        is_valid_aadhaar: bool = False
        if (
            len(sanitized_value) == 12
            and sanitized_value.isnumeric() is True
            and int(sanitized_value[0]) >= 2
            and self._is_verhoeff_number(int(sanitized_value)) is True
            and self._is_palindrome(sanitized_value) is False
        ):
            is_valid_aadhaar = True
        return is_valid_aadhaar

    @staticmethod
    def _is_palindrome(text: str, case_insensitive: bool = False):
        """
        Validate if input text is a true palindrome.

        :param text: input text string to check for palindrome
        :param case_insensitive: optional flag to check palindrome with no case
        :return: True / False
        """
        palindrome_text = text
        if case_insensitive:
            palindrome_text = palindrome_text.replace(" ", "").lower()
        return palindrome_text == palindrome_text[::-1]

    @staticmethod
    def _is_verhoeff_number(input_number: int):
        """
        Check if the input number is a true verhoeff number.

        :param input_number:
        :return:
        """
        __d__ = [
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
            [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
            [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
            [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
            [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
            [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
            [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
            [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
            [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
            [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
        ]
        __p__ = [
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
            [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
            [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
            [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
            [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
            [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
            [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
            [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
        ]
        __inv__ = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]

        c = 0
        inverted_number = list(map(int, reversed(str(input_number))))
        for i in range(len(inverted_number)):
            c = __d__[c][__p__[i % 8][inverted_number[i]]]
        return __inv__[c] == 0
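
The Verhoeff tables above can be exercised in isolation. Below is a minimal standalone sketch (illustrative, not Presidio code) that reuses the same `d`, `p`, and `inv` tables both to validate a full number, mirroring `_is_verhoeff_number`, and to generate a check digit:

```python
# Standalone Verhoeff checksum sketch using the same tables as
# InAadhaarRecognizer._is_verhoeff_number (illustrative, not Presidio code).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_valid(number: str) -> bool:
    """Return True if the full number (including check digit) passes Verhoeff."""
    c = 0
    for i, digit in enumerate(map(int, reversed(number))):
        c = D[c][P[i % 8][digit]]
    return c == 0  # equivalent to INV[c] == 0 in the recognizer


def verhoeff_check_digit(payload: str) -> int:
    """Compute the check digit to append to payload (0 placeholder trick)."""
    c = 0
    for i, digit in enumerate(map(int, reversed(payload + "0"))):
        c = D[c][P[i % 8][digit]]
    return INV[c]
```

For instance, `verhoeff_check_digit("236")` is 3 and `verhoeff_valid("2363")` is True, the classic worked example for the Verhoeff scheme, while `verhoeff_valid("2364")` is False.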

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start and end indices and the same entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
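
To see the containment rule in action without a Presidio install, here is a minimal self-contained sketch. The `Result` dataclass is a simplified stand-in for `RecognizerResult`, with `contained_in` assumed to mean span containment; the filtering logic mirrors the method above:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    """Simplified stand-in for RecognizerResult (illustrative only)."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        # Span containment, assumed to mirror RecognizerResult.contained_in.
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results: List[Result]) -> List[Result]:
    # Same ordering as the Presidio method: highest score first,
    # then earliest start, then longest span.
    ordered = sorted(
        set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start))
    )
    filtered: List[Result] = []
    for result in ordered:
        if result.score == 0:
            continue  # zero-score results are always dropped
        if any(
            result.contained_in(kept) and result.entity_type == kept.entity_type
            for kept in filtered
        ):
            continue  # contained in an already-kept result of the same type
        filtered.append(result)
    return filtered
```

A lower-scoring match nested inside a higher-scoring match of the same entity type is dropped, while an overlapping match of a different entity type survives.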

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs specifying which substring should be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
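
As a quick standalone illustration (the digits below are synthetic), the default Aadhaar replacement pairs collapse a formatted candidate to bare digits before checksum validation:

```python
def sanitize_value(text, replacement_pairs):
    # Same replacement logic as EntityRecognizer.sanitize_value.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Default pairs used by InAadhaarRecognizer: dashes, spaces, and colons removed.
cleaned = sanitize_value("3456-7890 1234", [("-", ""), (" ", ""), (":", "")])
```

After sanitization, `cleaned` is the 12-character string `"345678901234"`, which is what the length and checksum checks then operate on.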

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the detected pattern by running the Aadhaar checksum calculation.

Source code in presidio_analyzer/predefined_recognizers/in_aadhaar_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """Determine absolute value based on calculation."""
    sanitized_value = EntityRecognizer.sanitize_value(
        pattern_text, self.replacement_pairs
    )
    return self.__check_aadhaar(sanitized_value)

InPanRecognizer

Bases: PatternRecognizer

Recognizes Indian Permanent Account Number ("PAN").

The Permanent Account Number (PAN) is a ten-character alphanumeric code with the last character being a check digit calculated using a modified modulus-10 calculation. This recognizer identifies PAN using regex and context words. Reference: https://en.wikipedia.org/wiki/Permanent_account_number, https://incometaxindia.gov.in/Forms/tps/1.Permanent%20Account%20Number%20(PAN).pdf

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IN_PAN'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/in_pan_recognizer.py
class InPanRecognizer(PatternRecognizer):
    """
    Recognizes Indian Permanent Account Number ("PAN").

    The Permanent Account Number (PAN) is a ten digit alpha-numeric code
    with the last digit being a check digit calculated using a
    modified modulus 10 calculation.
    This recognizer identifies PAN using regex and context words.
    Reference: https://en.wikipedia.org/wiki/Permanent_account_number,
               https://incometaxindia.gov.in/Forms/tps/1.Permanent%20Account%20Number%20(PAN).pdf

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "PAN (High)",
            r"\b([A-Za-z]{3}[AaBbCcFfGgHhJjLlPpTt]{1}[A-Za-z]{1}[0-9]{4}[A-Za-z]{1})\b",
            0.85,
        ),
        Pattern(
            "PAN (Medium)",
            r"\b([A-Za-z]{5}[0-9]{4}[A-Za-z]{1})\b",
            0.6,
        ),
        Pattern(
            "PAN (Low)",
            r"\b((?=.*?[a-zA-Z])(?=.*?[0-9]{4})[\w@#$%^?~-]{10})\b",
            0.05,
        ),
    ]

    CONTEXT = [
        "permanent account number",
        "pan",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IN_PAN",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
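
The three pattern tiers can be checked with the standard `re` module alone. The sketch below runs the High and Medium patterns from the listing against a synthetic, non-real PAN-shaped string (`ABCPE1234F` is made up for illustration):

```python
import re

# Patterns copied from InPanRecognizer.PATTERNS (tier scores in comments).
PAN_HIGH = r"\b([A-Za-z]{3}[AaBbCcFfGgHhJjLlPpTt]{1}[A-Za-z]{1}[0-9]{4}[A-Za-z]{1})\b"  # 0.85
PAN_MEDIUM = r"\b([A-Za-z]{5}[0-9]{4}[A-Za-z]{1})\b"  # 0.6

text = "Quoting PAN ABCPE1234F on the form."  # synthetic example value

high = re.search(PAN_HIGH, text)
medium = re.search(PAN_MEDIUM, text)
```

Both tiers match here because the fourth character `P` is in the restricted letter set the High pattern requires; a string like `ABCXE1234F` would only match the Medium tier.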

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start and end indices and the same entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs specifying which substring should be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

InPassportRecognizer

Bases: PatternRecognizer

Recognizes Indian Passport Number.

The Indian Passport Number is an eight-character alphanumeric identifier.

Reference: https://www.bajajallianz.com/blog/travel-insurance-articles/where-is-passport-number-in-indian-passport.html

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IN_PASSPORT'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/in_passport_recognizer.py
class InPassportRecognizer(PatternRecognizer):
    """
    Recognizes Indian Passport Number.

    Indian Passport Number is an eight-character alphanumeric identifier.

    Reference:
    https://www.bajajallianz.com/blog/travel-insurance-articles/where-is-passport-number-in-indian-passport.html

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "PASSPORT",
            r"\b[A-Z][1-9]\d\s?\d{4}[1-9]\b",
            0.1,
        ),
    ]

    CONTEXT = ["passport", "indian passport", "passport number"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IN_PASSPORT",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
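
The detection pattern above can be exercised directly with Python's `re` module; a minimal sketch of what the regex engine would flag (the sample passport number `J8369854` is made up for illustration):

```python
import re

# The "PASSPORT" regex from the recognizer above (matched spans get score 0.1).
PASSPORT_PATTERN = r"\b[A-Z][1-9]\d\s?\d{4}[1-9]\b"

def find_passport_candidates(text: str):
    """Return (start, end, matched_text) spans the regex engine would flag."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(PASSPORT_PATTERN, text)]

print(find_passport_candidates("Passport no: J8369854 issued in Mumbai."))
# [(13, 21, 'J8369854')]
```

Note the low base score (0.1): the recognizer relies on the context words above ("passport", etc.) to boost confidence for genuine matches.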

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
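
Conceptually, the private pattern-matching step iterates over every configured pattern, scans the text with `re.finditer`, and emits one scored span per non-empty match. A simplified standalone sketch, assuming default `DOTALL | MULTILINE` flags (the `Match` dataclass here is a hypothetical stand-in for `RecognizerResult`):

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Match:
    """Hypothetical stand-in for RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float

def analyze_patterns(text: str,
                     patterns: List[Tuple[str, str, float]],
                     entity_type: str,
                     flags: int = re.DOTALL | re.MULTILINE) -> List[Match]:
    """Run each (name, regex, score) pattern over the text and collect scored spans."""
    results = []
    for _name, regex, score in patterns:
        for m in re.finditer(regex, text, flags=flags):
            if m.start() != m.end():  # ignore empty matches
                results.append(Match(entity_type, m.start(), m.end(), score))
    return results

spans = analyze_patterns(
    "Passport J8369854",
    [("PASSPORT", r"\b[A-Z][1-9]\d\s?\d{4}[1-9]\b", 0.1)],
    "IN_PASSPORT",
)
print(spans)
```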

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, improving the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will equal
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start and end indices and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start and end indices and entity types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs specifying which substring should be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs specifying which substring should be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
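
For example, a recognizer for identifiers that may be written with dashes or spaces can cleanse the candidate before validation; the replacement pairs below are illustrative (standalone replica of the static method above):

```python
def sanitize_value(text, replacement_pairs):
    # Apply each (search_string, replacement_string) pair in order.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

print(sanitize_value("MH 12-AB-1234", [("-", ""), (" ", "")]))
# MH12AB1234
```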

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

InVehicleRegistrationRecognizer

Bases: PatternRecognizer

Recognizes Indian Vehicle Registration Number issued by RTO.

Reference(s):
https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_India
https://en.wikipedia.org/wiki/Regional_Transport_Office
https://en.wikipedia.org/wiki/List_of_Regional_Transport_Office_districts_in_India

The registration scheme has changed over time, with multiple formats in play over the years; India has multiple active patterns for registration plates issued to different vehicle categories.
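
This variety is reflected in the recognizer's pattern list; for instance, the Bharat (BH) series pattern from the source below can be checked directly with `re` (the sample plates are made up for illustration):

```python
import re

# Bharat (BH) series pattern, as listed in the recognizer's PATTERNS (score 0.85).
BH_PATTERN = r"\b[2-9]{1}[1-9]{1}(BH)(?!0000)\d{4}[A-HJ-NP-Z]{2}\b"

for plate in ["21BH2345AA", "21BH0000AA", "MH12AB1234"]:
    # The (?!0000) lookahead rejects the all-zero serial; I and O are
    # excluded from the trailing letters via [A-HJ-NP-Z].
    print(plate, bool(re.search(BH_PATTERN, plate)))
```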

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IN_VEHICLE_REGISTRATION'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This allows greater variety in the input, e.g., by removing dashes or spaces

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the detected pattern against known state, union territory, and special-series codes.

Source code in presidio_analyzer/predefined_recognizers/in_vehicle_registration_recognizer.py
class InVehicleRegistrationRecognizer(PatternRecognizer):
    """
    Recognizes Indian Vehicle Registration Number issued by RTO.

    Reference(s):
    https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_India
    https://en.wikipedia.org/wiki/Regional_Transport_Office
    https://en.wikipedia.org/wiki/List_of_Regional_Transport_Office_districts_in_India

    The registration scheme has changed over time, with multiple formats
    in play over the years;
    India has multiple active patterns for registration plates issued to different
    vehicle categories.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
        for different strings to be used during pattern matching.
    This can allow a greater variety in input e.g. by removing dashes or spaces
    """

    PATTERNS = [
        Pattern(
            "India Vehicle Registration (Very Weak)",
            r"\b[A-Z]{1}(?!0000)[0-9]{4}\b",
            0.01,
        ),
        Pattern(
            "India Vehicle Registration (Very Weak)",
            r"\b[A-Z]{2}(?!0000)\d{4}\b",
            0.01,
        ),
        Pattern(
            "India Vehicle Registration (Very Weak)",
            r"\b(I)(?!00000)\d{5}\b",
            0.01,
        ),
        Pattern(
            "India Vehicle Registration (Weak)",
            r"\b[A-Z]{3}(?!0000)\d{4}\b",
            0.20,
        ),
        Pattern(
            "India Vehicle Registration (Medium)",
            r"\b\d{1,3}(CD|CC|UN)[1-9]{1}[0-9]{1,3}\b",
            0.40,
        ),
        Pattern(
            "India Vehicle Registration",
            r"\b[A-Z]{2}\d{1}[A-Z]{1,3}(?!0000)\d{4}\b",
            0.50,
        ),
        Pattern(
            "India Vehicle Registration",
            r"\b[A-Z]{2}\d{2}[A-Z]{1,2}(?!0000)\d{4}\b",
            0.50,
        ),
        Pattern(
            "India Vehicle Registration",
            r"\b[2-9]{1}[1-9]{1}(BH)(?!0000)\d{4}[A-HJ-NP-Z]{2}\b",
            0.85,
        ),
        Pattern(
            "India Vehicle Registration",
            r"\b(?!00)\d{2}(A|B|C|D|E|F|H|K|P|R|X)\d{6}[A-Z]{1}\b",
            0.85,
        ),
    ]

    CONTEXT = ["RTO", "vehicle", "plate", "registration"]

    # fmt: off
    in_vehicle_foreign_mission_codes = {
        84, 85, 89, 93, 94, 95, 97, 98, 99, 102, 104, 105, 106, 109, 111, 112,
        113, 117, 119, 120, 121, 122, 123, 125, 126, 128, 133, 134, 135, 137,
        141, 145, 147, 149, 152, 153, 155, 156, 157, 159, 160
    }

    in_vehicle_armed_forces_codes = {
        'A', 'B', 'C', 'D', 'E', 'F', 'H', 'K', 'P', 'R', 'X'}
    in_vehicle_diplomatic_codes = {"CC", "CD", "UN"}
    in_vehicle_dist_an = {"01"}
    in_vehicle_dist_ap = {"39", "40"}
    in_vehicle_dist_ar = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "19", "20", "22"
    }
    in_vehicle_dist_as = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34"
    }
    in_vehicle_dist_br = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "19",
        "21", "22", "24", "25", "26", "27", "28", "29", "30", "31", "32", "33",
        "34", "37", "38", "39", "43", "44", "45", "46", "50", "51", "52", "53",
        "55", "56"
    }
    in_vehicle_dist_cg = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
        "25", "26", "27", "28", "29", "30"
    }
    in_vehicle_dist_ch = {"01", "02", "03", "04"}
    in_vehicle_dist_dd = {"01", "02", "03"}
    in_vehicle_dist_dn = {"09"}  # old list
    in_vehicle_dist_dl = {
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13"}
    in_vehicle_dist_ga = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"}
    in_vehicle_dist_gj = {
        "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
        "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39"
    }
    in_vehicle_dist_hp = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98", "99"
    }
    in_vehicle_dist_hr = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98", "99"
    }
    in_vehicle_dist_jh = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24"
    }
    in_vehicle_dist_jk = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22"
    }
    in_vehicle_dist_ka = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71"
    }
    in_vehicle_dist_kl = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98", "99"
    }
    in_vehicle_dist_la = {"01", "02"}
    in_vehicle_dist_ld = {"01", "02", "03", "04", "05", "06", "07", "08", "09"}
    in_vehicle_dist_mh = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51"
    }
    in_vehicle_dist_ml = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10"}
    in_vehicle_dist_mn = {"01", "02", "03", "04", "05", "06", "07"}
    in_vehicle_dist_mp = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71"
    }
    in_vehicle_dist_mz = {"01", "02", "03", "04", "05", "06", "07", "08"}
    in_vehicle_dist_nl = {"01", "02", "03", "04", "05", "06", "07", "08", "09", "10"}
    in_vehicle_dist_od = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35"
    }
    in_vehicle_dist_or = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31"
    }  # old list
    in_vehicle_dist_pb = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98", "99"
    }
    in_vehicle_dist_py = {"01", "02", "03", "04", "05"}
    in_vehicle_dist_rj = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58"
    }
    in_vehicle_dist_sk = {"01", "02", "03", "04", "05", "06", "07", "08"}
    in_vehicle_dist_tn = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98", "99"
    }
    in_vehicle_dist_tr = {"01", "02", "03", "04", "05", "06", "07", "08"}
    in_vehicle_dist_ts = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38"
    }
    in_vehicle_dist_uk = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20"
    }
    in_vehicle_dist_up = {
        "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "22", "23",
        "24", "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35",
        "36", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47",
        "48", "49", "50", "51", "52", "53", "54", "55", "56", "57", "58", "59",
        "60", "61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71",
        "72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", "83",
        "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", "94", "95",
        "96"
    }
    in_vehicle_dist_wb = {
        "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12",
        "13", "14", "15", "16", "17", "18", "19", "20", "22", "23", "24", "25",
        "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", "36", "37",
        "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
        "50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", "61",
        "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", "72", "73",
        "74", "75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
        "86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", "97",
        "98"
    }
    in_union_territories = {"AN", "CH", "DH", "DL", "JK", "LA", "LD", "PY"}
    in_old_union_territories = {"CT", "DN"}
    in_states = {
        "AP", "AR", "AS", "BR", "CG", "GA", "GJ", "HR", "HP", "JH", "KA", "KL",
        "MP", "MH", "MN", "ML", "MZ", "NL", "OD", "PB", "RJ", "SK", "TN", "TS",
        "TR", "UP", "UK", "WB", "UT"
    }
    in_old_states = {"UL", "OR", "UA"}
    in_non_standard_state_or_ut = {"DD"}

    state_rto_district_map = {
        "AN": in_vehicle_dist_an,
        "AP": in_vehicle_dist_ap,
        "AR": in_vehicle_dist_ar,
        "AS": in_vehicle_dist_as,
        "BR": in_vehicle_dist_br,
        "CG": in_vehicle_dist_cg,
        "CH": in_vehicle_dist_ch,
        "DD": in_vehicle_dist_dd,
        "DN": in_vehicle_dist_dn,
        "DL": in_vehicle_dist_dl,
        "GA": in_vehicle_dist_ga,
        "GJ": in_vehicle_dist_gj,
        "HP": in_vehicle_dist_hp,
        "HR": in_vehicle_dist_hr,
        "JH": in_vehicle_dist_jh,
        "JK": in_vehicle_dist_jk,
        "KA": in_vehicle_dist_ka,
        "KL": in_vehicle_dist_kl,
        "LA": in_vehicle_dist_la,
        "LD": in_vehicle_dist_ld,
        "MH": in_vehicle_dist_mh,
        "ML": in_vehicle_dist_ml,
        "MN": in_vehicle_dist_mn,
        "MP": in_vehicle_dist_mp,
        "MZ": in_vehicle_dist_mz,
        "NL": in_vehicle_dist_nl,
        "OD": in_vehicle_dist_od,
        "OR": in_vehicle_dist_or,
        "PB": in_vehicle_dist_pb,
        "PY": in_vehicle_dist_py,
        "RJ": in_vehicle_dist_rj,
        "SK": in_vehicle_dist_sk,
        "TN": in_vehicle_dist_tn,
        "TR": in_vehicle_dist_tr,
        "TS": in_vehicle_dist_ts,
        "UK": in_vehicle_dist_uk,
        "UP": in_vehicle_dist_up,
        "WB": in_vehicle_dist_wb,
    }

    two_factor_registration_prefix = set()
    two_factor_registration_prefix |= in_union_territories
    two_factor_registration_prefix |= in_states
    two_factor_registration_prefix |= in_old_states
    two_factor_registration_prefix |= in_old_union_territories
    two_factor_registration_prefix |= in_non_standard_state_or_ut
    # fmt: on

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IN_VEHICLE_REGISTRATION",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs
            if replacement_pairs
            else [("-", ""), (" ", ""), (":", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """Determine absolute value based on calculation."""
        sanitized_value = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        return self.__check_vehicle_registration(sanitized_value)

    def __check_vehicle_registration(self, sanitized_value: str) -> bool:
        # Stays None (rather than False) when no validation rule matches
        is_valid_registration = None
        if len(sanitized_value) >= 8:
            first_two_char = sanitized_value[:2].upper()
            dist_code: str = ""

            if first_two_char in self.two_factor_registration_prefix:
                if sanitized_value[2].isdigit():
                    if sanitized_value[3].isdigit():
                        dist_code = sanitized_value[2:4]
                    else:
                        dist_code = sanitized_value[2:3]

                    registration_digits = sanitized_value[-4:]
                    if registration_digits.isnumeric():
                        if 0 < int(registration_digits) <= 9999:
                            if (
                                dist_code
                                and dist_code
                                in self.state_rto_district_map.get(first_two_char, "")
                            ):
                                is_valid_registration = True

                for diplomatic_vehicle_code in self.in_vehicle_diplomatic_codes:
                    if diplomatic_vehicle_code in sanitized_value:
                        vehicle_prefix = sanitized_value.partition(
                            diplomatic_vehicle_code
                        )[0]
                        if vehicle_prefix.isnumeric() and (
                            1 <= int(vehicle_prefix) <= 80
                            or int(vehicle_prefix)
                            in self.in_vehicle_foreign_mission_codes
                        ):
                            is_valid_registration = True

        return is_valid_registration
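The district-code check above can be sketched as a self-contained snippet. Note that `STATE_DISTRICTS` below is a small illustrative subset of the full state/district tables shipped with the recognizer, not the real data:

```python
# Simplified, self-contained version of __check_vehicle_registration.
# STATE_DISTRICTS is an illustrative subset, not the full RTO map.
STATE_DISTRICTS = {
    "MH": {"01", "02", "12"},  # Maharashtra (subset)
    "DL": {"01", "02", "03"},  # Delhi (subset)
}


def looks_like_registration(value: str) -> bool:
    """Check the <state><district>...<4-digit serial> structure."""
    if len(value) < 8:
        return False
    state = value[:2].upper()
    if state not in STATE_DISTRICTS or not value[2].isdigit():
        return False
    # District code is one or two digits, as in the recognizer above
    dist = value[2:4] if value[3].isdigit() else value[2:3]
    serial = value[-4:]
    return (
        serial.isnumeric()
        and 0 < int(serial) <= 9999
        and dist in STATE_DISTRICTS[state]
    )


print(looks_like_registration("MH12AB1234"))  # True
print(looks_like_registration("XX12AB1234"))  # False (unknown state code)
```

The real recognizer additionally handles diplomatic plates, where a numeric prefix precedes a code such as "CD".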

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
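The sort-then-filter logic above can be illustrated with a minimal stand-in for RecognizerResult; the `Span` class below is hypothetical, introduced only for this sketch:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Minimal, hypothetical stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end


def dedupe(spans):
    # Highest score first, then leftmost, then longest -- same key as above
    spans = sorted(set(spans), key=lambda s: (-s.score, s.start, -(s.end - s.start)))
    kept = []
    for s in spans:
        if s.score == 0:
            continue
        # Drop a span contained in an already-kept span of the same type
        if not any(s.contained_in(k) and s.entity_type == k.entity_type for k in kept):
            kept.append(s)
    return kept


spans = [
    Span("PHONE_NUMBER", 0, 12, 0.9),
    Span("PHONE_NUMBER", 0, 12, 0.9),  # exact duplicate, removed by set()
    Span("PHONE_NUMBER", 4, 12, 0.5),  # contained in the first, removed
]
print(len(dedupe(spans)))  # 1
```

Exact duplicates are removed by the `set()` conversion; overlapping results survive only if they are not fully contained in a higher-ranked result of the same entity type.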

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
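For instance, with the default replacement pairs used by the vehicle-registration recognizer above, sanitization reduces a formatted plate to a compact form (the standalone function below simply repeats the loop shown above):

```python
def sanitize_value(text, replacement_pairs):
    # Same replacement loop as EntityRecognizer.sanitize_value above
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Default pairs of the vehicle-registration recognizer: drop "-", " ", ":"
pairs = [("-", ""), (" ", ""), (":", "")]
print(sanitize_value("MH-12 AB:1234", pairs))  # MH12AB1234
```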

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the detected pattern as an Indian vehicle registration.

Source code in presidio_analyzer/predefined_recognizers/in_vehicle_registration_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """Determine absolute value based on calculation."""
    sanitized_value = EntityRecognizer.sanitize_value(
        pattern_text, self.replacement_pairs
    )
    return self.__check_vehicle_registration(sanitized_value)

InVoterRecognizer

Bases: PatternRecognizer

Recognize Indian Voter/Election ID (EPIC).

The Elector's Photo Identity Card, or voter ID, is a ten-digit alphanumeric code issued by the Election Commission of India to domiciled residents who have reached the age of 18. Ref: https://en.wikipedia.org/wiki/Voter_ID_(India)

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IN_VOTER'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/in_voter_recognizer.py
class InVoterRecognizer(PatternRecognizer):
    """
    Recognize Indian Voter/Election Id(EPIC).

    The Elector's Photo Identity Card or Voter id is a ten digit
    alpha-numeric code issued by Election Commission of India
    to adult domiciles who have reached the age of 18
    Ref: https://en.wikipedia.org/wiki/Voter_ID_(India)

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "VOTER",
            r"\b([A-Za-z]{1}[ABCDGHJKMNPRSYabcdghjkmnprsy]{1}[A-Za-z]{1}([0-9]){7})\b",
            0.4,
        ),
        Pattern(
            "VOTER",
            r"\b([A-Za-z]){3}([0-9]){7}\b",
            0.3,
        ),
    ]

    CONTEXT = [
        "voter",
        "epic",
        "elector photo identity card",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IN_VOTER",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            patterns=patterns,
            context=context,
            supported_language=supported_language,
            supported_entity=supported_entity,
        )
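The two EPIC patterns above can be exercised directly with Python's `re` module; the sample IDs below are synthetic, chosen only to show which pattern fires:

```python
import re

# The two EPIC patterns from PATTERNS above (scores 0.4 and 0.3)
strong = re.compile(
    r"\b([A-Za-z]{1}[ABCDGHJKMNPRSYabcdghjkmnprsy]{1}[A-Za-z]{1}([0-9]){7})\b"
)
weak = re.compile(r"\b([A-Za-z]){3}([0-9]){7}\b")

print(bool(strong.search("Voter id XDN0782433")))  # True
print(bool(weak.search("Voter id XQZ0782433")))    # True
print(bool(strong.search("XQZ0782433")))           # False: Q not in 2nd-letter set
```

The stricter pattern constrains the second letter to a known set and so scores higher; the looser three-letters-plus-seven-digits pattern catches the rest at lower confidence.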

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value equals raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contains elements such as lemmatized tokens for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical starts, ends, and types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py (lines 117-125)
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py (lines 127-137)
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
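`validate_result` and `invalidate_result` are hooks that subclasses override; the base class returns `None` (no decision). A standalone sketch of the kinds of checks a subclass might run — a Luhn checksum for validation and an all-same-digits test for invalidation (both helper functions are illustrative, not presidio APIs):

```python
def luhn_checksum_ok(digits: str) -> bool:
    """Validate a digit string with the Luhn algorithm (used for card numbers)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def all_same_digits(group: str) -> bool:
    """Invalidate a group, e.g. an SSN segment made of one repeated digit."""
    return len(set(group)) == 1

print(luhn_checksum_ok("79927398713"))  # classic Luhn test number → True
print(all_same_digits("000"))           # → True, so the result would be pruned
```

In a real recognizer, `validate_result` would return the first check's result and `invalidate_result` the second's.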

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py (lines 139-172)
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

IpRecognizer

Bases: PatternRecognizer

Recognize IP address using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IP_ADDRESS'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs given as an argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

build_regex_explanation

Construct an explanation for why this entity was detected.

invalidate_result

Check if the pattern text cannot be validated as an IP address.

Source code in presidio_analyzer/predefined_recognizers/ip_recognizer.py (lines 7-63)
class IpRecognizer(PatternRecognizer):
    """
    Recognize IP address using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "IPv4",
            r"\b(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "IPv6",
            r"\b(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))\b",  # noqa: E501
            0.6,
        ),
        Pattern(
            "IPv6",
            r"::",
            0.1,
        ),
    ]

    CONTEXT = ["ip", "ipv4", "ipv6"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "IP_ADDRESS",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def invalidate_result(self, pattern_text: str) -> bool:
        """
        Check if the pattern text cannot be validated as an IP address.

        :param pattern_text: Text detected as pattern by regex
        :return: True if invalidated
        """
        try:
            ipaddress.ip_address(pattern_text)
        except ValueError:
            return True
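Putting the pieces together: the recognizer first finds regex candidates, then `invalidate_result` drops any match that `ipaddress` cannot parse. A stdlib-only sketch of that flow (the simplified IPv4 pattern and `find_ips` helper below are illustrative, not the recognizer's actual pattern or API):

```python
import ipaddress
import re

# Loose candidate pattern; the real recognizer enforces the 0-255 range in regex.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def find_ips(text: str) -> list:
    """Return (start, end, match) for candidates that parse as IP addresses."""
    results = []
    for m in CANDIDATE.finditer(text):
        try:
            ipaddress.ip_address(m.group())   # the invalidation step
        except ValueError:
            continue                          # e.g. 999.1.1.1 is pruned
        results.append((m.start(), m.end(), m.group()))
    return results

print(find_ips("Server at 192.168.0.1, bogus at 999.1.1.1"))
# → [(10, 21, '192.168.0.1')]
```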

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py (lines 78-100)
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
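`__analyze_patterns` (private, not shown here) runs each pattern over the text and emits one result per match, carrying that pattern's score. A standalone sketch of that loop, using plain tuples in place of `RecognizerResult` (the pattern names and scores are illustrative):

```python
import re
from typing import List, Tuple

# (pattern_name, regex, score) triples, mirroring Pattern objects.
PATTERNS = [
    ("ZIP", r"\b\d{5}\b", 0.4),
    ("PHONE", r"\b\d{3}-\d{4}\b", 0.6),
]

def analyze(text: str, flags: int = re.IGNORECASE) -> List[Tuple[str, int, int, float]]:
    """Return (pattern_name, start, end, score) for every regex match."""
    results = []
    for name, regex, score in PATTERNS:
        for m in re.finditer(regex, text, flags):
            results.append((name, m.start(), m.end(), score))
    return results

print(analyze("zip 12345, call 555-0199"))
# → [('ZIP', 4, 9, 0.4), ('PHONE', 16, 24, 0.6)]
```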

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py (lines 90-117)
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result's score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
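A derived class (or presidio's default `LemmaContextAwareEnhancer`) boosts a result's score when a context word appears near the match. A much-simplified standalone sketch of that idea — the function name mirrors the hook, but the tuple representation, window size, and boost factor are all illustrative assumptions:

```python
from typing import List, Tuple

def enhance_using_context(
    text: str,
    results: List[Tuple[int, int, float]],   # (start, end, score)
    context: List[str],
    window: int = 20,
    boost: float = 0.3,
) -> List[Tuple[int, int, float]]:
    """Raise the score (capped at 1.0) if a context word precedes the match."""
    enhanced = []
    for start, end, score in results:
        preceding = text[max(0, start - window):start].lower()
        if any(word in preceding for word in context):
            score = min(1.0, score + boost)
        enhanced.append((start, end, score))
    return enhanced

hits = [(9, 20, 0.6)]
print(enhance_using_context("my ip is 192.168.0.1", hits, ["ip"]))
```

The real enhancer works on lemmatized tokens from `nlp_artifacts` rather than raw substrings, which avoids partial-word false positives.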

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 119-125)
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 127-133)
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 135-141)
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py (lines 254-264)
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py (lines 266-274)
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py (lines 166-198)
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
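The filtering above relies on `RecognizerResult` equality, hashing, and `contained_in`. A standalone sketch with a minimal stand-in class (the `Result` dataclass here is illustrative, not presidio's `RecognizerResult`):

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Result:
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[Result]) -> List[Result]:
    """Keep the highest-scoring, widest results; drop contained duplicates."""
    ordered = sorted(set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered: List[Result] = []
    for result in ordered:
        if result.score == 0:
            continue
        if any(
            result.contained_in(f) and result.entity_type == f.entity_type
            for f in filtered
        ):
            continue  # already covered by a wider or equal result
        filtered.append(result)
    return filtered

wide = Result("IP_ADDRESS", 0, 11, 0.6)
inner = Result("IP_ADDRESS", 0, 7, 0.6)  # contained in `wide`, same type
print(remove_duplicates([wide, inner, wide]))
```

Sorting by descending score and descending width first means a contained result always meets its "container" before it can be kept.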

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs given as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py (lines 200-211)
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string using the replacement pairs given as an argument.

    :param text: input string
    :param replacement_pairs: pairs of (search string, replacement string)
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py (lines 117-125)
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py (lines 139-172)
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

invalidate_result

invalidate_result(pattern_text: str) -> bool

Check if the pattern text cannot be validated as an IP address.

PARAMETER DESCRIPTION
pattern_text

Text detected as pattern by regex

TYPE: str

RETURNS DESCRIPTION
bool

True if invalidated

Source code in presidio_analyzer/predefined_recognizers/ip_recognizer.py (lines 53-63)
def invalidate_result(self, pattern_text: str) -> bool:
    """
    Check if the pattern text cannot be validated as an IP address.

    :param pattern_text: Text detected as pattern by regex
    :return: True if invalidated
    """
    try:
        ipaddress.ip_address(pattern_text)
    except ValueError:
        return True

ItDriverLicenseRecognizer

Bases: PatternRecognizer

Recognizes IT Driver License using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'it'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IT_DRIVER_LICENSE'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs given as an argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/it_driver_license_recognizer.py (lines 6-42)
class ItDriverLicenseRecognizer(PatternRecognizer):
    """
    Recognizes IT Driver License using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Driver License",
            (
                r"\b(?i)(([A-Z]{2}\d{7}[A-Z])"
                r"|(^[U]1[BCDEFGHLMNPRSTUWYXZ]\w{6}[A-Z]))\b"
            ),
            0.2,
        ),
    ]
    CONTEXT = ["patente", "patente di guida", "licenza", "licenza di guida"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "it",
        supported_entity: str = "IT_DRIVER_LICENSE",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
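A quick way to see what this recognizer's pattern matches, using only the stdlib `re` module. This sketch keeps just the first alternative of the pattern and applies case-insensitivity via `re.IGNORECASE` instead of the inline `(?i)` group (which recent Python versions only accept at the start of a pattern):

```python
import re

# First alternative of the recognizer's pattern: two letters, seven digits, one letter.
DRIVER_LICENSE = re.compile(r"\b[A-Z]{2}\d{7}[A-Z]\b", re.IGNORECASE)

sample = "La patente di guida AB1234567C è stata rinnovata."
match = DRIVER_LICENSE.search(sample)
print(match.group())  # → AB1234567C
```

With the low base score of 0.2, matches like this rely on context words ("patente", "licenza") to reach a useful confidence.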

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py (lines 78-100)
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py (lines 90-117)
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result's score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 119-125)
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 127-133)
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 135-141)
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py (lines 254-264)
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py (lines 266-274)
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py (lines 166-198)
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs given as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string)

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py (lines 200-211)
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
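
Since `sanitize_value` simply chains `str.replace` calls, a typical use is stripping separators from a candidate match before validating it; a minimal standalone sketch:

```python
from typing import List, Tuple

def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    # Apply each (search, replacement) pair in order.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Typical use: remove dashes and spaces before running a checksum.
print(sanitize_value("4095-2609-9393-4932", [("-", ""), (" ", "")]))  # 4095260993934932
```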

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py (lines 117-125)
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py (lines 127-137)
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py (lines 139-172)
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

ItFiscalCodeRecognizer

Bases: PatternRecognizer

Recognizes IT Fiscal Code using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'it'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IT_FISCAL_CODE'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/it_fiscal_code_recognizer.py (lines 6-187)
class ItFiscalCodeRecognizer(PatternRecognizer):
    """
    Recognizes IT Fiscal Code using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Fiscal Code",
            (
                r"(?i)((?:[A-Z][AEIOU][AEIOUX]|[AEIOU]X{2}"
                r"|[B-DF-HJ-NP-TV-Z]{2}[A-Z]){2}"
                r"(?:[\dLMNP-V]{2}(?:[A-EHLMPR-T](?:[04LQ][1-9MNP-V]|[15MR][\dLMNP-V]"
                r"|[26NS][0-8LMNP-U])|[DHPS][37PT][0L]|[ACELMRT][37PT][01LM]"
                r"|[AC-EHLMPR-T][26NS][9V])|(?:[02468LNQSU][048LQU]"
                r"|[13579MPRTV][26NS])B[26NS][9V])(?:[A-MZ][1-9MNP-V][\dLMNP-V]{2}"
                r"|[A-M][0L](?:[1-9MNP-V][\dLMNP-V]|[0L][1-9MNP-V]))[A-Z])"
            ),
            0.3,
        ),
    ]
    CONTEXT = ["codice fiscale", "cf"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "it",
        supported_entity: str = "IT_FISCAL_CODE",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to be validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        pattern_text = pattern_text.upper()
        control = pattern_text[-1]
        text_to_validate = pattern_text[:-1]
        odd_values = text_to_validate[0::2]
        even_values = text_to_validate[1::2]

        # Odd values
        map_odd = {
            "0": 1,
            "1": 0,
            "2": 5,
            "3": 7,
            "4": 9,
            "5": 13,
            "6": 15,
            "7": 17,
            "8": 19,
            "9": 21,
            "A": 1,
            "B": 0,
            "C": 5,
            "D": 7,
            "E": 9,
            "F": 13,
            "G": 15,
            "H": 17,
            "I": 19,
            "J": 21,
            "K": 2,
            "L": 4,
            "M": 18,
            "N": 20,
            "O": 11,
            "P": 3,
            "Q": 6,
            "R": 8,
            "S": 12,
            "T": 14,
            "U": 16,
            "V": 10,
            "W": 22,
            "X": 25,
            "Y": 24,
            "Z": 23,
        }

        odd_sum = 0
        for char in odd_values:
            odd_sum += map_odd[char]

        # Even values
        map_even = {
            "0": 0,
            "1": 1,
            "2": 2,
            "3": 3,
            "4": 4,
            "5": 5,
            "6": 6,
            "7": 7,
            "8": 8,
            "9": 9,
            "A": 0,
            "B": 1,
            "C": 2,
            "D": 3,
            "E": 4,
            "F": 5,
            "G": 6,
            "H": 7,
            "I": 8,
            "J": 9,
            "K": 10,
            "L": 11,
            "M": 12,
            "N": 13,
            "O": 14,
            "P": 15,
            "Q": 16,
            "R": 17,
            "S": 18,
            "T": 19,
            "U": 20,
            "V": 21,
            "W": 22,
            "X": 23,
            "Y": 24,
            "Z": 25,
        }

        even_sum = 0
        for char in even_values:
            even_sum += map_even[char]

        # Mod value
        map_mod = {
            0: "A",
            1: "B",
            2: "C",
            3: "D",
            4: "E",
            5: "F",
            6: "G",
            7: "H",
            8: "I",
            9: "J",
            10: "K",
            11: "L",
            12: "M",
            13: "N",
            14: "O",
            15: "P",
            16: "Q",
            17: "R",
            18: "S",
            19: "T",
            20: "U",
            21: "V",
            22: "W",
            23: "X",
            24: "Y",
            25: "Z",
        }
        check_value = map_mod[((odd_sum + even_sum) % 26)]

        if check_value == control:
            result = True
        else:
            result = None

        return result
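
The lookup tables above can be compressed: digits share the odd-position values of the letters A-J, and the even-position value of a character is simply its index. A standalone sketch of the same checksum (the `ODD_VALUES` table and helper names are illustrative, not part of Presidio's API; `RSSMRA85T10A562S` is a commonly used synthetic sample code):

```python
# Odd-position substitution values for 'A'-'Z'; digits '0'-'9' reuse the
# first ten entries (the values of 'A'-'J').
ODD_VALUES = [1, 0, 5, 7, 9, 13, 15, 17, 19, 21, 2, 4, 18, 20, 11,
              3, 6, 8, 12, 14, 16, 10, 22, 25, 24, 23]

def _index(char: str) -> int:
    # Map '0'-'9' to 0-9 and 'A'-'Z' to 0-25.
    return int(char) if char.isdigit() else ord(char) - ord("A")

def fiscal_code_checksum(code: str) -> str:
    body = code.upper()[:15]
    # Official positions are 1-based, so "odd" positions sit at even
    # 0-based indices (body[0::2]); even positions at body[1::2].
    odd_sum = sum(ODD_VALUES[_index(c)] for c in body[0::2])
    even_sum = sum(_index(c) for c in body[1::2])
    return chr((odd_sum + even_sum) % 26 + ord("A"))

def is_valid(code: str) -> bool:
    return len(code) == 16 and fiscal_code_checksum(code) == code.upper()[-1]

print(is_valid("RSSMRA85T10A562S"))  # True
```

Note that the recognizer's `validate_result` returns `None` (rather than `False`) on checksum failure, leaving the original pattern score in place instead of invalidating the match outright.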

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py (lines 78-100)
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of RecognizerResult detections
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
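
`__analyze_patterns` is private, but the essence of a pattern scan can be sketched with plain `re` (the `scan` helper and `Match` tuple below are illustrative, not Presidio's implementation, which additionally builds explanations and applies validation):

```python
import re
from typing import List, NamedTuple

class Match(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float

def scan(text: str, entity_type: str, pattern: str, score: float,
         flags: int = re.IGNORECASE | re.MULTILINE) -> List[Match]:
    # Run the regex and emit one result per non-empty match span.
    return [
        Match(entity_type, m.start(), m.end(), score)
        for m in re.finditer(pattern, text, flags=flags)
        if m.start() != m.end()
    ]

matches = scan("Call me at 212-555-0199.", "PHONE_NUMBER", r"\b\d{3}-\d{3}-\d{4}\b", 0.6)
print(matches)
```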

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts, which contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py (lines 90-117)
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
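
As a toy illustration of what a context-based score boost can look like (this is not the logic of `LemmaContextAwareEnhancer`; the `boost_by_context` helper, window size, and boost value are invented for this sketch):

```python
from typing import Iterable

def boost_by_context(score: float, text: str, start: int,
                     context_words: Iterable[str],
                     window: int = 20, boost: float = 0.35) -> float:
    # Look at the characters preceding the match for any context word.
    prefix = text[max(0, start - window):start].lower()
    if any(word in prefix for word in context_words):
        return min(score + boost, 1.0)
    return score

# The fiscal code starts at index 16, right after "codice fiscale: ".
print(boost_by_context(0.3, "codice fiscale: RSSMRA85T10A562S", 16,
                       ["codice fiscale", "cf"]))
```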

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 119-125)
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 127-133)
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py (lines 135-141)
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py (lines 254-264)
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py (lines 266-274)
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
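
Together, `to_dict` and `from_dict` form a serialization round trip. A simplified sketch of the same pattern with stand-in dataclasses (not the real `Pattern`/`PatternRecognizer` classes):

```python
from dataclasses import dataclass, asdict, field
from typing import Dict, List

# Simplified stand-ins for Pattern / PatternRecognizer serialization.
@dataclass
class Pattern:
    name: str
    regex: str
    score: float

    def to_dict(self) -> Dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Dict) -> "Pattern":
        return cls(**d)

@dataclass
class Recognizer:
    supported_entity: str
    patterns: List[Pattern] = field(default_factory=list)

    def to_dict(self) -> Dict:
        # Nested objects are serialized recursively, as in the real API.
        return {"supported_entity": self.supported_entity,
                "patterns": [p.to_dict() for p in self.patterns]}

    @classmethod
    def from_dict(cls, d: Dict) -> "Recognizer":
        d = dict(d)  # avoid mutating the caller's dict
        d["patterns"] = [Pattern.from_dict(p) for p in d.get("patterns", [])]
        return cls(**d)

rec = Recognizer("IT_FISCAL_CODE", [Pattern("Fiscal Code", r"[A-Z0-9]{16}", 0.3)])
assert Recognizer.from_dict(rec.to_dict()) == rec
```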

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List of recognizer results, possibly containing duplicates

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

A filtered list of unique results

Source code in presidio_analyzer/entity_recognizer.py (lines 166-198)
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of (search string, replacement string) to apply

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py (lines 200-211)
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py (lines 127-137)
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py (lines 139-172)
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/it_fiscal_code_recognizer.py (lines 49-187)
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    pattern_text = pattern_text.upper()
    control = pattern_text[-1]
    text_to_validate = pattern_text[:-1]
    odd_values = text_to_validate[0::2]
    even_values = text_to_validate[1::2]

    # Odd values
    map_odd = {
        "0": 1,
        "1": 0,
        "2": 5,
        "3": 7,
        "4": 9,
        "5": 13,
        "6": 15,
        "7": 17,
        "8": 19,
        "9": 21,
        "A": 1,
        "B": 0,
        "C": 5,
        "D": 7,
        "E": 9,
        "F": 13,
        "G": 15,
        "H": 17,
        "I": 19,
        "J": 21,
        "K": 2,
        "L": 4,
        "M": 18,
        "N": 20,
        "O": 11,
        "P": 3,
        "Q": 6,
        "R": 8,
        "S": 12,
        "T": 14,
        "U": 16,
        "V": 10,
        "W": 22,
        "X": 25,
        "Y": 24,
        "Z": 23,
    }

    odd_sum = 0
    for char in odd_values:
        odd_sum += map_odd[char]

    # Even values
    map_even = {
        "0": 0,
        "1": 1,
        "2": 2,
        "3": 3,
        "4": 4,
        "5": 5,
        "6": 6,
        "7": 7,
        "8": 8,
        "9": 9,
        "A": 0,
        "B": 1,
        "C": 2,
        "D": 3,
        "E": 4,
        "F": 5,
        "G": 6,
        "H": 7,
        "I": 8,
        "J": 9,
        "K": 10,
        "L": 11,
        "M": 12,
        "N": 13,
        "O": 14,
        "P": 15,
        "Q": 16,
        "R": 17,
        "S": 18,
        "T": 19,
        "U": 20,
        "V": 21,
        "W": 22,
        "X": 23,
        "Y": 24,
        "Z": 25,
    }

    even_sum = 0
    for char in even_values:
        even_sum += map_even[char]

    # Mod value
    map_mod = {
        0: "A",
        1: "B",
        2: "C",
        3: "D",
        4: "E",
        5: "F",
        6: "G",
        7: "H",
        8: "I",
        9: "J",
        10: "K",
        11: "L",
        12: "M",
        13: "N",
        14: "O",
        15: "P",
        16: "Q",
        17: "R",
        18: "S",
        19: "T",
        20: "U",
        21: "V",
        22: "W",
        23: "X",
        24: "Y",
        25: "Z",
    }
    check_value = map_mod[((odd_sum + even_sum) % 26)]

    if check_value == control:
        result = True
    else:
        result = None

    return result

ItIdentityCardRecognizer

Bases: PatternRecognizer

Recognizes Italian Identity Card number using case-insensitive regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'it'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IT_IDENTITY_CARD'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/it_identity_card_recognizer.py (lines 17-70)
class ItIdentityCardRecognizer(PatternRecognizer):
    """
    Recognizes Italian Identity Card number using case-insensitive regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Paper-based Identity Card (very weak)",
            # The number is composed of 2 letters, space (optional), 7 digits
            r"(?i)\b[A-Z]{2}\s?\d{7}\b",  # noqa: E501
            0.01,
        ),
        Pattern(
            "Electronic Identity Card (CIE) 2.0 (very weak)",
            r"(?i)\b\d{7}[A-Z]{2}\b",  # noqa: E501
            0.01,
        ),
        Pattern(
            "Electronic Identity Card (CIE) 3.0 (very weak)",
            r"(?i)\b[A-Z]{2}\d{5}[A-Z]{2}\b",  # noqa: E501
            0.01,
        ),
    ]

    CONTEXT = [
        "carta",
        "identità",
        "elettronica",
        "cie",
        "documento",
        "riconoscimento",
        "espatrio",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "it",
        supported_entity: str = "IT_IDENTITY_CARD",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
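
The three patterns can be exercised directly with Python's `re` module; the candidate strings below are made up for illustration:

```python
import re

# The three (very weak) patterns from ItIdentityCardRecognizer.
PATTERNS = {
    "paper": r"(?i)\b[A-Z]{2}\s?\d{7}\b",       # 2 letters, optional space, 7 digits
    "cie_2": r"(?i)\b\d{7}[A-Z]{2}\b",          # 7 digits, 2 letters
    "cie_3": r"(?i)\b[A-Z]{2}\d{5}[A-Z]{2}\b",  # 2 letters, 5 digits, 2 letters
}

def match_formats(candidate: str):
    # Return the names of every card format the candidate fully matches.
    return [name for name, pat in PATTERNS.items() if re.fullmatch(pat, candidate)]

# Made-up sample values, for illustration only.
print(match_formats("AA 1234567"))  # ['paper']
print(match_formats("1234567AA"))   # ['cie_2']
print(match_formats("CA12345AA"))   # ['cie_3']
```

The 0.01 pattern scores mean a bare match contributes almost nothing; in practice these results only become useful once boosted by the Italian context words ("carta", "cie", "documento", ...).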

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py (lines 78-100)
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of RecognizerResult detections
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
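A derived class could, for instance, boost a result's score when a context word appears in a window just before the entity. A standalone sketch of that idea (the names, the 35-character window, and the 0.35 boost are illustrative assumptions, not Presidio's LemmaContextAwareEnhancer):

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    start: int
    end: int
    score: float
    recognition_metadata: Dict = field(default_factory=dict)


def enhance_with_context(text: str, results: List[Result],
                         context: List[str], window: int = 35,
                         boost: float = 0.35) -> List[Result]:
    """Boost a result's score if a context word occurs shortly before it."""
    for result in results:
        prefix = text[max(0, result.start - window):result.start].lower()
        if any(word in prefix for word in context):
            result.score = min(result.score + boost, 1.0)
            # Mirror Presidio's convention of flagging context-boosted scores.
            result.recognition_metadata["is_score_enhanced_by_context"] = True
    return results


res = enhance_with_context("passaporto n. AA1234567", [Result(14, 23, 0.01)],
                           context=["passaporto"])
```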

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
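The to_dict/from_dict pair supports a simple serialization round trip. A minimal sketch of the same idiom on a toy configuration type (this class is hypothetical, not part of Presidio):

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class SimpleRecognizerConfig:
    """Minimal stand-in for a serializable recognizer configuration."""
    supported_entity: str
    supported_language: str = "en"

    def to_dict(self) -> Dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, config_dict: Dict) -> "SimpleRecognizerConfig":
        return cls(**config_dict)


original = SimpleRecognizerConfig("IT_PASSPORT", "it")
restored = SimpleRecognizerConfig.from_dict(original.to_dict())  # equals original
```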

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
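The filtering above sorts by score, drops zero-score results, and discards spans contained in an already-kept span of the same type. The same logic can be sketched standalone on a tiny hashable span type (names are illustrative, not RecognizerResult):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results: List[Span]) -> List[Span]:
    """Keep highest-scoring spans; drop zero scores, exact duplicates,
    and spans contained in an already-kept span of the same type."""
    ordered = sorted(set(results),
                     key=lambda s: (-s.score, s.start, -(s.end - s.start)))
    kept: List[Span] = []
    for r in ordered:
        if r.score == 0 or r in kept:
            continue
        if any(r.contained_in(k) and r.entity_type == k.entity_type
               for k in kept):
            continue
        kept.append(r)
    return kept


spans = [Span("PHONE", 0, 12, 0.9), Span("PHONE", 0, 7, 0.4),
         Span("PHONE", 0, 12, 0.9), Span("EMAIL", 20, 30, 0.0)]
```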

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to search for and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
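As a quick usage sketch, the same replacement-pair normalization in plain Python, applied the way recognizers typically use it (stripping separators before checksum validation):

```python
from typing import List, Tuple


def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """Apply each (search, replacement) pair in order."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Normalize a VAT-like string before running a checksum on it.
cleaned = sanitize_value("12-345 678_90", [("-", ""), (" ", ""), ("_", "")])
```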

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
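Following the SSN example in the docstring, invalidation logic might prune matches where a dash-separated group repeats a single digit. A hedged standalone sketch of such pruning (this is not Presidio's actual SSN recognizer implementation):

```python
def invalidate_ssn_like(pattern_text: str) -> bool:
    """Return True (i.e., invalidate) if the value is not three dash-separated
    groups, or if any group consists of one repeated digit."""
    groups = pattern_text.split("-")
    if len(groups) != 3:
        return True
    return any(len(set(group)) == 1 for group in groups)
```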

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

ItPassportRecognizer

Bases: PatternRecognizer

Recognizes Italian passport numbers using a case-insensitive regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'it'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IT_PASSPORT'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/it_passport_recognizer.py
class ItPassportRecognizer(PatternRecognizer):
    """
    Recognizes IT Passport number using case-insensitive regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Passport (very weak)",
            r"(?i)\b[A-Z]{2}\d{7}\b",  # noqa: E501
            0.01,
        ),
    ]

    CONTEXT = [
        "passaporto",
        "elettronico",
        "italiano",
        "viaggio",
        "viaggiare",
        "estero",
        "documento",
        "dogana",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "it",
        supported_entity: str = "IT_PASSPORT",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
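The recognizer's "very weak" pattern (two letters followed by seven digits, case-insensitive) can be exercised directly with Python's re module:

```python
import re

# Two letters followed by seven digits, case-insensitive, on word boundaries.
PASSPORT_RE = re.compile(r"(?i)\b[A-Z]{2}\d{7}\b")

match = PASSPORT_RE.search("Il mio passaporto è AA1234567.")  # matches "AA1234567"
```

The 0.01 score reflects how weak this pattern is on its own; in practice the context words (passaporto, viaggio, ...) are what raise the confidence.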

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of substrings to search for and their replacement values

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

ItVatCodeRecognizer

Bases: PatternRecognizer

Recognizes Italian VAT code using regex and checksum.

For more information about the Italian VAT code, see: https://en.wikipedia.org/wiki/VAT_identification_number#:~:text=%5B2%5D)-,Italy,-Partita%20IVA

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'it'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'IT_VAT_CODE'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/it_vat_code.py
class ItVatCodeRecognizer(PatternRecognizer):
    """
    Recognizes Italian VAT code using regex and checksum.

    For more information about italian VAT code:
        https://en.wikipedia.org/wiki/VAT_identification_number#:~:text=%5B2%5D)-,Italy,-Partita%20IVA

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    # Class variables
    PATTERNS = [
        Pattern(
            "IT Vat code (piva)",
            r"\b([0-9][ _]?){11}\b",
            0.1,
        )
    ]

    CONTEXT = ["piva", "partita iva", "pi"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "it",
        supported_entity: str = "IT_VAT_CODE",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs
            if replacement_pairs
            else [("-", ""), (" ", ""), ("_", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """

        # Pre-processing before validation checks
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)

        # Edge-case that passes the checksum even though it is not a
        # valid italian vat code.
        if text == "00000000000":
            return False

        x = 0
        y = 0

        for i in range(0, 5):
            x += int(text[2 * i])
            tmp_y = int(text[2 * i + 1]) * 2
            if tmp_y > 9:
                tmp_y = tmp_y - 9
            y += tmp_y

        t = (x + y) % 10
        c = (10 - t) % 10

        if c == int(text[10]):
            result = True
        else:
            result = False

        return result
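The arithmetic above is the standard Italian VAT (partita IVA) check-digit scheme: sum the digits in odd positions, double the digits in even positions (folding results above 9), and compare the resulting check digit with the eleventh digit. A minimal standalone sketch of the same validation, re-implemented outside the recognizer for illustration:

```python
def validate_it_vat(text: str) -> bool:
    """Check the Italian VAT (partita IVA) check digit on an 11-digit string."""
    # Edge case: all zeros passes the checksum but is not a valid VAT code
    if text == "00000000000":
        return False
    x = sum(int(text[2 * i]) for i in range(5))  # digits in odd positions
    y = 0
    for i in range(5):                           # digits in even positions, doubled
        doubled = int(text[2 * i + 1]) * 2
        y += doubled - 9 if doubled > 9 else doubled
    check = (10 - (x + y) % 10) % 10
    return check == int(text[10])

print(validate_it_vat("12345678903"))  # True: 3 is the correct check digit
print(validate_it_vat("12345678901"))  # False
```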

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
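The private `__analyze_patterns` step can be approximated with plain `re` — an illustrative sketch, not the actual implementation. The pattern name, regex, and base score below come from the IT VAT recognizer above; the flag choice is an assumption:

```python
import re

# (name, regex, base score), as in the IT VAT recognizer's PATTERNS
PATTERNS = [("IT Vat code (piva)", r"\b([0-9][ _]?){11}\b", 0.1)]

def scan(text, patterns, flags=re.IGNORECASE | re.MULTILINE):
    results = []
    for name, regex, score in patterns:
        for match in re.finditer(regex, text, flags):
            results.append({"pattern": name, "start": match.start(),
                            "end": match.end(), "score": score})
    return results

print(scan("piva: 12345678903", PATTERNS))
# one match covering characters 6-17, with score 0.1
```

In the real recognizer, each match is also run through `validate_result` / `invalidate_result` and wrapped in a `RecognizerResult` with an `AnalysisExplanation`.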

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
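The sort-then-filter pattern above can be demonstrated with a minimal stand-in for RecognizerResult (the `Span` dataclass below is hypothetical, for illustration only): results are ordered by descending score, zero-score and duplicate results are dropped, and a result fully contained in an already-kept result of the same entity type is discarded.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Span:  # hypothetical stand-in for RecognizerResult
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[Span]) -> List[Span]:
    # Highest score first; ties broken by earlier start, then longer span
    ordered = sorted(set(results),
                     key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept: List[Span] = []
    for result in ordered:
        if result.score == 0:
            continue
        if any(result.contained_in(k) and result.entity_type == k.entity_type
               for k in kept):
            continue
        kept.append(result)
    return kept

spans = [Span("X", 0, 10, 0.9), Span("X", 2, 5, 0.5),
         Span("Y", 2, 5, 0.5), Span("X", 0, 10, 0.9)]
print(remove_duplicates(spans))  # duplicate and contained "X" spans are dropped
```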

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string using the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
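The same logic, repeated standalone with a concrete input, shows how the default replacement pairs of the recognizers above strip separators before validation:

```python
def sanitize_value(text, replacement_pairs):
    # Apply each (search, replacement) pair in order
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

print(sanitize_value("12 345-678_90", [("-", ""), (" ", ""), ("_", "")]))
# "1234567890"
```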

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Check whether a result should be invalidated by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/it_vat_code.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic, e.g., by running a checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """

    # Pre-processing before validation checks
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)

    # Edge-case that passes the checksum even though it is not a
    # valid Italian VAT code.
    if text == "00000000000":
        return False

    x = 0
    y = 0

    for i in range(0, 5):
        x += int(text[2 * i])
        tmp_y = int(text[2 * i + 1]) * 2
        if tmp_y > 9:
            tmp_y = tmp_y - 9
        y += tmp_y

    t = (x + y) % 10
    c = (10 - t) % 10

    if c == int(text[10]):
        result = True
    else:
        result = False

    return result

MedicalLicenseRecognizer

Bases: PatternRecognizer

Recognize common Medical license numbers using regex + checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'MEDICAL_LICENSE'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as argument.

invalidate_result

Check whether a result should be invalidated by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/medical_license_recognizer.py
class MedicalLicenseRecognizer(PatternRecognizer):
    """
    Recognize common Medical license numbers using regex + checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "USA DEA Certificate Number (weak)",
            r"[abcdefghjklmprstuxABCDEFGHJKLMPRSTUX]{1}[a-zA-Z]{1}\d{7}|"
            r"[abcdefghjklmprstuxABCDEFGHJKLMPRSTUX]{1}9\d{7}",
            0.4,
        ),
    ]

    CONTEXT = ["medical", "certificate", "DEA"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "MEDICAL_LICENSE",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:  # noqa D102
        sanitized_value = EntityRecognizer.sanitize_value(
            pattern_text, self.replacement_pairs
        )
        checksum = self.__luhn_checksum(sanitized_value)

        return checksum

    @staticmethod
    def __luhn_checksum(sanitized_value: str) -> bool:
        def digits_of(n: str) -> List[int]:
            return [int(dig) for dig in str(n)]

        digits = digits_of(sanitized_value[2:])
        checksum = digits.pop()
        even_digits = digits[-1::-2]
        odd_digits = digits[-2::-2]
        checksum *= -1
        checksum += 2 * sum(even_digits) + sum(odd_digits)
        return checksum % 10 == 0
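The check-digit arithmetic in `__luhn_checksum` can be sketched standalone, mirroring the code above: the two-letter prefix is skipped, the last digit serves as the check digit, and every second digit from the right is doubled (an illustrative re-implementation, not the recognizer itself):

```python
def dea_style_checksum(value: str) -> bool:
    digits = [int(d) for d in value[2:]]  # skip the two leading letters
    check = digits.pop()                  # last digit is the check digit
    doubled = 2 * sum(digits[-1::-2])     # every second digit from the right, doubled
    plain = sum(digits[-2::-2])           # the remaining digits
    return (doubled + plain - check) % 10 == 0

print(dea_style_checksum("AB1234563"))  # True
print(dea_style_checksum("AB1234567"))  # False
```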

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string using the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Check whether a result should be invalidated by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Check whether a result should be invalidated by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

PhoneRecognizer

Bases: LocalRecognizer

Recognize multi-regional phone numbers.

Using python-phonenumbers, along with fixed and regional context words.

PARAMETER DESCRIPTION
context

Base context words for enhancing the assurance scores.

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_regions

The regions for phone number matching and validation

DEFAULT: DEFAULT_SUPPORTED_REGIONS

leniency

The strictness level of phone number formats. Accepts values from 0 to 3, where 0 is the most lenient and 3 is the strictest.

TYPE: Optional[int] DEFAULT: 1

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

analyze

Analyzes text to detect phone numbers using python-phonenumbers.

Source code in presidio_analyzer/predefined_recognizers/phone_recognizer.py
class PhoneRecognizer(LocalRecognizer):
    """Recognize multi-regional phone numbers.

     Using python-phonenumbers, along with fixed and regional context words.
    :param context: Base context words for enhancing the assurance scores.
    :param supported_language: Language this recognizer supports
    :param supported_regions: The regions for phone number matching and validation
    :param leniency: The strictness level of phone number formats.
    Accepts values from 0 to 3, where 0 is the lenient and 3 is the most strictest.
    """

    SCORE = 0.4
    CONTEXT = ["phone", "number", "telephone", "cell", "cellphone", "mobile", "call"]
    DEFAULT_SUPPORTED_REGIONS = ("US", "UK", "DE", "FE", "IL", "IN", "CA", "BR")

    def __init__(
        self,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        # For all regions, use phonenumbers.SUPPORTED_REGIONS
        supported_regions=DEFAULT_SUPPORTED_REGIONS,
        leniency: Optional[int] = 1,
    ):
        context = context if context else self.CONTEXT
        self.supported_regions = supported_regions
        self.leniency = leniency
        super().__init__(
            supported_entities=self.get_supported_entities(),
            supported_language=supported_language,
            context=context,
        )

    def load(self) -> None:  # noqa D102
        pass

    def get_supported_entities(self):  # noqa D102
        return ["PHONE_NUMBER"]

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts = None
    ) -> List[RecognizerResult]:
        """Analyzes text to detect phone numbers using python-phonenumbers.

        Iterates over entities, fetching regions, then matching regional
        phone numbers patterns against the text.
        :param text: Text to be analyzed
        :param entities: Entities this recognizer can detect
        :param nlp_artifacts: Additional metadata from the NLP engine
        :return: List of phone numbers RecognizerResults
        """
        results = []
        for region in self.supported_regions:
            for match in phonenumbers.PhoneNumberMatcher(
                text, region, leniency=self.leniency
            ):
                try:
                    parsed_number = phonenumbers.parse(text[match.start:match.end])
                    region = phonenumbers.region_code_for_number(parsed_number)
                    results += [
                        self._get_recognizer_result(match, text, region, nlp_artifacts)
                    ]
                except NumberParseException:
                    results += [
                        self._get_recognizer_result(match, text, region, nlp_artifacts)
                    ]

        return EntityRecognizer.remove_duplicates(results)

    def _get_recognizer_result(self, match, text, region, nlp_artifacts):
        result = RecognizerResult(
            entity_type="PHONE_NUMBER",
            start=match.start,
            end=match.end,
            score=self.SCORE,
            analysis_explanation=self._get_analysis_explanation(region),
            recognition_metadata={
                RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
            },
        )

        return result

    def _get_analysis_explanation(self, region):
        return AnalysisExplanation(
            recognizer=PhoneRecognizer.__name__,
            original_score=self.SCORE,
            textual_explanation=f"Recognized as {region} region phone number, "
            f"using PhoneRecognizer",
        )

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
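The base implementation is a no-op; context-aware enhancers such as LemmaContextAwareEnhancer raise a result's score when context words appear near the entity. A minimal standalone sketch of that idea (not Presidio's actual enhancer; window size and boost value are arbitrary assumptions):

```python
def boost_score_with_context(text, start, end, score,
                             context_words, boost=0.35, window=35, max_score=1.0):
    """Raise a detection score if a context word occurs within `window`
    characters before the entity span. Standalone illustration only."""
    prefix = text[max(0, start - window):start].lower()
    if any(word.lower() in prefix for word in context_words):
        return min(max_score, score + boost)
    return score
```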

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start and end positions and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
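The filtering above can be illustrated with a standalone sketch. The `Span` class below is a minimal stand-in for `RecognizerResult` (not the real class), and `dedupe` mirrors the sort-then-containment logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Minimal stand-in for RecognizerResult (illustration only)."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end

def dedupe(spans):
    """Mirror remove_duplicates: drop zero-score spans and spans of the
    same type that are fully contained in a higher-ranked span."""
    spans = sorted(set(spans), key=lambda s: (-s.score, s.start, -(s.end - s.start)))
    kept = []
    for span in spans:
        if span.score == 0:
            continue
        if any(span.contained_in(k) and span.entity_type == k.entity_type
               for k in kept):
            continue
        kept.append(span)
    return kept
```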

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
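Validating recognizers typically use this to strip separators before running a checksum. A self-contained example replicating the method's logic on a credit-card-style string (the standalone `sanitize_value` function below restates the code above):

```python
def sanitize_value(text, replacement_pairs):
    """Apply each (search, replacement) pair in order, as in the method above."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Typical use: strip dashes and spaces before a checksum.
digits_only = sanitize_value("4095-2609-9393-4932", [("-", ""), (" ", "")])
```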

analyze

analyze(
    text: str, entities: List[str], nlp_artifacts: NlpArtifacts = None
) -> List[RecognizerResult]

Analyzes text to detect phone numbers using python-phonenumbers.

Iterates over the supported regions, matching each region's phone number patterns against the text.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Additional metadata from the NLP engine

TYPE: NlpArtifacts DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

List of phone numbers RecognizerResults

Source code in presidio_analyzer/predefined_recognizers/phone_recognizer.py
def analyze(
    self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts = None
) -> List[RecognizerResult]:
    """Analyzes text to detect phone numbers using python-phonenumbers.

    Iterates over entities, fetching regions, then matching regional
    phone numbers patterns against the text.
    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Additional metadata from the NLP engine
    :return: List of phone numbers RecognizerResults
    """
    results = []
    for region in self.supported_regions:
        for match in phonenumbers.PhoneNumberMatcher(
            text, region, leniency=self.leniency
        ):
            try:
                parsed_number = phonenumbers.parse(text[match.start:match.end])
                region = phonenumbers.region_code_for_number(parsed_number)
                results += [
                    self._get_recognizer_result(match, text, region, nlp_artifacts)
                ]
            except NumberParseException:
                results += [
                    self._get_recognizer_result(match, text, region, nlp_artifacts)
                ]

    return EntityRecognizer.remove_duplicates(results)

PlPeselRecognizer

Bases: PatternRecognizer

Recognize PESEL number using regex and checksum.

For more information about PESEL: https://en.wikipedia.org/wiki/PESEL

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'pl'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'PL_PESEL'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/pl_pesel_recognizer.py
class PlPeselRecognizer(PatternRecognizer):
    """
    Recognize PESEL number using regex and checksum.

    For more information about PESEL: https://en.wikipedia.org/wiki/PESEL

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "PESEL",
            r"[0-9]{2}([02468][1-9]|[13579][012])(0[1-9]|1[0-9]|2[0-9]|3[01])[0-9]{5}",
            0.4,
        ),
    ]

    CONTEXT = ["PESEL"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "pl",
        supported_entity: str = "PL_PESEL",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:  # noqa D102
        digits = [int(digit) for digit in pattern_text]
        weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]

        checksum = sum(digit * weight for digit, weight in zip(digits[:10], weights))
        checksum %= 10

        return checksum == digits[10]
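For reference, a standalone sketch of the standard PESEL check-digit computation as described in the linked Wikipedia article (the `pesel_checksum_valid` helper is illustrative and independent of the class above):

```python
def pesel_checksum_valid(pesel: str) -> bool:
    """Validate a PESEL check digit per the standard definition:
    weights 1,3,7,9 repeated over the first ten digits, and the check
    digit equals (10 - weighted_sum % 10) % 10. Standalone sketch."""
    if len(pesel) != 11 or not pesel.isdigit():
        return False
    digits = [int(d) for d in pesel]
    weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]
    control = (10 - sum(d * w for d, w in zip(digits[:10], weights)) % 10) % 10
    return control == digits[10]
```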

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
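The regex path can be sketched in plain Python: each pattern is a (name, regex, score) triple, mirroring `presidio_analyzer.Pattern`. The default flags below (`IGNORECASE | DOTALL | MULTILINE`) are an assumption about Presidio's defaults, and `analyze_patterns` is an illustrative helper, not the private `__analyze_patterns`:

```python
import re

def analyze_patterns(text, patterns,
                     flags=re.IGNORECASE | re.DOTALL | re.MULTILINE):
    """Standalone sketch of regex-based PII detection."""
    results = []
    for name, regex, score in patterns:
        for match in re.finditer(regex, text, flags=flags):
            start, end = match.span()
            if start != end:  # skip empty matches
                results.append({"pattern": name, "start": start,
                                "end": end, "score": score})
    return results
```

A hypothetical low-confidence pattern, e.g. `("zipcode (weak)", r"\b\d{5}\b", 0.01)`, would then yield one result per five-digit run in the text.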

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
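Together, to_dict and from_dict let a recognizer round-trip through a plain dict (e.g. YAML or JSON configuration). A minimal stand-in illustrating the same pattern (TinyPattern is hypothetical, not a Presidio class):

```python
class TinyPattern:
    """Minimal stand-in showing the to_dict/from_dict round trip."""
    def __init__(self, name, regex, score):
        self.name, self.regex, self.score = name, regex, score

    def to_dict(self):
        return {"name": self.name, "regex": self.regex, "score": self.score}

    @classmethod
    def from_dict(cls, d):
        # As in from_dict above: the dict keys match __init__'s parameters.
        return cls(**d)

original = TinyPattern("PESEL", r"[0-9]{11}", 0.4)
restored = TinyPattern.from_dict(original.to_dict())
```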

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start and end positions and entity types.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
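The SSN example from the docstring can be made concrete with a standalone pruning rule (the `invalidate_same_digit_groups` helper is hypothetical, not Presidio's actual US SSN logic):

```python
def invalidate_same_digit_groups(pattern_text: str) -> bool:
    """Invalidate a match if any hyphen-separated group consists of a
    single repeated digit, e.g. '000' in '000-12-3456'. Sketch only."""
    groups = pattern_text.split("-")
    return any(len(set(group)) == 1 for group in groups if group)
```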

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

SgFinRecognizer

Bases: PatternRecognizer

Recognize SG FIN/NRIC number using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'SG_NRIC_FIN'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/sg_fin_recognizer.py
class SgFinRecognizer(PatternRecognizer):
    """
    Recognize SG FIN/NRIC number using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern("Nric (weak)", r"(?i)(\b[A-Z][0-9]{7}[A-Z]\b)", 0.3),
        Pattern("Nric (medium)", r"(?i)(\b[STFGM][0-9]{7}[A-Z]\b)", 0.5),
    ]

    CONTEXT = ["fin", "fin#", "nric", "nric#"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "SG_NRIC_FIN",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
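The two patterns above can be exercised directly with Python's `re` module. Note that any match of the medium pattern also matches the weak one (since `[STFGM]` is a subset of `[A-Z]`); in Presidio, remove_duplicates keeps the higher-scoring result. The `nric_score` helper below is illustrative only:

```python
import re

# The two patterns quoted above, weak and medium.
WEAK = re.compile(r"(?i)(\b[A-Z][0-9]{7}[A-Z]\b)")
MEDIUM = re.compile(r"(?i)(\b[STFGM][0-9]{7}[A-Z]\b)")

def nric_score(candidate: str):
    """Return the strongest matching pattern's score, or None (sketch)."""
    if MEDIUM.fullmatch(candidate):
        return 0.5
    if WEAK.fullmatch(candidate):
        return 0.3
    return None
```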

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of RecognizerResult objects
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
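
The pattern-scanning step that `analyze` delegates to can be sketched as follows. This is an illustrative reimplementation, not presidio's actual private `__analyze_patterns`: each `(name, regex, score)` pattern is scanned over the text and every match becomes a span with the pattern's confidence score.

```python
import re

# Patterns taken from SgNricFinRecognizer.PATTERNS above.
patterns = [
    ("Nric (weak)", r"(?i)\b[A-Z][0-9]{7}[A-Z]\b", 0.3),
    ("Nric (medium)", r"(?i)\b[STFGM][0-9]{7}[A-Z]\b", 0.5),
]

def analyze_patterns(text):
    """Return (pattern_name, start, end, score) for every regex match."""
    results = []
    for name, regex, score in patterns:
        for match in re.finditer(regex, text):
            results.append((name, match.start(), match.end(), score))
    return results

spans = analyze_patterns("ID S1234567D on file")
print(spans)  # same span reported twice, once per matching pattern
```

Downstream, `remove_duplicates` collapses overlapping spans of the same entity type, keeping the highest-scoring one.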

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value equals the raw results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class when custom logic
    is needed; otherwise the return value equals the raw results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
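
The filtering above can be demonstrated with a minimal stand-in for `RecognizerResult` (a hypothetical `Span` class, for illustration only): sort by descending score, then drop zero-score results and any result contained in an already-kept result of the same entity type.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for RecognizerResult; frozen so it is hashable."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other):
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    # Sort: highest score first, then leftmost, then longest.
    results = sorted(set(results), key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for r in results:
        if r.score == 0:
            continue
        # Drop results contained in an already-kept result of the same type.
        if any(r.contained_in(f) and r.entity_type == f.entity_type for f in filtered):
            continue
        filtered.append(r)
    return filtered

spans = [
    Span("SG_NRIC_FIN", 3, 12, 0.5),
    Span("SG_NRIC_FIN", 3, 12, 0.5),  # exact duplicate, removed by set()
    Span("SG_NRIC_FIN", 5, 10, 0.3),  # contained, same type: dropped
]
print(remove_duplicates(spans))  # only the widest, highest-scoring span survives
```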

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
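
A small usage sketch of the logic above: applying replacement pairs to strip separators from a candidate value before pattern validation. The replacement pairs here are illustrative, not a preset from the library.

```python
def sanitize_value(text, replacement_pairs):
    """Apply each (search, replacement) pair in order, as in the method above."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# Strip dashes and spaces from a made-up identifier.
cleaned = sanitize_value("S-1234567 D", [("-", ""), (" ", "")])
print(cleaned)  # S1234567D
```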

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic, e.g., by running a checksum on a detected pattern.

    :param pattern_text: the text to be validated;
    only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated;
    only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
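
An illustrative invalidation check in the spirit of the SSN example above (a sketch, not presidio's SSN recognizer): a group consisting of a single repeated digit is treated as invalid, pruning the result.

```python
def invalidate_result(pattern_text):
    """Return True if any dash-separated group is one digit repeated."""
    groups = pattern_text.split("-")
    return any(len(set(g)) == 1 for g in groups)

print(invalidate_result("123-45-6789"))  # False: no group is all the same digit
print(invalidate_result("000-12-3456"))  # True: "000" is all the same digit
```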

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

SgUenRecognizer

Bases: PatternRecognizer

Recognize Singapore UEN (Unique Entity Number) using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'SG_UEN'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

validate_uen_format_a

Validate the UEN format A using checksum.

validate_uen_format_b

Validate the UEN format B using checksum.

validate_uen_format_c

Validate the UEN format C using checksum.

Source code in presidio_analyzer/predefined_recognizers/sg_uen_recognizer.py
class SgUenRecognizer(PatternRecognizer):
    """
    Recognize Singapore UEN (Unique Entity Number) using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "UEN (low)",
            r"\b\d{8}[A-Z]\b|\b\d{9}[A-Z]\b|\b(T|S)\d{2}[A-Z]{2}\d{4}[A-Z]\b",
            0.3,
        )
    ]

    CONTEXT = ["uen", "unique entity number", "business registration", "ACRA"]

    UEN_FORMAT_A_WEIGHT = (10, 4, 9, 3, 8, 2, 7, 1)
    UEN_FORMAT_A_ALPHABET = "XMKECAWLJDB"
    UEN_FORMAT_B_WEIGHT = (10, 8, 6, 4, 9, 7, 5, 3, 1)
    UEN_FORMAT_B_ALPHABET = "ZKCMDNERGWH"
    UEN_FORMAT_C_WEIGHT = (4, 3, 5, 3, 10, 2, 2, 5, 7)
    UEN_FORMAT_C_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWX0123456789"
    UEN_FORMAT_C_PREFIX = {"T", "S", "R"}
    UEN_FORMAT_C_ENTITY_TYPE = {
        "LP",
        "LL",
        "FC",
        "PF",
        "RF",
        "MQ",
        "MM",
        "NB",
        "CC",
        "CS",
        "MB",
        "FM",
        "GS",
        "DP",
        "CP",
        "NR",
        "CM",
        "CD",
        "MD",
        "HS",
        "VH",
        "CH",
        "MH",
        "CL",
        "XL",
        "CX",
        "HC",
        "RP",
        "TU",
        "TC",
        "FB",
        "FN",
        "PA",
        "PB",
        "SS",
        "MC",
        "SM",
        "GA",
        "GB",
    }

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "SG_UEN",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Validate the pattern logic, e.g., by running a checksum on a detected pattern.

        :param pattern_text: the text to be validated;
        only the part of the text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """

        if len(pattern_text) == 9:
            # Checksum validation for UEN format A
            return self.validate_uen_format_a(pattern_text)
        elif len(pattern_text) == 10 and pattern_text[0].isalpha():
            # Checksum validation for UEN format C
            return self.validate_uen_format_c(pattern_text)
        elif len(pattern_text) == 10:
            # Checksum validation for UEN format B
            return self.validate_uen_format_b(pattern_text)

        return False

    @staticmethod
    def validate_uen_format_a(uen: str) -> bool:
        """
        Validate the UEN format A using checksum.

        :param uen: The UEN to validate.
        :return: True if the UEN is valid according to its respective
        format, False otherwise.
        """
        check_digit = uen[-1]

        weighted_sum = sum(
            int(n) * w for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_A_WEIGHT)
        )

        checksum = SgUenRecognizer.UEN_FORMAT_A_ALPHABET[weighted_sum % 11]

        return check_digit == checksum

    @staticmethod
    def validate_uen_format_b(uen: str) -> bool:
        """
        Validate the UEN format B using checksum.

        :param uen: The UEN to validate.
        :return: True if the UEN is valid according to its respective
        format, False otherwise.
        """
        check_digit = uen[-1]
        year_of_registration = int(uen[0:4])

        # Check if the year of registration is not in the future
        if year_of_registration > date.today().year:
            return False

        weighted_sum = sum(
            int(n) * w for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_B_WEIGHT)
        )

        checksum = SgUenRecognizer.UEN_FORMAT_B_ALPHABET[weighted_sum % 11]

        return check_digit == checksum

    @staticmethod
    def validate_uen_format_c(uen: str) -> bool:
        """
        Validate the UEN format C using checksum.

        :param uen: The UEN to validate.
        :return: True if the UEN is valid according to its respective
        format, False otherwise.
        """
        check_digit = uen[-1]

        if uen[0] not in SgUenRecognizer.UEN_FORMAT_C_PREFIX:
            return False

        entity_type = uen[3:5]

        if entity_type not in SgUenRecognizer.UEN_FORMAT_C_ENTITY_TYPE:
            return False

        weighted_sum = sum(
            SgUenRecognizer.UEN_FORMAT_C_ALPHABET.index(n) * w
            for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_C_WEIGHT)
        )

        checksum = SgUenRecognizer.UEN_FORMAT_C_ALPHABET[(weighted_sum - 5) % 11]

        return check_digit == checksum
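
The format-A check character can be recomputed with the weights and alphabet defined on `SgUenRecognizer` above. The sample digit string below is made up, not a real UEN; the helper `format_a_check_char` is introduced here only for illustration.

```python
# Weights and alphabet copied from SgUenRecognizer above.
UEN_FORMAT_A_WEIGHT = (10, 4, 9, 3, 8, 2, 7, 1)
UEN_FORMAT_A_ALPHABET = "XMKECAWLJDB"

def format_a_check_char(digits):
    """Compute the format-A check character for an 8-digit string."""
    weighted_sum = sum(int(n) * w for n, w in zip(digits, UEN_FORMAT_A_WEIGHT))
    return UEN_FORMAT_A_ALPHABET[weighted_sum % 11]

def validate_uen_format_a(uen):
    """Mirror of SgUenRecognizer.validate_uen_format_a above."""
    return uen[-1] == format_a_check_char(uen[:-1])

print(format_a_check_char("12345678"))     # 'M' (weighted sum 166, 166 % 11 == 1)
print(validate_uen_format_a("12345678M"))  # True
print(validate_uen_format_a("12345678X"))  # False: wrong check character
```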

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of RecognizerResult objects
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class when custom logic is needed; otherwise the return value equals the raw results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class when custom logic
    is needed; otherwise the return value equals the raw results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The NLP artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated;
    only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/sg_uen_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """

    if len(pattern_text) == 9:
        # Checksum validation for UEN format A
        return self.validate_uen_format_a(pattern_text)
    elif len(pattern_text) == 10 and pattern_text[0].isalpha():
        # Checksum validation for UEN format C
        return self.validate_uen_format_c(pattern_text)
    elif len(pattern_text) == 10:
        # Checksum validation for UEN format B
        return self.validate_uen_format_b(pattern_text)

    return False

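The length-based dispatch above can be sketched on its own. Below is a minimal classifier mirroring that logic; the sample UEN strings are made-up illustrations and not necessarily checksum-valid:

```python
def classify_uen(uen: str) -> str:
    """Mirror the dispatch in SgUenRecognizer.validate_result:
    9 characters -> format A, 10 characters starting with a letter
    -> format C, any other 10-character string -> format B."""
    if len(uen) == 9:
        return "A"
    if len(uen) == 10 and uen[0].isalpha():
        return "C"
    if len(uen) == 10:
        return "B"
    return "invalid"

print(classify_uen("12345678A"))   # format A: 8 digits + check character
print(classify_uen("T09LL0001B"))  # format C: letter prefix, 10 characters
print(classify_uen("2019123456"))  # format B: 10 characters, digit first
```

Note that the letter-prefix check for format C runs before the plain 10-character check, so format B only catches 10-character strings that start with a digit.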
validate_uen_format_a staticmethod

validate_uen_format_a(uen: str) -> bool

Validate the UEN format A using checksum.

PARAMETER DESCRIPTION
uen

The UEN to validate.

TYPE: str

RETURNS DESCRIPTION
bool

True if the UEN is valid according to its respective format, False otherwise.

Source code in presidio_analyzer/predefined_recognizers/sg_uen_recognizer.py
@staticmethod
def validate_uen_format_a(uen: str) -> bool:
    """
    Validate the UEN format A using checksum.

    :param uen: The UEN to validate.
    :return: True if the UEN is valid according to its respective
    format, False otherwise.
    """
    check_digit = uen[-1]

    weighted_sum = sum(
        int(n) * w for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_A_WEIGHT)
    )

    checksum = SgUenRecognizer.UEN_FORMAT_A_ALPHABET[weighted_sum % 11]

    return check_digit == checksum

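The format A check is a common mod-11 scheme: weight each leading character, sum, and index the remainder into a checksum alphabet. The sketch below shows the shape of that computation; `WEIGHTS` and `ALPHABET` are illustrative placeholders, not Presidio's actual `UEN_FORMAT_A_WEIGHT` and `UEN_FORMAT_A_ALPHABET` constants:

```python
# Illustrative mod-11 checksum; WEIGHTS and ALPHABET are made-up
# placeholders, not the real SgUenRecognizer constants.
WEIGHTS = [10, 4, 9, 3, 8, 2, 7, 1]
ALPHABET = "XMKECAWLJDB"  # 11 characters, one per remainder 0..10

def check_digit(body: str) -> str:
    # Weighted sum of the leading digits, reduced mod 11, picks the
    # check character out of the alphabet.
    weighted_sum = sum(int(n) * w for n, w in zip(body, WEIGHTS))
    return ALPHABET[weighted_sum % 11]

def validate(uen: str) -> bool:
    # Valid when the last character matches the recomputed checksum.
    return uen[-1] == check_digit(uen[:-1])

body = "12345678"
print(validate(body + check_digit(body)))  # True by construction
```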
validate_uen_format_b staticmethod

validate_uen_format_b(uen: str) -> bool

Validate the UEN format B using checksum.

PARAMETER DESCRIPTION
uen

The UEN to validate.

TYPE: str

RETURNS DESCRIPTION
bool

True if the UEN is valid according to its respective format, False otherwise.

Source code in presidio_analyzer/predefined_recognizers/sg_uen_recognizer.py
@staticmethod
def validate_uen_format_b(uen: str) -> bool:
    """
    Validate the UEN format B using checksum.

    :param uen: The UEN to validate.
    :return: True if the UEN is valid according to its respective
    format, False otherwise.
    """
    check_digit = uen[-1]
    year_of_registration = int(uen[0:4])

    # Check if the year of registration is not in the future
    if year_of_registration > date.today().year:
        return False

    weighted_sum = sum(
        int(n) * w for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_B_WEIGHT)
    )

    checksum = SgUenRecognizer.UEN_FORMAT_B_ALPHABET[weighted_sum % 11]

    return check_digit == checksum

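Format B additionally embeds a four-digit registration year at the start of the UEN, and a year in the future is rejected before the checksum ever runs. That guard in isolation:

```python
from datetime import date

def year_plausible(uen: str) -> bool:
    # Format B UENs start with a four-digit registration year;
    # a year after the current one cannot be valid.
    return int(uen[:4]) <= date.today().year

print(year_plausible("2019123456"))  # True
print(year_plausible("9999123456"))  # False
```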
validate_uen_format_c staticmethod

validate_uen_format_c(uen: str) -> bool

Validate the UEN format C using checksum.

PARAMETER DESCRIPTION
uen

The UEN to validate.

TYPE: str

RETURNS DESCRIPTION
bool

True if the UEN is valid according to its respective format, False otherwise.

Source code in presidio_analyzer/predefined_recognizers/sg_uen_recognizer.py
@staticmethod
def validate_uen_format_c(uen: str) -> bool:
    """
    Validate the UEN format C using checksum.

    :param uen: The UEN to validate.
    :return: True if the UEN is valid according to its respective
    format, False otherwise.
    """
    check_digit = uen[-1]

    if uen[0] not in SgUenRecognizer.UEN_FORMAT_C_PREFIX:
        return False

    entity_type = uen[3:5]

    if entity_type not in SgUenRecognizer.UEN_FORMAT_C_ENTITY_TYPE:
        return False

    weighted_sum = sum(
        SgUenRecognizer.UEN_FORMAT_C_ALPHABET.index(n) * w
        for n, w in zip(uen[:-1], SgUenRecognizer.UEN_FORMAT_C_WEIGHT)
    )

    checksum = SgUenRecognizer.UEN_FORMAT_C_ALPHABET[(weighted_sum - 5) % 11]

    return check_digit == checksum

SpacyRecognizer

Bases: LocalRecognizer

Recognize PII entities using a spaCy NLP model.

Since the spaCy pipeline is run by the AnalyzerEngine/SpacyNlpEngine,
this recognizer only extracts the entities from the NlpArtifacts
and returns them.
METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

build_explanation

Create explanation for why this result was detected.

Source code in presidio_analyzer/predefined_recognizers/spacy_recognizer.py
class SpacyRecognizer(LocalRecognizer):
    """
    Recognize PII entities using a spaCy NLP model.

        Since the spaCy pipeline is run by the AnalyzerEngine/SpacyNlpEngine,
        this recognizer only extracts the entities from the NlpArtifacts
        and returns them.

    """

    ENTITIES = ["DATE_TIME", "NRP", "LOCATION", "PERSON", "ORGANIZATION"]

    DEFAULT_EXPLANATION = "Identified as {} by Spacy's Named Entity Recognition"

    # deprecated, use MODEL_TO_PRESIDIO_MAPPING in NerModelConfiguration instead
    CHECK_LABEL_GROUPS = [
        ({"LOCATION"}, {"GPE", "LOC"}),
        ({"PERSON", "PER"}, {"PERSON", "PER"}),
        ({"DATE_TIME"}, {"DATE", "TIME"}),
        ({"NRP"}, {"NORP"}),
        ({"ORGANIZATION"}, {"ORG"}),
    ]

    def __init__(
        self,
        supported_language: str = "en",
        supported_entities: Optional[List[str]] = None,
        ner_strength: float = 0.85,
        default_explanation: Optional[str] = None,
        check_label_groups: Optional[List[Tuple[Set, Set]]] = None,
        context: Optional[List[str]] = None,
    ):
        """

        :param supported_language: Language this recognizer supports
        :param supported_entities: The entities this recognizer can detect
        :param ner_strength: Default confidence for NER prediction
        :param check_label_groups: (DEPRECATED) Tuple containing Presidio entity names
        :param default_explanation: Default explanation for the results when using return_decision_process=True
        """  # noqa E501

        self.ner_strength = ner_strength
        if check_label_groups:
            warnings.warn(
                "check_label_groups is deprecated and isn't used; "
                "entities are mapped in NerModelConfiguration",
                DeprecationWarning,
                2,
            )

        self.default_explanation = (
            default_explanation if default_explanation else self.DEFAULT_EXPLANATION
        )
        supported_entities = supported_entities if supported_entities else self.ENTITIES
        super().__init__(
            supported_entities=supported_entities,
            supported_language=supported_language,
            context=context,
        )

    def load(self) -> None:  # noqa D102
        # no need to load anything as the analyze method already receives
        # preprocessed nlp artifacts
        pass

    def build_explanation(
        self, original_score: float, explanation: str
    ) -> AnalysisExplanation:
        """
        Create explanation for why this result was detected.

        :param original_score: Score given by this recognizer
        :param explanation: Explanation string
        :return:
        """
        explanation = AnalysisExplanation(
            recognizer=self.name,
            original_score=original_score,
            textual_explanation=explanation,
        )
        return explanation

    def analyze(self, text: str, entities, nlp_artifacts=None):  # noqa D102
        results = []
        if not nlp_artifacts:
            logger.warning("Skipping SpaCy, nlp artifacts not provided...")
            return results

        ner_entities = nlp_artifacts.entities
        ner_scores = nlp_artifacts.scores

        for ner_entity, ner_score in zip(ner_entities, ner_scores):
            if (
                ner_entity.label_ not in entities
                or ner_entity.label_ not in self.supported_entities
            ):
                logger.debug(
                    f"Skipping entity {ner_entity.label_} "
                    f"as it is not in the supported entities list"
                )
                continue

            textual_explanation = self.DEFAULT_EXPLANATION.format(ner_entity.label_)
            explanation = self.build_explanation(ner_score, textual_explanation)
            spacy_result = RecognizerResult(
                entity_type=ner_entity.label_,
                start=ner_entity.start_char,
                end=ner_entity.end_char,
                score=ner_score,
                analysis_explanation=explanation,
                recognition_metadata={
                    RecognizerResult.RECOGNIZER_NAME_KEY: self.name,
                    RecognizerResult.RECOGNIZER_IDENTIFIER_KEY: self.id,
                },
            )
            results.append(spacy_result)

        return results

    @staticmethod
    def __check_label(
        entity: str, label: str, check_label_groups: Tuple[Set, Set]
    ) -> bool:
        raise DeprecationWarning("__check_label is deprecated")

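Since the NLP pipeline has already run, `SpacyRecognizer.analyze` is essentially a filter over pre-computed NER spans. A self-contained sketch of that filtering step, using a hypothetical `Span` stand-in in place of spaCy's entity spans:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Span:
    """Stand-in for a spaCy entity span (label_, start_char, end_char)."""
    label_: str
    start_char: int
    end_char: int

def extract(spans: List[Span], scores: List[float],
            requested: List[str], supported: List[str]) -> List[Tuple]:
    # Keep only spans whose label is both requested by the caller and
    # supported by the recognizer, pairing each kept span with its score.
    return [
        (s.label_, s.start_char, s.end_char, score)
        for s, score in zip(spans, scores)
        if s.label_ in requested and s.label_ in supported
    ]

spans = [Span("PERSON", 0, 4), Span("ORG", 15, 24), Span("LOCATION", 30, 36)]
scores = [0.92, 0.88, 0.75]
print(extract(spans, scores,
              requested=["PERSON", "LOCATION"],
              supported=["PERSON", "LOCATION", "NRP"]))
```

The "ORG" span is dropped because the caller did not request it, even though a span's label being outside the supported list would drop it just the same.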
id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will equal
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

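`to_dict` and `from_dict` form a simple serialization round trip: the dictionary keys map directly onto constructor arguments. A sketch with a minimal stand-in class (hypothetical, not the real `EntityRecognizer`):

```python
class MiniRecognizer:
    """Minimal stand-in illustrating the to_dict/from_dict round trip."""

    def __init__(self, supported_entities, supported_language="en"):
        self.supported_entities = supported_entities
        self.supported_language = supported_language

    def to_dict(self):
        # Serialize the configuration needed to rebuild the instance.
        return {
            "supported_entities": self.supported_entities,
            "supported_language": self.supported_language,
        }

    @classmethod
    def from_dict(cls, d):
        # Same idea as EntityRecognizer.from_dict: dictionary keys
        # become constructor keyword arguments.
        return cls(**d)

rec = MiniRecognizer(["PERSON"], "en")
clone = MiniRecognizer.from_dict(rec.to_dict())
print(clone.to_dict() == rec.to_dict())  # True
```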
remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

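The ordering in `remove_duplicates` matters: results are sorted by descending score first, so a lower-scoring result of the same entity type that is contained inside an already-kept result is discarded. A runnable sketch with a hypothetical stand-in result class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    """Stand-in for RecognizerResult with the fields dedup relies on."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results):
    # Same strategy as EntityRecognizer.remove_duplicates: drop exact
    # duplicates via set(), sort by score desc / position / length desc,
    # skip zero-score results, and discard results contained in an
    # already-kept result of the same entity type.
    ordered = sorted(set(results),
                     key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept = []
    for r in ordered:
        if r.score == 0:
            continue
        if any(r.contained_in(k) and r.entity_type == k.entity_type
               for k in kept):
            continue
        kept.append(r)
    return kept

full = Result("PERSON", 0, 10, 0.9)
inner = Result("PERSON", 2, 8, 0.5)   # contained in `full`, same type
print(remove_duplicates([full, inner, full]))  # only `full` survives
```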
sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

build_explanation

build_explanation(
    original_score: float, explanation: str
) -> AnalysisExplanation

Create explanation for why this result was detected.

PARAMETER DESCRIPTION
original_score

Score given by this recognizer

TYPE: float

explanation

Explanation string

TYPE: str

RETURNS DESCRIPTION
AnalysisExplanation
Source code in presidio_analyzer/predefined_recognizers/spacy_recognizer.py
def build_explanation(
    self, original_score: float, explanation: str
) -> AnalysisExplanation:
    """
    Create explanation for why this result was detected.

    :param original_score: Score given by this recognizer
    :param explanation: Explanation string
    :return:
    """
    explanation = AnalysisExplanation(
        recognizer=self.name,
        original_score=original_score,
        textual_explanation=explanation,
    )
    return explanation

StanzaRecognizer

Bases: SpacyRecognizer

Recognize entities using the Stanza NLP package.

See https://stanfordnlp.github.io/stanza/. Uses the spaCy-Stanza package (https://github.com/explosion/spacy-stanza) to align Stanza's interface with spaCy's.

METHOD DESCRIPTION
enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize self to dictionary.

from_dict

Create EntityRecognizer from a dict input.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

build_explanation

Create explanation for why this result was detected.

Source code in presidio_analyzer/predefined_recognizers/stanza_recognizer.py
class StanzaRecognizer(SpacyRecognizer):
    """
    Recognize entities using the Stanza NLP package.

    See https://stanfordnlp.github.io/stanza/.
    Uses the spaCy-Stanza package (https://github.com/explosion/spacy-stanza) to align
    Stanza's interface with spaCy's
    """

    def __init__(self, **kwargs):  # noqa ANN003
        self.DEFAULT_EXPLANATION = self.DEFAULT_EXPLANATION.replace("Spacy", "Stanza")
        super().__init__(**kwargs)

id property

id

Return a unique identifier of this recognizer.

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The nlp artifacts contain elements such as lemmatized tokens, used to improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will equal
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize self to dictionary.

RETURNS DESCRIPTION
Dict

a dictionary

Source code in presidio_analyzer/entity_recognizer.py
def to_dict(self) -> Dict:
    """
    Serialize self to dictionary.

    :return: a dictionary
    """
    return_dict = {
        "supported_entities": self.supported_entities,
        "supported_language": self.supported_language,
        "name": self.name,
        "version": self.version,
    }
    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> EntityRecognizer

Create EntityRecognizer from a dict input.

PARAMETER DESCRIPTION
entity_recognizer_dict

Dict containing keys and values for instantiation

TYPE: Dict

Source code in presidio_analyzer/entity_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "EntityRecognizer":
    """
    Create EntityRecognizer from a dict input.

    :param entity_recognizer_dict: Dict containing keys and values for instantiation
    """
    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

build_explanation

build_explanation(
    original_score: float, explanation: str
) -> AnalysisExplanation

Create explanation for why this result was detected.

PARAMETER DESCRIPTION
original_score

Score given by this recognizer

TYPE: float

explanation

Explanation string

TYPE: str

RETURNS DESCRIPTION
AnalysisExplanation
Source code in presidio_analyzer/predefined_recognizers/spacy_recognizer.py
def build_explanation(
    self, original_score: float, explanation: str
) -> AnalysisExplanation:
    """
    Create explanation for why this result was detected.

    :param original_score: Score given by this recognizer
    :param explanation: Explanation string
    :return:
    """
    explanation = AnalysisExplanation(
        recognizer=self.name,
        original_score=original_score,
        textual_explanation=explanation,
    )
    return explanation

NhsRecognizer

Bases: PatternRecognizer

Recognizes NHS number using regex and checksum.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'UK_NHS'

replacement_pairs

List of tuples with potential replacement values for different strings to be used during pattern matching. This can allow a greater variety in input, for example by removing dashes or spaces.

TYPE: Optional[List[Tuple[str, str]]] DEFAULT: None

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

Source code in presidio_analyzer/predefined_recognizers/uk_nhs_recognizer.py
class NhsRecognizer(PatternRecognizer):
    """
    Recognizes NHS number using regex and checksum.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    :param replacement_pairs: List of tuples with potential replacement values
    for different strings to be used during pattern matching.
    This can allow a greater variety in input, for example by removing dashes or spaces.
    """

    PATTERNS = [
        Pattern(
            "NHS (medium)",
            r"\b([0-9]{3})[- ]?([0-9]{3})[- ]?([0-9]{4})\b",
            0.5,
        ),
    ]

    CONTEXT = [
        "national health service",
        "nhs",
        "health services authority",
        "health authority",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "UK_NHS",
        replacement_pairs: Optional[List[Tuple[str, str]]] = None,
    ):
        self.replacement_pairs = (
            replacement_pairs if replacement_pairs else [("-", ""), (" ", "")]
        )
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def validate_result(self, pattern_text: str) -> bool:
        """
        Validate the pattern logic e.g., by running checksum on a detected pattern.

        :param pattern_text: the text to validated.
        Only the part in text that was detected by the regex engine
        :return: A bool indicating whether the validation was successful.
        """
        text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
        total = sum(
            [int(c) * multiplier for c, multiplier in zip(text, reversed(range(11)))]
        )
        remainder = total % 11
        check_remainder = remainder == 0

        return check_remainder
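
The checksum above can be exercised on its own: weights run 10 down to 1 over the ten digits, and a valid NHS number's weighted sum is divisible by 11. A minimal sketch, assuming a synthetic test value ("401 023 2137" is an illustrative number that satisfies this checksum, not one taken from this documentation):

```python
def nhs_checksum_ok(value: str) -> bool:
    """Mirror the recognizer's mod-11 validation on a 10-digit NHS number."""
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        return False
    # Weights 10..1 over the ten digits; a valid number sums to 0 mod 11.
    total = sum(int(c) * w for c, w in zip(digits, range(10, 0, -1)))
    return total % 11 == 0

print(nhs_checksum_ok("401 023 2137"))  # True
print(nhs_checksum_ok("401 023 2138"))  # False
```

Note that the recognizer first strips dashes and spaces via sanitize_value, which is why "401 023 2137" and "4010232137" validate identically.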

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts, containing elements such as lemmatized tokens for better accuracy of the context-enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
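
The filtering rules above (drop exact duplicates, drop zero-score results, then drop any result contained in an already-kept result of the same entity type) can be sketched with a minimal stand-in class; Span here is a hypothetical substitute for RecognizerResult, not part of presidio:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Span") -> bool:
        return self.start >= other.start and self.end <= other.end

def remove_duplicates(results: List[Span]) -> List[Span]:
    # Sort by score desc, then start asc, then length desc, so a containing
    # span is always considered before the spans nested inside it.
    results = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept = []
    for r in results:
        if r.score == 0:
            continue  # zero-confidence results are discarded outright
        if any(r.contained_in(k) and r.entity_type == k.entity_type for k in kept):
            continue  # contained in a stronger result of the same type
        kept.append(r)
    return kept

spans = [
    Span("PHONE", 0, 12, 0.7),
    Span("PHONE", 0, 12, 0.7),   # exact duplicate, removed by set()
    Span("PHONE", 4, 8, 0.5),    # contained in the stronger PHONE match
    Span("EMAIL", 20, 35, 0.0),  # zero score, dropped
]
print(remove_duplicates(spans))  # only the (0, 12) PHONE span survives
```

The sort order matters: because containers sort before contained spans, a single pass over the sorted list is sufficient.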

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs specifying the substring to search for and its replacement value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
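
Sanitization is a plain sequence of str.replace calls applied in order, so the NHS recognizer's default pairs can be reproduced directly:

```python
def sanitize_value(text: str, replacement_pairs) -> str:
    # Apply each (search, replacement) pair in the order given.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# The NHS recognizer's default pairs strip dashes and spaces.
print(sanitize_value("401-023 2137", [("-", ""), (" ", "")]))  # 4010232137
```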

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

validate_result

validate_result(pattern_text: str) -> bool

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
bool

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/predefined_recognizers/uk_nhs_recognizer.py
def validate_result(self, pattern_text: str) -> bool:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    text = EntityRecognizer.sanitize_value(pattern_text, self.replacement_pairs)
    total = sum(
        [int(c) * multiplier for c, multiplier in zip(text, reversed(range(11)))]
    )
    remainder = total % 11
    check_remainder = remainder == 0

    return check_remainder

UkNinoRecognizer

Bases: PatternRecognizer

Recognizes UK National Insurance Number using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'UK_NINO'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as arguments.

validate_result

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/uk_nino_recognizer.py
class UkNinoRecognizer(PatternRecognizer):
    """
    Recognizes UK National Insurance Number using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "NINO (medium)",
            r"\b(?!bg|gb|nk|kn|nt|tn|zz|BG|GB|NK|KN|NT|TN|ZZ) ?([a-ceghj-pr-tw-zA-CEGHJ-PR-TW-Z]{1}[a-ceghj-npr-tw-zA-CEGHJ-NPR-TW-Z]{1}) ?([0-9]{2}) ?([0-9]{2}) ?([0-9]{2}) ?([a-dA-D{1}])\b",  # noqa: E501
            0.5,
        ),
    ]

    CONTEXT = ["national insurance", "ni number", "nino"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "UK_NINO",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
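
The NINO pattern above can be exercised directly with Python's re module. A minimal sketch; "AB 12 34 56 C" is an illustrative value in a valid format, not an official example number:

```python
import re

# The exact pattern used by UkNinoRecognizer ("NINO (medium)").
NINO_PATTERN = r"\b(?!bg|gb|nk|kn|nt|tn|zz|BG|GB|NK|KN|NT|TN|ZZ) ?([a-ceghj-pr-tw-zA-CEGHJ-PR-TW-Z]{1}[a-ceghj-npr-tw-zA-CEGHJ-NPR-TW-Z]{1}) ?([0-9]{2}) ?([0-9]{2}) ?([0-9]{2}) ?([a-dA-D{1}])\b"

# Spaced and compact forms both match.
print(bool(re.search(NINO_PATTERN, "NINO: AB 12 34 56 C")))  # True
# Administratively invalid prefixes (e.g. GB) are rejected by the lookahead.
print(bool(re.search(NINO_PATTERN, "NINO: GB 12 34 56 C")))  # False
```

One observable quirk: the suffix character class is written as [a-dA-D{1}], so as a regex it also matches the literal characters {, 1, and } in the suffix position, not only A-D.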

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts, containing elements such as lemmatized tokens for better accuracy of the context-enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as arguments.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs specifying the substring to search for and its replacement value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UrlRecognizer

Bases: PatternRecognizer

Recognize urls using regex.

This application uses Open Source components: Project: CommonRegex (https://github.com/madisonmay/CommonRegex), Copyright (c) 2014 Madison May, MIT License (https://github.com/madisonmay/CommonRegex/blob/master/LICENSE).

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'URL'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as arguments.

validate_result

Validate the pattern logic, e.g., by running a checksum on the detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/url_recognizer.py
class UrlRecognizer(PatternRecognizer):
    """
    Recognize urls using regex.

    This application uses Open Source components:
    Project: CommonRegex https://github.com/madisonmay/CommonRegex
    Copyright (c) 2014 Madison May
    License (MIT)  https://github.com/madisonmay/CommonRegex/blob/master/LICENSE

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    BASE_URL_REGEX = r"((www\d{0,3}[.])?[a-z0-9.\-]+[.](?:(?:com)|(?:edu)|(?:gov)|(?:int)|(?:mil)|(?:net)|(?:onl)|(?:org)|(?:pro)|(?:red)|(?:tel)|(?:uno)|(?:xxx)|(?:academy)|(?:accountant)|(?:accountants)|(?:actor)|(?:adult)|(?:africa)|(?:agency)|(?:airforce)|(?:apartments)|(?:app)|(?:archi)|(?:army)|(?:art)|(?:asia)|(?:associates)|(?:attorney)|(?:auction)|(?:audio)|(?:auto)|(?:autos)|(?:baby)|(?:band)|(?:bar)|(?:bargains)|(?:beer)|(?:berlin)|(?:best)|(?:bet)|(?:bid)|(?:bike)|(?:bio)|(?:black)|(?:blackfriday)|(?:blog)|(?:blue)|(?:boats)|(?:bond)|(?:boo)|(?:boston)|(?:bot)|(?:boutique)|(?:build)|(?:builders)|(?:business)|(?:buzz)|(?:cab)|(?:cafe)|(?:cam)|(?:camera)|(?:camp)|(?:capital)|(?:car)|(?:cards)|(?:care)|(?:careers)|(?:cars)|(?:casa)|(?:cash)|(?:casino)|(?:catering)|(?:center)|(?:ceo)|(?:cfd)|(?:charity)|(?:chat)|(?:cheap)|(?:christmas)|(?:church)|(?:city)|(?:claims)|(?:cleaning)|(?:click)|(?:clinic)|(?:clothing)|(?:cloud)|(?:club)|(?:codes)|(?:coffee)|(?:college)|(?:com)|(?:community)|(?:company)|(?:computer)|(?:condos)|(?:construction)|(?:consulting)|(?:contact)|(?:contractors)|(?:cooking)|(?:cool)|(?:coupons)|(?:courses)|(?:credit)|(?:creditcard)|(?:cricket)|(?:cruises)|(?:cyou)|(?:dad)|(?:dance)|(?:date)|(?:dating)|(?:day)|(?:degree)|(?:delivery)|(?:democrat)|(?:dental)|(?:dentist)|(?:desi)|(?:design)|(?:dev)|(?:diamonds)|(?:diet)|(?:digital)|(?:direct)|(?:directory)|(?:discount)|(?:doctor)|(?:dog)|(?:domains)|(?:download)|(?:earth)|(?:eco)|(?:education)|(?:email)|(?:energy)|(?:engineer)|(?:engineering)|(?:enterprises)|(?:equipment)|(?:esq)|(?:estate)|(?:events)|(?:exchange)|(?:expert)|(?:exposed)|(?:express)|(?:fail)|(?:faith)|(?:family)|(?:fans)|(?:farm)|(?:fashion)|(?:feedback)|(?:film)|(?:finance)|(?:financial)|(?:fish)|(?:fishing)|(?:fit)|(?:fitness)|(?:flights)|(?:florist)|(?:flowers)|(?:football)|(?:forsale)|(?:foundation)|(?:fun)|(?:fund)|(?:furniture)|(?:futbol)|(?:fyi)|(?:gallery)|(?:game)|(?:games)|(?:garden)|(?:gay)|(?:gdn)|(?:gifts)|(?:gives)|(?:giving)|(?:glass)|(?:global)|(?:gmbh)|(?:gold)|(?:golf)|(?:graphics)|(?:gratis)|(?:green)|(?:gripe)|(?:group)|(?:guide)|(?:guitars)|(?:guru)|(?:hair)|(?:hamburg)|(?:haus)|(?:health)|(?:healthcare)|(?:help)|(?:hiphop)|(?:hockey)|(?:holdings)|(?:holiday)|(?:homes)|(?:horse)|(?:hospital)|(?:host)|(?:hosting)|(?:house)|(?:how)|(?:icu)|(?:info)|(?:ink)|(?:institute)|(?:insure)|(?:international)|(?:investments)|(?:irish)|(?:jewelry)|(?:jetzt)|(?:juegos)|(?:kaufen)|(?:kids)|(?:kitchen)|(?:kiwi)|(?:krd)|(?:kyoto)|(?:land)|(?:lat)|(?:law)|(?:lawyer)|(?:lease)|(?:legal)|(?:lgbt)|(?:life)|(?:lighting)|(?:limited)|(?:limo)|(?:link)|(?:live)|(?:loan)|(?:loans)|(?:lol)|(?:london)|(?:love)|(?:ltd)|(?:ltda)|(?:luxury)|(?:maison)|(?:management)|(?:market)|(?:marketing)|(?:markets)|(?:mba)|(?:media)|(?:melbourne)|(?:meme)|(?:memorial)|(?:men)|(?:miami)|(?:mobi)|(?:moda)|(?:moe)|(?:mom)|(?:money)|(?:monster)|(?:mortgage)|(?:motorcycles)|(?:mov)|(?:movie)|(?:nagoya)|(?:name)|(?:navy)|(?:network)|(?:new)|(?:news)|(?:ngo)|(?:ninja)|(?:now)|(?:nyc)|(?:observer)|(?:okinawa)|(?:one)|(?:ong)|(?:onl)|(?:online)|(?:organic)|(?:osaka)|(?:page)|(?:paris)|(?:partners)|(?:parts)|(?:party)|(?:pet)|(?:phd)|(?:photo)|(?:photography)|(?:photos)|(?:pics)|(?:pictures)|(?:pink)|(?:pizza)|(?:place)|(?:plumbing)|(?:plus)|(?:poker)|(?:porn)|(?:press)|(?:pro)|(?:productions)|(?:prof)|(?:promo)|(?:properties)|(?:property)|(?:protection)|(?:pub)|(?:quest)|(?:racing)|(?:recipes)|(?:red)|(?:rehab)|(?:reise)|(?:reisen)|(?:rent)|(?:rentals)|(?:repair)|(?:report)|(?:republican)|(?:rest)|(?:restaurant)|(?:review)|(?:reviews)|(?:rip)|(?:rocks)|(?:rodeo)|(?:rsvp)|(?:run)|(?:saarland)|(?:sale)|(?:salon)|(?:sarl)|(?:sbs)|(?:school)|(?:schule)|(?:science)|(?:services)|(?:sex)|(?:sexy)|(?:sh)|(?:shoes)|(?:shop)|(?:shopping)|(?:show)|(?:singles)|(?:site)|(?:skin)|(?:soccer)|(?:social)|(?:software)|(?:solar)|(?:solutions)|(?:soy)|(?:space)|(?:spiegel)|(?:study)|(?:style)|(?:sucks)|(?:supply)|(?:support)|(?:surf)|(?:surgery)|(?:systems)|(?:tax)|(?:taxi)|(?:team)|(?:tech)|(?:technology)|(?:tel)|(?:theater)|(?:tips)|(?:tires)|(?:today)|(?:tools)|(?:top)|(?:tours)|(?:town)|(?:toys)|(?:trade)|(?:training)|(?:tube)|(?:uk)|(?:university)|(?:uno)|(?:vacations)|(?:ventures)|(?:vet)|(?:video)|(?:villas)|(?:vin)|(?:vip)|(?:vision)|(?:vlaanderen)|(?:vodka)|(?:vote)|(?:voting)|(?:voyage)|(?:wales)|(?:wang)|(?:watch)|(?:webcam)|(?:website)|(?:wedding)|(?:wiki)|(?:wine)|(?:work)|(?:works)|(?:world)|(?:wtf)|(?:xyz)|(?:yoga)|(?:yokohama)|(?:you)|(?:zone)|(?:ac)|(?:ad)|(?:ae)|(?:af)|(?:ag)|(?:ai)|(?:al)|(?:am)|(?:an)|(?:ao)|(?:aq)|(?:ar)|(?:as)|(?:at)|(?:au)|(?:aw)|(?:ax)|(?:az)|(?:ba)|(?:bb)|(?:bd)|(?:be)|(?:bf)|(?:bg)|(?:bh)|(?:bi)|(?:bj)|(?:bm)|(?:bn)|(?:bo)|(?:br)|(?:bs)|(?:bt)|(?:bv)|(?:bw)|(?:by)|(?:bz)|(?:ca)|(?:cc)|(?:cd)|(?:cf)|(?:cg)|(?:ch)|(?:ci)|(?:ck)|(?:cl)|(?:cm)|(?:cn)|(?:co)|(?:cr)|(?:cu)|(?:cv)|(?:cw)|(?:cx)|(?:cy)|(?:cz)|(?:de)|(?:dj)|(?:dk)|(?:dm)|(?:do)|(?:dz)|(?:ec)|(?:ee)|(?:eg)|(?:er)|(?:es)|(?:et)|(?:eu)|(?:fi)|(?:fj)|(?:fk)|(?:fm)|(?:fo)|(?:fr)|(?:ga)|(?:gb)|(?:gd)|(?:ge)|(?:gf)|(?:gg)|(?:gh)|(?:gi)|(?:gl)|(?:gm)|(?:gn)|(?:gp)|(?:gq)|(?:gr)|(?:gs)|(?:gt)|(?:gu)|(?:gw)|(?:gy)|(?:hk)|(?:hm)|(?:hn)|(?:hr)|(?:ht)|(?:hu)|(?:id)|(?:ie)|(?:il)|(?:im)|(?:in)|(?:io)|(?:iq)|(?:ir)|(?:is)|(?:it)|(?:je)|(?:jm)|(?:jo)|(?:jp)|(?:ke)|(?:kg)|(?:kh)|(?:ki)|(?:km)|(?:kn)|(?:kp)|(?:kr)|(?:kw)|(?:ky)|(?:kz)|(?:la)|(?:lb)|(?:lc)|(?:li)|(?:lk)|(?:lr)|(?:ls)|(?:lt)|(?:lu)|(?:lv)|(?:ly)|(?:ma)|(?:mc)|(?:md)|(?:me)|(?:mg)|(?:mh)|(?:mk)|(?:ml)|(?:mm)|(?:mn)|(?:mo)|(?:mp)|(?:mq)|(?:mr)|(?:ms)|(?:mt)|(?:mu)|(?:mv)|(?:mw)|(?:mx)|(?:my)|(?:mz)|(?:na)|(?:nc)|(?:ne)|(?:nf)|(?:ng)|(?:ni)|(?:nl)|(?:no)|(?:np)|(?:nr)|(?:nu)|(?:nz)|(?:om)|(?:pa)|(?:pe)|(?:pf)|(?:pg)|(?:ph)|(?:pk)|(?:pl)|(?:pm)|(?:pn)|(?:pr)|(?:ps)|(?:pt)|(?:pw)|(?:py)|(?:qa)|(?:re)|(?:ro)|(?:rs)|(?:ru)|(?:rw)|(?:sa)|(?:sb)|(?:sc)|(?:sd)|(?:se)|(?:sg)|(?:sh)|(?:si)|(?:sj)|(?:sk)|(?:sl)|(?:sm)|(?:sn)|(?:so)|(?:sr)|(?:st)|(?:su)|(?:sv)|(?:sx)|(?:sy)|(?:sz)|(?:tc)|(?:td)|(?:tf)|(?:tg)|(?:th)|(?:tj)|(?:tk)|(?:tl)|(?:tm)|(?:tn)|(?:to)|(?:tp)|(?:tr)|(?:tt)|(?:tv)|(?:tw)|(?:tz)|(?:ua)|(?:ug)|(?:uk)|(?:us)|(?:uy)|(?:uz)|(?:va)|(?:vc)|(?:ve)|(?:vg)|(?:vi)|(?:vn)|(?:vu)|(?:wf)|(?:ws)|(?:ye)|(?:yt)|(?:za)|(?:zm)|(?:zw))(?:/[^\s()<>\"']*)?)"  # noqa: E501

    PATTERNS = [
        Pattern("Standard Url", "(?i)(?:https?://)" + BASE_URL_REGEX, 0.6),
        Pattern("Non schema URL", "(?i)" + BASE_URL_REGEX, 0.5),
        Pattern("Quoted URL", r'(?i)["\'](https?://' + BASE_URL_REGEX + r')["\']', 0.6),
        Pattern(
            "Quoted Non-schema URL", r'(?i)["\'](' + BASE_URL_REGEX + r')["\']', 0.5
        ),
    ]

    CONTEXT = ["url", "website", "link"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "URL",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
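To make the scoring scheme above concrete, here is a minimal stdlib sketch of the same idea. The SIMPLE_URL regex is a hypothetical stand-in for BASE_URL_REGEX that covers only a few TLDs; it is not the pattern Presidio ships.

```python
import re

# Hypothetical, heavily simplified stand-in for BASE_URL_REGEX above:
# only a handful of TLDs, same optional path tail.
SIMPLE_URL = r"(?:www\.)?[a-z0-9.\-]+\.(?:com|org|net|edu)(?:/[^\s()<>\"']*)?"

# Mirror the first two Pattern entries: a URL with a schema scores
# higher (0.6) than a bare, non-schema domain (0.5).
patterns = [
    ("Standard Url", r"(?i)(?:https?://)" + SIMPLE_URL, 0.6),
    ("Non schema URL", r"(?i)" + SIMPLE_URL, 0.5),
]

def find_urls(text):
    hits = []
    for name, regex, score in patterns:
        for m in re.finditer(regex, text):
            hits.append((name, m.group(), score))
    return hits

hits = find_urls("Docs at https://example.com/a and also example.org")
```

Note that the two patterns overlap: the schema'd match and the bare-domain match both fire on the same URL, which is exactly why PatternRecognizer deduplicates results downstream.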

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the defined patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
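analyze() delegates the matching to the private __analyze_patterns helper. A rough sketch of what that loop amounts to, assuming patterns are (name, regex, score) triples (the real helper also runs validation and builds an AnalysisExplanation per match):

```python
import re
from typing import List, NamedTuple, Tuple

class Result(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float

def analyze_patterns(text: str, entity_type: str,
                     patterns: List[Tuple[str, str, float]],
                     flags: int = re.IGNORECASE) -> List[Result]:
    # For each pattern, collect every non-empty regex match as a
    # scored result spanning [start, end) in the input text.
    results = []
    for _name, regex, score in patterns:
        for m in re.finditer(regex, text, flags):
            if m.start() != m.end():  # skip zero-length matches
                results.append(Result(entity_type, m.start(), m.end(), score))
    return results

res = analyze_patterns(
    "acct 12345678", "US_BANK_NUMBER",
    [("Bank Account (weak)", r"\b[0-9]{8,17}\b", 0.05)],
)
```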

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
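The base implementation is deliberately a no-op; by default, AnalyzerEngine wires in a LemmaContextAwareEnhancer to do the real boosting. A toy illustration of the general idea, with hypothetical simplifications (plain lowercase token lookup in a fixed window, no lemmatization, no recognition_metadata bookkeeping):

```python
def boost_with_context(text, results, context_words,
                       boost=0.35, window=5):
    # If any context word appears among the `window` tokens preceding
    # a detected span, bump that result's score (capped at 1.0).
    boosted = []
    for start, end, score in results:
        preceding = text[:start].lower().split()[-window:]
        if any(word in preceding for word in context_words):
            score = min(score + boost, 1.0)
        boosted.append((start, end, score))
    return boosted

out = boost_with_context(
    "my bank account 12345678",
    [(16, 24, 0.05)],          # a weak US_BANK_NUMBER match
    ["bank", "account", "acct"],
)
```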

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates in case the two results have identical start and ends and types.

PARAMETER DESCRIPTION
results

The list of recognizer results to deduplicate

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

The deduplicated list of results

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
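A small worked example of the containment rule, using plain tuples in place of RecognizerResult and assuming contained_in means the span lies entirely inside the other result's span:

```python
results = [
    ("URL", 8, 30, 0.5),    # contained in the 0.6 URL match below
    ("URL", 0, 30, 0.6),
    ("EMAIL", 8, 30, 0.5),  # same span, different type -> kept
]

# Same ordering as above: highest score first, then leftmost, then longest.
results.sort(key=lambda r: (-r[3], r[1], -(r[2] - r[1])))

kept = []
for etype, start, end, score in results:
    if score == 0:
        continue
    # Drop a result contained in an already-kept result of the same type.
    contained = any(
        k_etype == etype and k_start <= start and end <= k_end
        for k_etype, k_start, k_end, _ in kept
    )
    if not contained:
        kept.append((etype, start, end, score))
```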

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string using the replacement pairs specified as an argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
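A typical caller is a number recognizer that strips separators from a candidate value before running a checksum. With the same replacement logic as above:

```python
def sanitize_value(text, replacement_pairs):
    # Apply each (search, replacement) pair in order, as above.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

# e.g. normalize a candidate card/account number before validation
cleaned = sanitize_value("4095-2609-9393 4932", [("-", ""), (" ", "")])
```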

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to validated.
    Only the part in text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UsBankRecognizer

Bases: PatternRecognizer

Recognizes US bank account numbers using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'US_BANK_NUMBER'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string using the replacement pairs specified as an argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/us_bank_recognizer.py
class UsBankRecognizer(PatternRecognizer):
    """
    Recognizes US bank number using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Bank Account (weak)",
            r"\b[0-9]{8,17}\b",
            0.05,
        ),
    ]

    CONTEXT = [
        # Task #603: Support keyphrases: change to "checking account"
        # as part of keyphrase change
        "check",
        "account",
        "account#",
        "acct",
        "bank",
        "save",
        "debit",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_BANK_NUMBER",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
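The single weak pattern just requires 8 to 17 consecutive digits bounded by word boundaries, which is why it scores only 0.05 on its own; the CONTEXT words are expected to supply most of the confidence. A quick check of the boundary behavior:

```python
import re

BANK_WEAK = re.compile(r"\b[0-9]{8,17}\b")  # the "Bank Account (weak)" regex

matches = [m.group() for m in BANK_WEAK.finditer(
    "acct 945456787654, ZIP 98052, and 123456789012345678"
)]
# 945456787654 (12 digits) matches; 98052 is too short;
# the 18-digit run cannot satisfy both word boundaries, so it is skipped.
```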

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the defined patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return:
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise, the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer-specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements, such as lemmatized tokens, that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in derived class in case a custom logic
    is needed, otherwise return value will be equal to
    raw_results.

    in case a result score is boosted, derived class need to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contains elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: A list of the supported language by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates in case the two results have identical start and ends and types.

PARAMETER DESCRIPTION
results

The list of recognizer results to deduplicate

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

The deduplicated list of results

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates in case the two results
    have identical start and ends and types.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
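The deduplication above can be exercised standalone with a minimal stand-in for `RecognizerResult` (the `Result` dataclass and its `contained_in` method below are local sketches mirroring the real API, not the Presidio classes themselves):

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    """Local stand-in for RecognizerResult (frozen so it is hashable)."""

    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results: List[Result]) -> List[Result]:
    # Exact duplicates collapse via the set; sorting by score (desc),
    # start (asc), and span length (desc) lets stronger, wider matches win.
    ordered = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    filtered: List[Result] = []
    for result in ordered:
        if result.score == 0:
            continue
        # Drop a result contained in an already-kept result of the same type.
        if any(
            result.contained_in(kept) and result.entity_type == kept.entity_type
            for kept in filtered
        ):
            continue
        filtered.append(result)
    return filtered
```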

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
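Since the method is a pure string transformation, it can be demonstrated standalone:

```python
from typing import List, Tuple


def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    # Apply each (search, replacement) pair in order, e.g. to strip
    # separators from a candidate value before validating it.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
```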

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None
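As a hedged sketch of the kind of checksum a subclass might run in `validate_result`, here is a Luhn check (as a credit-card recognizer could use) written as a standalone function; the name and standalone form are for illustration, in Presidio this logic would live in a `PatternRecognizer` subclass override:

```python
from typing import Optional


def luhn_validate(pattern_text: str) -> Optional[bool]:
    """Return True if the digits pass the Luhn checksum, None if no digits."""
    digits = [int(ch) for ch in pattern_text if ch.isdigit()]
    if not digits:
        return None  # nothing to validate
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```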

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
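A minimal standalone sketch of the SSN-style pruning the docstring mentions; the function name and group handling are assumptions for illustration, not Presidio's actual SSN recognizer:

```python
def invalidate_same_digit_groups(pattern_text: str) -> bool:
    """Return True (invalidated) if any dash- or space-separated
    group consists of a single repeated digit, e.g. '000'."""
    groups = pattern_text.replace("-", " ").split()
    return any(len(set(group)) == 1 for group in groups)
```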

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UsLicenseRecognizer

Bases: PatternRecognizer

Recognizes US driver licenses using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'US_DRIVER_LICENSE'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/us_driver_license_recognizer.py
class UsLicenseRecognizer(PatternRecognizer):
    """
    Recognizes US driver licenses using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Driver License - Alphanumeric (weak)",
            r"\b([A-Z][0-9]{3,6}|[A-Z][0-9]{5,9}|[A-Z][0-9]{6,8}|[A-Z][0-9]{4,8}|[A-Z][0-9]{9,11}|[A-Z]{1,2}[0-9]{5,6}|H[0-9]{8}|V[0-9]{6}|X[0-9]{8}|[A-Z]{2}[0-9]{2,5}|[A-Z]{2}[0-9]{3,7}|[0-9]{2}[A-Z]{3}[0-9]{5,6}|[A-Z][0-9]{13,14}|[A-Z][0-9]{18}|[A-Z][0-9]{6}R|[A-Z][0-9]{9}|[A-Z][0-9]{1,12}|[0-9]{9}[A-Z]|[A-Z]{2}[0-9]{6}[A-Z]|[0-9]{8}[A-Z]{2}|[0-9]{3}[A-Z]{2}[0-9]{4}|[A-Z][0-9][A-Z][0-9][A-Z]|[0-9]{7,8}[A-Z])\b",  # noqa: E501
            0.3,
        ),
        Pattern(
            "Driver License - Digits (very weak)",
            r"\b([0-9]{6,14}|[0-9]{16})\b",  # noqa: E501
            0.01,
        ),
    ]

    CONTEXT = [
        "driver",
        "license",
        "permit",
        "lic",
        "identification",
        "dls",
        "cdls",
        "lic#",
        "driving",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_DRIVER_LICENSE",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            supported_language=supported_language,
            patterns=patterns,
            context=context,
        )
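To see the flavor of match the weak alphanumeric pattern yields, here is a reduced fragment (one letter followed by seven digits, a format several states use); this is a simplification for illustration, not the production regex:

```python
import re

# Simplified fragment of the "Driver License - Alphanumeric (weak)" pattern.
LICENSE_FRAGMENT = re.compile(r"\b[A-Z][0-9]{7}\b")

text = "Driver license number D1234567 was presented."
match = LICENSE_FRAGMENT.search(text)
```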

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the loaded patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by the loaded patterns
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results
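Internally, pattern matching boils down to running each regex over the text and attaching the pattern's score. A simplified sketch (plain tuples stand in for `RecognizerResult`; the helper name is an assumption):

```python
import re
from typing import List, Tuple


def analyze_patterns(
    text: str,
    patterns: List[Tuple[str, str, float]],  # (name, regex, score)
    entity_type: str,
) -> List[Tuple[str, int, int, float]]:
    """Run each pattern over the text; emit (entity, start, end, score)."""
    results = []
    for _name, regex, score in patterns:
        for m in re.finditer(regex, text):
            if m.start() != m.end():  # skip empty matches
                results.append((entity_type, m.start(), m.end(), score))
    return results
```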

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results
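A hedged sketch of the kind of override a derived recognizer could implement here: boost a result's score when a context word appears near the match. The tuple shape, window size, and boost value below are illustrative assumptions, not Presidio's `LemmaContextAwareEnhancer` logic:

```python
from typing import List, Tuple


def boost_by_context(
    text: str,
    results: List[Tuple[int, int, float]],  # (start, end, score)
    context: List[str],
    window: int = 30,
    boost: float = 0.35,
) -> List[Tuple[int, int, float]]:
    """Raise the score of matches that have a context word nearby."""
    boosted = []
    for start, end, score in results:
        # Look a fixed number of characters around the match.
        neighborhood = text[max(0, start - window) : end + window].lower()
        if any(word in neighborhood for word in context):
            score = min(1.0, score + boost)
        boosted.append((start, end, score))
    return boosted
```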

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

The list of recognizer results to de-duplicate

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

The list of results after removing duplicates

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: The list of recognizer results to de-duplicate
    :return: The list of results after removing duplicates
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UsItinRecognizer

Bases: PatternRecognizer

Recognizes US ITIN (Individual Taxpayer Identification Number) using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'US_ITIN'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/us_itin_recognizer.py
class UsItinRecognizer(PatternRecognizer):
    """
    Recognizes US ITIN (Individual Taxpayer Identification Number) using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern(
            "Itin (very weak)",
            r"\b9\d{2}[- ](5\d|6[0-5]|7\d|8[0-8]|9([0-2]|[4-9]))\d{4}\b|\b9\d{2}(5\d|6[0-5]|7\d|8[0-8]|9([0-2]|[4-9]))[- ]\d{4}\b",  # noqa: E501
            0.05,
        ),
        Pattern(
            "Itin (weak)",
            r"\b9\d{2}(5\d|6[0-5]|7\d|8[0-8]|9([0-2]|[4-9]))\d{4}\b",  # noqa: E501
            0.3,
        ),
        Pattern(
            "Itin (medium)",
            r"\b9\d{2}[- ](5\d|6[0-5]|7\d|8[0-8]|9([0-2]|[4-9]))[- ]\d{4}\b",  # noqa: E501
            0.5,
        ),
    ]

    CONTEXT = ["individual", "taxpayer", "itin", "tax", "payer", "taxid", "tin"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_ITIN",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
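The medium-confidence pattern above can be tried directly: the first group must start with 9 and the middle group must fall in the IRS-assigned ranges (50-65, 70-88, 90-92, 94-99). The sample value is synthetic:

```python
import re

# The "Itin (medium)" pattern from the class above, verbatim.
ITIN_MEDIUM = re.compile(
    r"\b9\d{2}[- ](5\d|6[0-5]|7\d|8[0-8]|9([0-2]|[4-9]))[- ]\d{4}\b"
)

match = ITIN_MEDIUM.search("Taxpayer ID: 912-70-3456")
```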

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]

A list of results detected by the loaded patterns

Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by the loaded patterns
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will equal raw_recognizer_results.

If a result score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY].

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

The NLP artifacts contain elements such as lemmatized tokens that improve the accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
    based on recognizer specific context.
    :param other_raw_recognizer_results: Other recognizer results matched in
    the given text to allow related entity context enhancement
    :param nlp_artifacts: The nlp artifacts contain elements
                          such as lemmatized tokens for better
                          accuracy of the context enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the supported entities by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the supported entities by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

The list of recognizer results to de-duplicate

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

The list of results after removing duplicates

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results have
    identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
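The deduplication above can be exercised on its own. The sketch below mirrors the same sort-then-filter logic with a minimal, hypothetical `Result` stand-in for `RecognizerResult` (only the fields the algorithm touches):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for RecognizerResult, for illustration only."""

    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results):
    # Highest score first; ties broken by earlier start, then longer span.
    results = sorted(set(results), key=lambda r: (-r.score, r.start, -(r.end - r.start)))
    kept = []
    for result in results:
        if result.score == 0:
            continue  # zero-score results are always dropped
        if any(result.contained_in(k) and result.entity_type == k.entity_type for k in kept):
            continue  # contained in a higher-ranked result of the same type
        kept.append(result)
    return kept


dupes = [
    Result("PHONE_NUMBER", 0, 12, 0.8),
    Result("PHONE_NUMBER", 0, 12, 0.8),   # exact duplicate, removed by set()
    Result("PHONE_NUMBER", 4, 12, 0.4),   # contained in the first, same type
    Result("EMAIL_ADDRESS", 20, 35, 0.0),  # zero score
]
print(remove_duplicates(dupes))  # only the 0.8 PHONE_NUMBER result survives
```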

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
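Since `sanitize_value` is a plain string-replacement loop, its behavior is easy to reproduce and check outside the class (the phone number below is a made-up example):

```python
def sanitize_value(text: str, replacement_pairs) -> str:
    # Same loop as the static method above: apply each pair in order.
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Strip formatting characters before pattern validation.
pairs = [("-", ""), ("(", ""), (")", ""), (" ", "")]
print(sanitize_value("(425) 555-0100", pairs))  # → 4255550100
```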

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None
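In the base class both `validate_result` and `invalidate_result` return `None`, meaning "no decision"; subclasses override them with pattern-specific checks. As an illustration of the kind of checksum a `validate_result` override might run, here is a plain Luhn check, a generic example not tied to any particular presidio recognizer:

```python
def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111111111111111"))  # → True  (well-known Luhn-valid test number)
print(luhn_valid("4111111111111112"))  # → False
```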

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UsPassportRecognizer

Bases: PatternRecognizer

Recognizes US Passport number using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'US_PASSPORT'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

invalidate_result

Logic to check for result invalidation by running pruning logic.

build_regex_explanation

Construct an explanation for why this entity was detected.

Source code in presidio_analyzer/predefined_recognizers/us_passport_recognizer.py
class UsPassportRecognizer(PatternRecognizer):
    """
    Recognizes US Passport number using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    # Weak pattern: all passport numbers are a weak match, e.g., 14019033
    PATTERNS = [
        Pattern("Passport (very weak)", r"(\b[0-9]{9}\b)", 0.05),
        Pattern("Passport Next Generation (very weak)", r"(\b[A-Z][0-9]{8}\b)", 0.1),
    ]
    CONTEXT = ["us", "united", "states", "passport", "passport#", "travel", "document"]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_PASSPORT",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )
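The two patterns above are ordinary regular expressions and can be tried directly with Python's `re` module; the passport numbers below are made up for illustration:

```python
import re

# The same expressions used in PATTERNS above.
nine_digits = re.compile(r"\b[0-9]{9}\b")    # "Passport (very weak)"
next_gen = re.compile(r"\b[A-Z][0-9]{8}\b")  # "Passport Next Generation (very weak)"

text = "US passport 912803456 was replaced by next-generation passport C01234567."
print(nine_digits.findall(text))  # → ['912803456']
print(next_gen.findall(text))     # → ['C01234567']
```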

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by this recognizer
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

NLP artifacts containing elements such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result's score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
        based on recognizer-specific context
    :param other_raw_recognizer_results: Other recognizer results matched in
        the given text, to allow related-entity context enhancement
    :param nlp_artifacts: NLP artifacts containing elements such as
        lemmatized tokens, for better accuracy of the context
        enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

List[RecognizerResult]

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results have
    identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None

invalidate_result

invalidate_result(pattern_text: str) -> Optional[bool]

Logic to check for result invalidation by running pruning logic.

For example, each SSN number group should not consist of all the same digits.

PARAMETER DESCRIPTION
pattern_text

The text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the result is invalidated

Source code in presidio_analyzer/pattern_recognizer.py
def invalidate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Logic to check for result invalidation by running pruning logic.

    For example, each SSN number group should not consist of all the same digits.

    :param pattern_text: the text to be validated.
    Only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the result is invalidated
    """
    return None

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` " f"using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

UsSsnRecognizer

Bases: PatternRecognizer

Recognize US Social Security Number (SSN) using regex.

PARAMETER DESCRIPTION
patterns

List of patterns to be used by this recognizer

TYPE: Optional[List[Pattern]] DEFAULT: None

context

List of context words to increase confidence in detection

TYPE: Optional[List[str]] DEFAULT: None

supported_language

Language this recognizer supports

TYPE: str DEFAULT: 'en'

supported_entity

The entity this recognizer can detect

TYPE: str DEFAULT: 'US_SSN'

METHOD DESCRIPTION
analyze

Analyzes text to detect PII using regular expressions or deny-lists.

enhance_using_context

Enhance confidence score using context of the entity.

get_supported_entities

Return the list of entities this recognizer can identify.

get_supported_language

Return the language this recognizer can support.

get_version

Return the version of this recognizer.

to_dict

Serialize instance into a dictionary.

from_dict

Create instance from a serialized dict.

remove_duplicates

Remove duplicate results.

sanitize_value

Cleanse the input string of the replacement pairs specified as argument.

validate_result

Validate the pattern logic e.g., by running checksum on a detected pattern.

build_regex_explanation

Construct an explanation for why this entity was detected.

invalidate_result

Check if the pattern text cannot be validated as a US_SSN entity.

Source code in presidio_analyzer/predefined_recognizers/us_ssn_recognizer.py
class UsSsnRecognizer(PatternRecognizer):
    """Recognize US Social Security Number (SSN) using regex.

    :param patterns: List of patterns to be used by this recognizer
    :param context: List of context words to increase confidence in detection
    :param supported_language: Language this recognizer supports
    :param supported_entity: The entity this recognizer can detect
    """

    PATTERNS = [
        Pattern("SSN1 (very weak)", r"\b([0-9]{5})-([0-9]{4})\b", 0.05),  # noqa E501
        Pattern("SSN2 (very weak)", r"\b([0-9]{3})-([0-9]{6})\b", 0.05),  # noqa E501
        Pattern("SSN3 (very weak)", r"\b(([0-9]{3})-([0-9]{2})-([0-9]{4}))\b", 0.05),  # noqa E501
        Pattern("SSN4 (very weak)", r"\b[0-9]{9}\b", 0.05),
        Pattern("SSN5 (medium)", r"\b([0-9]{3})[- .]([0-9]{2})[- .]([0-9]{4})\b", 0.5),
    ]

    CONTEXT = [
        "social",
        "security",
        # "sec", # Task #603: Support keyphrases ("social sec")
        "ssn",
        "ssns",
        # "ssn#",  # iss:1452 - a # does not work with LemmaContextAwareEnhancer
        # "ss#",  # iss:1452 - a # does not work with LemmaContextAwareEnhancer
        "ssid",
    ]

    def __init__(
        self,
        patterns: Optional[List[Pattern]] = None,
        context: Optional[List[str]] = None,
        supported_language: str = "en",
        supported_entity: str = "US_SSN",
    ):
        patterns = patterns if patterns else self.PATTERNS
        context = context if context else self.CONTEXT
        super().__init__(
            supported_entity=supported_entity,
            patterns=patterns,
            context=context,
            supported_language=supported_language,
        )

    def invalidate_result(self, pattern_text: str) -> bool:
        """
        Check if the pattern text cannot be validated as a US_SSN entity.

        :param pattern_text: Text detected as pattern by regex
        :return: True if invalidated
        """
        # if there are delimiters, make sure both delimiters are the same
        delimiter_counts = defaultdict(int)
        for c in pattern_text:
            if c in (".", "-", " "):
                delimiter_counts[c] += 1
        if len(delimiter_counts.keys()) > 1:
            # mismatched delimiters
            return True

        only_digits = "".join(c for c in pattern_text if c.isdigit())
        if all(only_digits[0] == c for c in only_digits):
            # cannot be all same digit
            return True

        if only_digits[3:5] == "00" or only_digits[5:] == "0000":
            # groups cannot be all zeros
            return True

        for sample_ssn in ("000", "666", "123456789", "98765432", "078051120"):
            if only_digits.startswith(sample_ssn):
                return True

        return False
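The invalidation rules above are self-contained string checks, so they can be run standalone. The function below repeats the same logic outside the class, purely to make the individual rules easy to try:

```python
from collections import defaultdict


def invalidate_ssn(pattern_text: str) -> bool:
    """Same checks as UsSsnRecognizer.invalidate_result, repackaged standalone."""
    delimiter_counts = defaultdict(int)
    for c in pattern_text:
        if c in (".", "-", " "):
            delimiter_counts[c] += 1
    if len(delimiter_counts) > 1:
        return True  # mismatched delimiters, e.g. "123-45 6789"

    only_digits = "".join(c for c in pattern_text if c.isdigit())
    if all(only_digits[0] == c for c in only_digits):
        return True  # cannot be all the same digit

    if only_digits[3:5] == "00" or only_digits[5:] == "0000":
        return True  # middle or last group all zeros

    for sample_ssn in ("000", "666", "123456789", "98765432", "078051120"):
        if only_digits.startswith(sample_ssn):
            return True  # known-invalid prefixes and sample numbers

    return False


print(invalidate_ssn("123-45 6789"))  # → True  (mixed delimiters)
print(invalidate_ssn("111-11-1111"))  # → True  (all the same digit)
print(invalidate_ssn("212-55-0000"))  # → True  (zero group)
print(invalidate_ssn("212-55-7890"))  # → False
```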

id property

id

Return a unique identifier of this recognizer.

analyze

analyze(
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]

Analyzes text to detect PII using regular expressions or deny-lists.

PARAMETER DESCRIPTION
text

Text to be analyzed

TYPE: str

entities

Entities this recognizer can detect

TYPE: List[str]

nlp_artifacts

Output values from the NLP engine

TYPE: Optional[NlpArtifacts] DEFAULT: None

regex_flags

regex flags to be used in regex matching

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
List[RecognizerResult]
Source code in presidio_analyzer/pattern_recognizer.py
def analyze(
    self,
    text: str,
    entities: List[str],
    nlp_artifacts: Optional[NlpArtifacts] = None,
    regex_flags: Optional[int] = None,
) -> List[RecognizerResult]:
    """
    Analyzes text to detect PII using regular expressions or deny-lists.

    :param text: Text to be analyzed
    :param entities: Entities this recognizer can detect
    :param nlp_artifacts: Output values from the NLP engine
    :param regex_flags: regex flags to be used in regex matching
    :return: A list of results detected by this recognizer
    """
    results = []

    if self.patterns:
        pattern_result = self.__analyze_patterns(text, regex_flags)
        results.extend(pattern_result)

    return results

enhance_using_context

enhance_using_context(
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]

Enhance confidence score using context of the entity.

Override this method in a derived class if custom logic is needed; otherwise the return value will be equal to raw_recognizer_results.

If a result's score is boosted, the derived class needs to update result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

PARAMETER DESCRIPTION
text

The actual text that was analyzed

TYPE: str

raw_recognizer_results

This recognizer's results, to be updated based on recognizer specific context.

TYPE: List[RecognizerResult]

other_raw_recognizer_results

Other recognizer results matched in the given text to allow related entity context enhancement

TYPE: List[RecognizerResult]

nlp_artifacts

NLP artifacts containing elements such as lemmatized tokens, for better accuracy of the context enhancement process

TYPE: NlpArtifacts

context

list of context words

TYPE: Optional[List[str]] DEFAULT: None

Source code in presidio_analyzer/entity_recognizer.py
def enhance_using_context(
    self,
    text: str,
    raw_recognizer_results: List[RecognizerResult],
    other_raw_recognizer_results: List[RecognizerResult],
    nlp_artifacts: NlpArtifacts,
    context: Optional[List[str]] = None,
) -> List[RecognizerResult]:
    """Enhance confidence score using context of the entity.

    Override this method in a derived class if custom logic
    is needed; otherwise the return value will be equal to
    raw_recognizer_results.

    If a result's score is boosted, the derived class needs to update
    result.recognition_metadata[RecognizerResult.IS_SCORE_ENHANCED_BY_CONTEXT_KEY]

    :param text: The actual text that was analyzed
    :param raw_recognizer_results: This recognizer's results, to be updated
        based on recognizer-specific context
    :param other_raw_recognizer_results: Other recognizer results matched in
        the given text, to allow related-entity context enhancement
    :param nlp_artifacts: NLP artifacts containing elements such as
        lemmatized tokens, for better accuracy of the context
        enhancement process
    :param context: list of context words
    """
    return raw_recognizer_results

get_supported_entities

get_supported_entities() -> List[str]

Return the list of entities this recognizer can identify.

RETURNS DESCRIPTION
List[str]

A list of the entities supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_entities(self) -> List[str]:
    """
    Return the list of entities this recognizer can identify.

    :return: A list of the entities supported by this recognizer
    """
    return self.supported_entities

get_supported_language

get_supported_language() -> str

Return the language this recognizer can support.

RETURNS DESCRIPTION
str

The language supported by this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_supported_language(self) -> str:
    """
    Return the language this recognizer can support.

    :return: The language supported by this recognizer
    """
    return self.supported_language

get_version

get_version() -> str

Return the version of this recognizer.

RETURNS DESCRIPTION
str

The current version of this recognizer

Source code in presidio_analyzer/entity_recognizer.py
def get_version(self) -> str:
    """
    Return the version of this recognizer.

    :return: The current version of this recognizer
    """
    return self.version

to_dict

to_dict() -> Dict

Serialize instance into a dictionary.

Source code in presidio_analyzer/pattern_recognizer.py
def to_dict(self) -> Dict:
    """Serialize instance into a dictionary."""
    return_dict = super().to_dict()

    return_dict["patterns"] = [pat.to_dict() for pat in self.patterns]
    return_dict["deny_list"] = self.deny_list
    return_dict["context"] = self.context
    return_dict["supported_entity"] = return_dict["supported_entities"][0]
    del return_dict["supported_entities"]

    return return_dict

from_dict classmethod

from_dict(entity_recognizer_dict: Dict) -> PatternRecognizer

Create instance from a serialized dict.

Source code in presidio_analyzer/pattern_recognizer.py
@classmethod
def from_dict(cls, entity_recognizer_dict: Dict) -> "PatternRecognizer":
    """Create instance from a serialized dict."""
    patterns = entity_recognizer_dict.get("patterns")
    if patterns:
        patterns_list = [Pattern.from_dict(pat) for pat in patterns]
        entity_recognizer_dict["patterns"] = patterns_list

    return cls(**entity_recognizer_dict)
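
To illustrate the round trip, here is the shape of the dictionary `to_dict` produces, per the source above (the name, pattern, and entity values are made up for the example): the single-element `supported_entities` list is flattened into a `supported_entity` key, which `from_dict` passes back to the constructor as a keyword argument.

```python
# Shape of the dictionary produced by PatternRecognizer.to_dict(), per the
# source above. Values are illustrative, not from a real recognizer.
serialized = {
    "name": "TitleRecognizer",
    "supported_language": "en",
    "patterns": [
        # Each pattern is itself serialized via Pattern.to_dict()
        {"name": "titles", "regex": r"\b(Mr|Mrs|Dr)\.", "score": 0.5},
    ],
    "deny_list": None,
    "context": None,
    # to_dict() flattens the single-entity "supported_entities" list:
    "supported_entity": "TITLE",
}

# from_dict() reverses this: pattern dicts become Pattern objects again,
# and the remaining keys are passed as constructor keyword arguments.
assert "supported_entities" not in serialized
assert serialized["supported_entity"] == "TITLE"
```

This dictionary shape is also what the REST API accepts for ad-hoc recognizers (see `AnalyzerRequest` below, which calls `PatternRecognizer.from_dict` on each entry).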

remove_duplicates staticmethod

remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]

Remove duplicate results.

Remove duplicates when two results have identical start, end, and entity type.

PARAMETER DESCRIPTION
results

The list of recognizer results to deduplicate

TYPE: List[RecognizerResult]

RETURNS DESCRIPTION
List[RecognizerResult]

List[RecognizerResult]

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def remove_duplicates(results: List[RecognizerResult]) -> List[RecognizerResult]:
    """
    Remove duplicate results.

    Remove duplicates when two results
    have identical start, end, and entity type.
    :param results: List[RecognizerResult]
    :return: List[RecognizerResult]
    """
    results = list(set(results))
    results = sorted(results, key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered_results = []

    for result in results:
        if result.score == 0:
            continue

        to_keep = result not in filtered_results  # equals based comparison
        if to_keep:
            for filtered in filtered_results:
                # If result is contained in one of the other results
                if (
                    result.contained_in(filtered)
                    and result.entity_type == filtered.entity_type
                ):
                    to_keep = False
                    break

        if to_keep:
            filtered_results.append(result)

    return filtered_results
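
The deduplication strategy above can be exercised standalone. This sketch uses a minimal stand-in dataclass instead of the real `RecognizerResult` (illustrative only), but applies the same rules: drop exact duplicates and zero-score results, sort by descending score, and discard a result contained in an already-kept result of the same entity type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Stand-in for RecognizerResult with just the fields the logic reads."""
    entity_type: str
    start: int
    end: int
    score: float

    def contained_in(self, other: "Result") -> bool:
        return self.start >= other.start and self.end <= other.end


def remove_duplicates(results):
    # Same ordering as the source: highest score first, then position.
    results = sorted(set(results),
                     key=lambda x: (-x.score, x.start, -(x.end - x.start)))
    filtered = []
    for r in results:
        if r.score == 0:
            continue
        # Skip results contained in a kept result of the same entity type.
        if any(r.contained_in(f) and r.entity_type == f.entity_type
               for f in filtered):
            continue
        filtered.append(r)
    return filtered


hits = [
    Result("PHONE_NUMBER", 10, 22, 0.85),
    Result("PHONE_NUMBER", 10, 22, 0.85),  # exact duplicate
    Result("PHONE_NUMBER", 12, 20, 0.40),  # contained in the first
    Result("EMAIL_ADDRESS", 30, 45, 0.0),  # zero score, dropped
]
assert remove_duplicates(hits) == [Result("PHONE_NUMBER", 10, 22, 0.85)]
```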

sanitize_value staticmethod

sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str

Cleanse the input string of the replacement pairs specified as argument.

PARAMETER DESCRIPTION
text

input string

TYPE: str

replacement_pairs

pairs of what has to be replaced with which value

TYPE: List[Tuple[str, str]]

RETURNS DESCRIPTION
str

cleansed string

Source code in presidio_analyzer/entity_recognizer.py
@staticmethod
def sanitize_value(text: str, replacement_pairs: List[Tuple[str, str]]) -> str:
    """
    Cleanse the input string of the replacement pairs specified as argument.

    :param text: input string
    :param replacement_pairs: pairs of what has to be replaced with which value
    :return: cleansed string
    """
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text
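
`sanitize_value` is a plain sequential search-and-replace pass. A typical use, sketched standalone here, is stripping formatting characters from a candidate match before validating it:

```python
def sanitize_value(text, replacement_pairs):
    """Apply each (search, replacement) pair in order, as in the source."""
    for search_string, replacement_string in replacement_pairs:
        text = text.replace(search_string, replacement_string)
    return text


# Strip dashes and spaces before validating a card-number candidate:
cleaned = sanitize_value("4111-1111 1111 1111", [("-", ""), (" ", "")])
assert cleaned == "4111111111111111"
```

Because the pairs are applied in order, a later pair can rewrite the output of an earlier one; keep the pairs independent unless that chaining is intended.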

validate_result

validate_result(pattern_text: str) -> Optional[bool]

Validate the pattern logic e.g., by running checksum on a detected pattern.

PARAMETER DESCRIPTION
pattern_text

the text to be validated; only the part of the text that was detected by the regex engine

TYPE: str

RETURNS DESCRIPTION
Optional[bool]

A bool indicating whether the validation was successful.

Source code in presidio_analyzer/pattern_recognizer.py
def validate_result(self, pattern_text: str) -> Optional[bool]:
    """
    Validate the pattern logic e.g., by running checksum on a detected pattern.

    :param pattern_text: the text to be validated;
    only the part of the text that was detected by the regex engine
    :return: A bool indicating whether the validation was successful.
    """
    return None
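
The base class returns `None` (no validation performed); subclasses override `validate_result` to confirm or reject a match, for example by running a checksum over the matched text. As an illustrative sketch only (not Presidio's built-in logic), a card-number recognizer's override might apply the Luhn checksum:

```python
def luhn_checksum_ok(pattern_text: str) -> bool:
    """Luhn checksum over the digits of the matched text, as a
    validate_result-style check might use it. Illustrative sketch."""
    digits = [int(c) for c in pattern_text if c.isdigit()]
    total = 0
    # Double every second digit from the right, subtracting 9 if it
    # exceeds 9, then check the sum is divisible by 10.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


assert luhn_checksum_ok("4111111111111111") is True
assert luhn_checksum_ok("4111111111111112") is False
```

In an override, returning `True` raises the result's score to the maximum, `False` drops it, and `None` leaves it unchanged.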

build_regex_explanation staticmethod

build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation

Construct an explanation for why this entity was detected.

PARAMETER DESCRIPTION
recognizer_name

Name of recognizer detecting the entity

TYPE: str

pattern_name

Regex pattern name which detected the entity

TYPE: str

pattern

Regex pattern logic

TYPE: str

original_score

Score given by the recognizer

TYPE: float

validation_result

Whether validation was used and its result

TYPE: bool

regex_flags

Regex flags used in the regex matching

TYPE: int

RETURNS DESCRIPTION
AnalysisExplanation

Analysis explanation

Source code in presidio_analyzer/pattern_recognizer.py
@staticmethod
def build_regex_explanation(
    recognizer_name: str,
    pattern_name: str,
    pattern: str,
    original_score: float,
    validation_result: bool,
    regex_flags: int,
) -> AnalysisExplanation:
    """
    Construct an explanation for why this entity was detected.

    :param recognizer_name: Name of recognizer detecting the entity
    :param pattern_name: Regex pattern name which detected the entity
    :param pattern: Regex pattern logic
    :param original_score: Score given by the recognizer
    :param validation_result: Whether validation was used and its result
    :param regex_flags: Regex flags used in the regex matching
    :return: Analysis explanation
    """
    textual_explanation = (
        f"Detected by `{recognizer_name}` using pattern `{pattern_name}`"
    )

    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        pattern_name=pattern_name,
        pattern=pattern,
        validation_result=validation_result,
        regex_flags=regex_flags,
        textual_explanation=textual_explanation,
    )
    return explanation

invalidate_result

invalidate_result(pattern_text: str) -> bool

Check if the pattern text cannot be validated as a US_SSN entity.

PARAMETER DESCRIPTION
pattern_text

Text detected as pattern by regex

TYPE: str

RETURNS DESCRIPTION
bool

True if invalidated

Source code in presidio_analyzer/predefined_recognizers/us_ssn_recognizer.py
def invalidate_result(self, pattern_text: str) -> bool:
    """
    Check if the pattern text cannot be validated as a US_SSN entity.

    :param pattern_text: Text detected as pattern by regex
    :return: True if invalidated
    """
    # if there are delimiters, make sure both delimiters are the same
    delimiter_counts = defaultdict(int)
    for c in pattern_text:
        if c in (".", "-", " "):
            delimiter_counts[c] += 1
    if len(delimiter_counts.keys()) > 1:
        # mismatched delimiters
        return True

    only_digits = "".join(c for c in pattern_text if c.isdigit())
    if all(only_digits[0] == c for c in only_digits):
        # cannot be all same digit
        return True

    if only_digits[3:5] == "00" or only_digits[5:] == "0000":
        # groups cannot be all zeros
        return True

    for sample_ssn in ("000", "666", "123456789", "98765432", "078051120"):
        if only_digits.startswith(sample_ssn):
            return True

    return False
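
The invalidation rules above can be run without the recognizer class; this sketch copies the same logic into a standalone function so each rule can be exercised directly:

```python
from collections import defaultdict


def invalidate_ssn(pattern_text: str) -> bool:
    """Standalone copy of the US_SSN invalidation rules shown above.
    Returns True when the candidate cannot be a valid SSN."""
    # Mixed delimiters (e.g. "123-45.6789") invalidate the match.
    delimiter_counts = defaultdict(int)
    for c in pattern_text:
        if c in (".", "-", " "):
            delimiter_counts[c] += 1
    if len(delimiter_counts) > 1:
        return True

    only_digits = "".join(c for c in pattern_text if c.isdigit())
    # All-same-digit numbers cannot be valid.
    if all(only_digits[0] == c for c in only_digits):
        return True
    # The middle group and the last group cannot be all zeros.
    if only_digits[3:5] == "00" or only_digits[5:] == "0000":
        return True
    # Known-invalid prefixes and sample numbers.
    return any(only_digits.startswith(p)
               for p in ("000", "666", "123456789", "98765432", "078051120"))


assert invalidate_ssn("123-45.6789") is True   # mismatched delimiters
assert invalidate_ssn("111-11-1111") is True   # all same digit
assert invalidate_ssn("123-00-6789") is True   # zeroed group
assert invalidate_ssn("222-33-4445") is False  # passes all checks
```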

Misc

presidio_analyzer.analyzer_request.AnalyzerRequest

Analyzer request data.

PARAMETER DESCRIPTION
req_data

A request dictionary with the following fields:
text: the text to analyze
language: the language of the text
entities: list of PII entities to look for in the text; if entities=None, all entities are looked for
correlation_id: cross-call ID for this request
score_threshold: a minimum confidence value for which to return an identified entity
log_decision_process: whether the decision points within the analysis should be logged
return_decision_process: whether the decision points within the analysis should be returned as part of the response

TYPE: Dict

Source code in presidio_analyzer/analyzer_request.py
class AnalyzerRequest:
    """
    Analyzer request data.

    :param req_data: A request dictionary with the following fields:
        text: the text to analyze
        language: the language of the text
        entities: List of PII entities that should be looked for in the text.
        If entities=None then all entities are looked for.
        correlation_id: cross call ID for this request
        score_threshold: A minimum value for which to return an identified entity
        log_decision_process: Should the decision points within the analysis
        be logged
        return_decision_process: Should the decision points within the analysis
        be returned as part of the response
    """

    def __init__(self, req_data: Dict):
        self.text = req_data.get("text")
        self.language = req_data.get("language")
        self.entities = req_data.get("entities")
        self.correlation_id = req_data.get("correlation_id")
        self.score_threshold = req_data.get("score_threshold")
        self.return_decision_process = req_data.get("return_decision_process")
        ad_hoc_recognizers = req_data.get("ad_hoc_recognizers")
        self.ad_hoc_recognizers = []
        if ad_hoc_recognizers:
            self.ad_hoc_recognizers = [
                PatternRecognizer.from_dict(rec) for rec in ad_hoc_recognizers
            ]
        self.context = req_data.get("context")
        self.allow_list = req_data.get("allow_list")
        self.allow_list_match = req_data.get("allow_list_match", "exact")
        self.regex_flags = req_data.get("regex_flags",
                                        re.DOTALL | re.MULTILINE | re.IGNORECASE)
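
A request dictionary for this constructor might look like the following (the values are illustrative). Since each field is read with `dict.get()`, omitted keys simply fall back to `None`, except `allow_list_match` (defaults to `"exact"`) and `regex_flags` (defaults to the combined `DOTALL | MULTILINE | IGNORECASE` flags).

```python
# Illustrative request payload for AnalyzerRequest; field names follow the
# source above, the values are made up for the example.
req_data = {
    "text": "My name is John and my phone number is 212-555-0101",
    "language": "en",
    "entities": ["PERSON", "PHONE_NUMBER"],  # None would mean all entities
    "correlation_id": "6ee56dfa-0262-4c1e-a99e-d579fd6fc104",
    "score_threshold": 0.4,
    "return_decision_process": False,
}

# Omitted keys fall back to their defaults: no ad-hoc recognizers,
# context, or allow list here, and exact allow-list matching.
assert req_data.get("ad_hoc_recognizers") is None
assert req_data.get("allow_list_match", "exact") == "exact"
```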