Example 13: Allow-list to exclude words from being identified as PII

In this example, we will pass a short list of tokens that should not be marked as PII even if one of the recognizers detects them, while other entities of the same kind are still identified.

websites_list = [
    "bing.com",
    "microsoft.com"
]

We will use the built-in recognizers, which include the URLRecognizer and the NLP model EntityRecognizer. First, let's look at the default behavior when we don't specify any list of words for the analyzer to leave unflagged in the text.

from presidio_analyzer import AnalyzerEngine

text1 = "My favorite website is bing.com, his is microsoft.com"

# Create an analyzer with the default recognizers and NLP engine
analyzer = AnalyzerEngine()

# No allow list: every detected entity is reported
result = analyzer.analyze(text=text1, language="en")
print(f"Result:\n {result}")

To specify an allow list, we pass the list of values we want to keep as the allow_list parameter in the call to analyze. In the results, bing.com is no longer recognized as a PII item. Only microsoft.com is still recognized, since we did not include it in the allow list.

result = analyzer.analyze(text=text1, language="en", allow_list=["bing.com"])
print(f"Result:\n {result}")
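
To make the effect of the allow list concrete, here is a minimal, library-free sketch of the idea: a detection is dropped when the text it covers exactly matches an allow-listed value. This is not Presidio's implementation. The `Detection` class and the `span_of` and `filter_allowed` helpers are hypothetical names used only for illustration, and the detections are written by hand rather than produced by a recognizer.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for an analyzer result: an entity type and a text span."""
    entity_type: str
    start: int
    end: int


def span_of(text, token, entity_type):
    """Build a hand-made detection covering the first occurrence of `token`."""
    start = text.index(token)
    return Detection(entity_type, start, start + len(token))


def filter_allowed(text, detections, allow_list):
    """Keep only detections whose covered text is not an exact allow-list match."""
    allowed = set(allow_list)
    return [d for d in detections if text[d.start:d.end] not in allowed]


text1 = "My favorite website is bing.com, his is microsoft.com"

# Assumed detections, mimicking what a URL recognizer might return
detections = [
    span_of(text1, "bing.com", "URL"),
    span_of(text1, "microsoft.com", "URL"),
]

kept = filter_allowed(text1, detections, ["bing.com"])
print([text1[d.start:d.end] for d in kept])  # prints ['microsoft.com']
```

Passing both values, as in websites_list above, would leave no detections in this sketch, because each detected span would match an allow-listed entry.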