Customizing the PII analysis process in Microsoft Presidio

This notebook covers different customization use cases to:

  1. Adapt Presidio to detect new types of PII entities
  2. Adapt Presidio to detect PII entities in a new language
  3. Embed new types of detection modules into Presidio, to improve the coverage of the service.

Installation

First, let's install Presidio using pip. For detailed documentation, see the installation docs.

Install from PyPI:

# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg

Getting started

The high-level process in Presidio-Analyzer is the following:

[Figure: image.png]

Load the presidio-analyzer modules. For more information, see the analyzer docs.

from typing import List
import pprint

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

Example 1: Deny-list based PII recognition

In this example, we will pass a short list of tokens which should be marked as PII if detected. First, let's define the tokens we want to treat as PII. In this case it would be a list of titles:

titles_list = ["Sir", "Ma'am", "Madam", "Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Professor"]

Second, let's create a PatternRecognizer which would scan for those titles, by passing a deny_list:

titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

At this point we can call our recognizer directly:

text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"
result = titles_recognizer.analyze(text1, entities=["TITLE"])
print(f"Result:\n {result}")
Result:
 [type: TITLE, start: 10, end: 19, score: 1.0]
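To make the mechanics of deny-list matching concrete, here is a dependency-free sketch using only Python's `re` module. It assumes the deny list is compiled into one alternation regex that must be surrounded by non-word characters, and it matches case-sensitively. The helper name `find_deny_list_terms` is hypothetical, and this is an illustration of the idea, not Presidio's actual code.

```python
import re
from typing import List, Tuple


def find_deny_list_terms(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, term) for each deny-list term found as a whole token."""
    # Escape each term so characters such as "." are matched literally.
    # Longer terms go first so the longest alternative wins on shared prefixes.
    ordered = sorted(deny_list, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Require a non-word character (or a string boundary) on both sides.
    pattern = re.compile(rf"(?:^|(?<=\W))(?:{alternatives})(?=\W|$)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


titles = ["Sir", "Ma'am", "Madam", "Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Professor"]
sample = "I suspect Professor Plum, in the Dining Room, with the candlestick"
matches = find_deny_list_terms(sample, titles)
print(matches)
```

The span found for "Professor" lines up with the start and end offsets reported by the recognizer above.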

Finally, let's add this new recognizer to the list of recognizers used by the Presidio AnalyzerEngine:

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)

When initializing the AnalyzerEngine, Presidio loads all available recognizers, including the NlpEngine used to detect entities and to extract tokens, lemmas and other linguistic features.

Let's run the analyzer with the new recognizer in place:

results = analyzer.analyze(text=text1, language="en")
print("Results:")
print(results)
Results:
[type: TITLE, start: 10, end: 19, score: 1.0, type: PERSON, start: 20, end: 24, score: 0.85]

As expected, both the name "Plum" and the title were identified as PII:

print("Identified these PII entities:")
for result in results:
    print(f"- {text1[result.start:result.end]} as {result.entity_type}")
Identified these PII entities:
- Professor as TITLE
- Plum as PERSON

Example 2: Regex based PII recognition

Another simple recognizer we can add is based on regular expressions. Let's assume we want to be extremely conservative and treat any token which contains a number as PII.

# Define the regex pattern in a Presidio `Pattern` object:
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)  # raw string keeps the backslash intact

# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(supported_entity="NUMBER", patterns = [numbers_pattern])

Testing the recognizer itself:

text2 = "I live in 510 Broad st."

numbers_result = number_recognizer.analyze(text=text2, entities=["NUMBER"])
print("Result:")
print(numbers_result)
Result:
[type: NUMBER, start: 10, end: 13, score: 0.5]

It's important to mention that recognizers are likely to make errors, both false positives and false negatives, which affect the overall performance of Presidio. Consider testing each recognizer on a representative dataset prior to integrating it into Presidio. For more info, see the best practices for developing recognizers documentation.
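As a rough illustration of testing a recognizer on labeled data, the sketch below computes span-level precision and recall. It uses a plain `re` stand-in for the digit pattern rather than a Presidio recognizer, and the labeled samples are hypothetical, hand-written examples, not a real dataset.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]


def number_spans(text: str) -> Set[Span]:
    """Stand-in detector: character spans of every run of digits."""
    return {(m.start(), m.end()) for m in re.finditer(r"\d+", text)}


def precision_recall(samples: List[Tuple[str, Set[Span]]]) -> Tuple[float, float]:
    """Compare predicted spans with expected spans using exact span matching."""
    tp = fp = fn = 0
    for text, expected in samples:
        predicted = number_spans(text)
        tp += len(predicted & expected)   # correctly detected spans
        fp += len(predicted - expected)   # detected but not labeled
        fn += len(expected - predicted)   # labeled but not detected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled samples: (text, expected NUMBER spans).
labeled_samples = [
    ("I live in 510 Broad st.", {(10, 13)}),
    ("Call me at extension 42", {(21, 23)}),
    ("Flight BA117 leaves at noon", set()),   # labeled as containing no NUMBER
    ("Room Forty Five", {(5, 15)}),           # number written as words
]

precision, recall = precision_recall(labeled_samples)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On these four samples the stand-in produces one false positive ("117") and one false negative ("Forty Five"), which is the kind of gap such a check is meant to surface.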

Example 3: Rule based logic recognizer

Taking the numbers recognizer one step further, let's say we would also like to detect numbers written as words, e.g. "Number One". We can leverage the underlying spaCy token attributes, or write our own logic to detect such entities.

Notes:

  • In this example we would create a new class, which implements EntityRecognizer, the basic recognizer in Presidio. This abstract class requires us to implement the load method and analyze method.

  • Each recognizer accepts an object of type NlpArtifacts, which holds pre-computed attributes on the input text.

A new recognizer should have this structure:

class MyRecognizer(EntityRecognizer):

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts) -> List[RecognizerResult]:
        """
        Logic for detecting a specific PII
        """
        pass

For example, detecting numbers in either numerical or alphabetic (e.g. Forty five) form:

class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7 # expected confidence level for this recognizer

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyzes text to find tokens which represent numbers (either 123 or One Two Three).
        """
        results = []

        # iterate over the spaCy tokens, and call `token.like_num`
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level
                )
                results.append(result)
        return results

new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])

Since this recognizer requires the NlpArtifacts, we would have to call it as part of the AnalyzerEngine flow:

text3 = "Roberto lives in Five 10 Broad st."
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)

numbers_results2 = analyzer.analyze(text=text3, language="en")
print("Results:")
print("\n".join([str(res) for res in numbers_results2]))
Results:
type: PERSON, start: 0, end: 7, score: 0.85
type: NUMBER, start: 17, end: 21, score: 0.7
type: NUMBER, start: 22, end: 24, score: 0.7

The analyzer was able to pick up both numeric and alphabetical numbers, alongside other types of PII entities from other recognizers (PERSON in this case).
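The offset logic in the recognizer above (start at the token's index, end at index plus token length) can be tried without loading an NLP model. The sketch below is a dependency-free approximation: a `\w+` regex stands in for the tokenizer, and a tiny hypothetical word list stands in for spaCy's `like_num`, which covers far more cases.

```python
import re
from typing import List, Tuple

# Tiny illustrative vocabulary (an assumption for this sketch only).
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
}


def find_number_like_tokens(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of tokens that look like numbers."""
    spans = []
    for match in re.finditer(r"\w+", text):
        token = match.group()
        if token.isdigit() or token.lower() in NUMBER_WORDS:
            # Same arithmetic as the recognizer: start index plus token length.
            spans.append((match.start(), match.start() + len(token)))
    return spans


spans = find_number_like_tokens("Roberto lives in Five 10 Broad st.")
print(spans)
```

The two spans correspond to "Five" and "10", the same offsets the NUMBER results report above.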

Example 4: Calling an external service for PII detection

In a similar way to example 3, we can write logic to call external services for PII detection. For a detailed example, see this part of the documentation.

This is a sample implementation of such remote recognizer.

Example 5: Supporting new languages

Two main parts in Presidio handle the text, and should be adapted if a new language is required:

  1. The NlpEngine containing the NLP model which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
  2. The different PII recognizers (EntityRecognizer objects), which should be adapted or created.

Adapting the NLP engine

As its internal NLP engine, Presidio supports both spaCy and Stanza. Make sure you download the required models from spaCy/Stanza before using them. More details here. For example, to download the Spanish medium spaCy model:

python -m spacy download es_core_news_md

In this example we will configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:

from presidio_analyzer.nlp_engine import NlpEngineProvider

#import spacy
#spacy.cli.download("es_core_news_md")

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "es", "model_name": "es_core_news_md"},
               {"lang_code": "en", "model_name": "en_core_web_lg"}],
}

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()

# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, 
    supported_languages=["en", "es"]
)

# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)

results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)
Results from Spanish request:
[]
Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]

See this documentation for more details on how to configure Presidio to support additional NLP models and languages.

Example 6: Using context words

Presidio has an internal mechanism for leveraging context words. This mechanism increases the detection confidence of a PII entity when a specific word appears before or after it.

In this example we first implement a zip code recognizer without context, and then add context to see how the confidence changes. Zip regex patterns (essentially 5 digits) are very weak, so we want the initial confidence to be low and to increase when context words are present.

# Define the regex pattern
regex = r"(\b\d{5}(?:\-\d{4})?\b)" # very weak regex pattern
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)

# Define the recognizer with the defined pattern
zipcode_recognizer = PatternRecognizer(supported_entity="US_ZIP_CODE", patterns = [zipcode_pattern])

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
results = analyzer.analyze(text="My zip code is 90210",language="en")
print(f"Result:\n {results}")
Result:
 [type: US_ZIP_CODE, start: 15, end: 20, score: 0.01]

So this is working, but it would catch any 5-digit string. This is why we set the score to 0.01. Let's use context words to increase the score:

# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(supported_entity="US_ZIP_CODE", 
                                       patterns = [zipcode_pattern],
                                       context= ["zip","zipcode"])

When creating an AnalyzerEngine, we can provide our own context enhancement logic by passing it to the context_aware_enhancer parameter. If none is passed, AnalyzerEngine creates a LemmaContextAwareEnhancer by default, which enhances the score of each matched result if its recognizer holds context words and those words are found in the context of the matched entity.

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
results = analyzer.analyze(text="My zip code is 90210",language="en")
print("Result:")
print(results)
Result:
[type: US_ZIP_CODE, start: 15, end: 20, score: 0.4]

The confidence score is now 0.4 instead of 0.01. This is because the default context similarity factor of LemmaContextAwareEnhancer is 0.35 and its default minimum score with context similarity is 0.4, so the boosted score of 0.36 is raised to the 0.4 minimum. We can change this by passing other values for the context_similarity_factor and min_score_with_context_similarity parameters of LemmaContextAwareEnhancer, for example:

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=
        LemmaContextAwareEnhancer(context_similarity_factor=0.45, min_score_with_context_similarity=0.4))
# Test
results = analyzer.analyze(text="My zip code is 90210",language="en")
print("Result:")
print(results)
Result:
[type: US_ZIP_CODE, start: 15, end: 20, score: 0.46]

The confidence score is now 0.46 because it was enhanced from 0.01 by 0.45, and the result is more than the minimum of 0.4.
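The arithmetic behind these two scores can be sketched in a few lines. The function below is an assumption that is consistent with the outputs shown above (add the similarity factor, apply the minimum, cap at 1.0); the name `enhance_score` is hypothetical and this is not the library's code.

```python
def enhance_score(
    original_score: float,
    context_similarity_factor: float = 0.35,
    min_score_with_context_similarity: float = 0.4,
    max_score: float = 1.0,
) -> float:
    """Boost a score when a context word is found near the matched entity."""
    boosted = original_score + context_similarity_factor
    # Never report less than the configured minimum once context is found.
    boosted = max(boosted, min_score_with_context_similarity)
    # Scores are capped at the maximum confidence.
    return min(boosted, max_score)


default_score = enhance_score(0.01)                                  # 0.36 raised to the 0.4 minimum
custom_score = enhance_score(0.01, context_similarity_factor=0.45)   # 0.01 + 0.45, above the minimum
print(round(default_score, 2), round(custom_score, 2))
```

With the default factor the minimum decides the final score, while with a factor of 0.45 the sum itself exceeds the minimum and is returned as is.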

Presidio also supports passing a list of outer context words at the analyzer level. This is useful if the text comes from a specific column, a specific user input, etc. Notice how the "zip" context word doesn't appear in the text but still enhances the confidence score from 0.01 to 0.4:

# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(supported_entity="US_ZIP_CODE",
                                       patterns = [zipcode_pattern],
                                       context= ["zip","zipcode"])

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
result = analyzer.analyze(text="My code is 90210",language="en", context=["zip"])
print("Result:")
print(result)
Result:
[type: US_ZIP_CODE, start: 11, end: 16, score: 0.4]

Example 7: Tracing the decision process

Presidio-analyzer's decision process exposes information on why a specific PII was detected. Such information could contain:

  • Which recognizer detected the entity
  • Which regex pattern was used
  • Interpretability mechanisms in ML models
  • Which context words improved the score
  • Confidence scores before and after each step

And more.

For more information, refer to the decision process documentation.

Let's use the decision process output to understand how the zip code value was detected:

results = analyzer.analyze(text="My zip code is 90210",language="en", return_decision_process = True)
decision_process = results[0].analysis_explanation

pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)
Decision process output:

{'original_score': 0.01,
 'pattern': '(\\b\\d{5}(?:\\-\\d{4})?\\b)',
 'pattern_name': 'zip code (weak)',
 'recognizer': 'PatternRecognizer',
 'score': 0.4,
 'score_context_improvement': 0.39,
 'supportive_context_word': 'zip',
 'textual_explanation': None,
 'validation_result': None}

When developing new recognizers, one can add information to this explanation and extend it with additional findings.

Example 8: Passing a list of words to keep

We will use the built-in recognizers, which include the URLRecognizer and the NLP-model-based EntityRecognizer, to see the default behavior when we don't specify a list of words that the detector should allow to remain in the text.

websites_list = [
    "bing.com",
    "microsoft.com"
]
text1 = "Bill's favorite website is bing.com, David's is microsoft.com"
analyzer = AnalyzerEngine()
result = analyzer.analyze(text = text1, language = 'en')
print(f"Result: \n {result}")
Result: 
 [type: PERSON, start: 0, end: 4, score: 0.85, type: URL, start: 27, end: 35, score: 0.85, type: PERSON, start: 37, end: 42, score: 0.85, type: URL, start: 48, end: 61, score: 0.85]

To specify an allow list, we pass a list of values we want to keep as a parameter to the analyze call. In the results, bing.com is no longer recognized as a PII item. microsoft.com and the named entities are still recognized, since we did not include them in the allow list.

result = analyzer.analyze(text = text1, language = 'en', allow_list= ["bing.com", "google.com"])
print(f"Result: \n {result}")
Result: 
 [type: PERSON, start: 0, end: 4, score: 0.85, type: PERSON, start: 37, end: 42, score: 0.85, type: URL, start: 48, end: 61, score: 0.85]
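The effect of the allow list can be pictured as a post-filter over the detections. The sketch below assumes exact string matching between the detected text span and the allow-list entries; the `Detection` type and `drop_allowed` helper are hypothetical names used only for illustration, not part of Presidio.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    entity_type: str
    start: int
    end: int


def drop_allowed(text: str, detections: List[Detection], allow_list: List[str]) -> List[Detection]:
    """Keep only detections whose matched text is not in the allow list."""
    allowed = set(allow_list)
    return [d for d in detections if text[d.start:d.end] not in allowed]


text = "Bill's favorite website is bing.com, David's is microsoft.com"
# Spans copied from the analyzer output shown before the allow list was applied.
detections = [
    Detection("PERSON", 0, 4),
    Detection("URL", 27, 35),
    Detection("PERSON", 37, 42),
    Detection("URL", 48, 61),
]

kept = drop_allowed(text, detections, ["bing.com", "google.com"])
for d in kept:
    print(f"- {text[d.start:d.end]} as {d.entity_type}")
```

Entries in the allow list that never appear in the text, such as google.com here, simply have no effect.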