Customizing the PII analysis process in Microsoft Presidio
This notebook covers different customization use cases to:
- Adapt Presidio to detect new types of PII entities
- Adapt Presidio to detect PII entities in a new language
- Embed new types of detection modules into Presidio, to improve the coverage of the service.
Installation
First, let's install Presidio using pip. For detailed documentation, see the installation docs.
Install from PyPI:
# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Getting started
The high-level process in Presidio-Analyzer is the following:
Load the `presidio-analyzer` modules. For more information, see the analyzer docs.
from typing import List
import pprint
from presidio_analyzer import (
    AnalyzerEngine,
    PatternRecognizer,
    EntityRecognizer,
    Pattern,
    RecognizerResult,
)
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
# Helper method to print results nicely
def print_analyzer_results(results: List[RecognizerResult], text: str):
    """Print the results in a human readable way."""
    for i, result in enumerate(results):
        print(f"Result {i}:")
        print(f" {result}, text: {text[result.start:result.end]}")
        if result.analysis_explanation is not None:
            print(f" {result.analysis_explanation.textual_explanation}")
Example 1: Deny-list based PII recognition
In this example, we will pass a short list of tokens which should be marked as PII if detected. First, let's define the tokens we want to treat as PII. In this case it would be a list of titles:
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]
Second, let's create a `PatternRecognizer` which would scan for those titles, by passing a `deny_list`:
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)
At this point we can call our recognizer directly:
text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"
result = titles_recognizer.analyze(text1, entities=["TITLE"])
print(f"Result:\n {result}")
Result:
 [type: TITLE, start: 10, end: 19, score: 1.0]
Finally, let's add this new recognizer to the list of recognizers used by the Presidio `AnalyzerEngine`:
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)
When initializing the `AnalyzerEngine`, Presidio loads all available recognizers, including the `NlpEngine` used to detect entities and to extract tokens, lemmas and other linguistic features.
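As a quick check, we can list the recognizers that were loaded for English and verify that the new `TITLE` recognizer was registered (a minimal sketch using the engine's `get_recognizers` method):
# Print the names of all recognizers loaded for English,
# which should now include the custom TITLE recognizer
for recognizer in analyzer.get_recognizers(language="en"):
    print(recognizer.name)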
Let's run the analyzer with the new recognizer in place:
results = analyzer.analyze(text=text1, language="en")
print_analyzer_results(results, text=text1)
Result 0:
 type: TITLE, start: 10, end: 19, score: 1.0, text: Professor
Result 1:
 type: PERSON, start: 20, end: 24, score: 0.85, text: Plum
Result 2:
 type: LOCATION, start: 29, end: 44, score: 0.85, text: the Dining Room
As expected, the title, the name "Plum" and the location were all identified as PII:
print("Identified these PII entities:")
for result in results:
print(f"- {text1[result.start:result.end]} as {result.entity_type}")
Identified these PII entities:
- Professor as TITLE
- Plum as PERSON
- the Dining Room as LOCATION
Example 2: Regex based PII recognition
Another simple recognizer we can add is based on regular expressions. Let's assume we want to be extremely conservative and treat any token which contains a number as PII.
# Define the regex pattern in a Presidio `Pattern` object:
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)
# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)
Testing the recognizer itself:
text2 = "I live in 510 Broad st."
numbers_result = number_recognizer.analyze(text=text2, entities=["NUMBER"])
print("Result:")
print(numbers_result)
Result:
[type: NUMBER, start: 10, end: 13, score: 0.5]
It's important to mention that each recognizer is likely to have errors, both false positives and false negatives, which would impact the overall performance of Presidio. Consider testing each recognizer on a representative dataset prior to integrating it into Presidio. For more info, see the best practices for developing recognizers documentation.
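For instance, a minimal sketch of such a test could run the recognizer over a small labeled dataset and report the mismatches; the sample texts and expected spans below are hypothetical:
# A minimal evaluation sketch; the labeled samples are hypothetical.
# Each entry holds a text and the (start, end) spans we expect to detect.
labeled_samples = [
    ("I live in 510 Broad st.", [(10, 13)]),
    ("No numbers here", []),
]

for sample_text, expected_spans in labeled_samples:
    detected = number_recognizer.analyze(text=sample_text, entities=["NUMBER"])
    detected_spans = [(res.start, res.end) for res in detected]
    false_negatives = set(expected_spans) - set(detected_spans)
    false_positives = set(detected_spans) - set(expected_spans)
    print(f"{sample_text!r}: missed={false_negatives}, extra={false_positives}")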
Example 3: Rule based logic recognizer
Taking the numbers recognizer one step further, let's say we also would like to detect numbers within words, e.g. "Number One". We can leverage the underlying spaCy token attributes, or write our own logic to detect such entities.
Notes:
- In this example we create a new class which implements `EntityRecognizer`, the base recognizer in Presidio. This abstract class requires us to implement the `load` method and the `analyze` method.
- Each recognizer accepts an object of type `NlpArtifacts`, which holds pre-computed attributes on the input text.
A new recognizer should have this structure:
class MyRecognizer(EntityRecognizer):
    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Logic for detecting a specific PII
        """
        pass
For example, detecting numbers in either numerical or alphabetic (e.g. Forty five) form:
class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7  # expected confidence level for this recognizer

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyzes text to find tokens which represent numbers (either 123 or One Two Three).
        """
        results = []

        # iterate over the spaCy tokens, and call `token.like_num`
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results
new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])
Since this recognizer requires the `NlpArtifacts`, we would have to call it as part of the `AnalyzerEngine` flow:
text3 = "Roberto lives in Five 10 Broad st."
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)
numbers_results2 = analyzer.analyze(text=text3, language="en")
print_analyzer_results(numbers_results2, text=text3)
Result 0:
 type: PERSON, start: 0, end: 7, score: 0.85, text: Roberto
Result 1:
 type: LOCATION, start: 25, end: 34, score: 0.85, text: Broad st.
Result 2:
 type: NUMBER, start: 17, end: 21, score: 0.7, text: Five
Result 3:
 type: NUMBER, start: 22, end: 24, score: 0.7, text: 10
The analyzer was able to pick up both numeric and alphabetic numbers, along with PII entities from other recognizers (PERSON in this case).
Example 4: Calling an external service for PII detection
In a similar way to example 3, we can write logic to call external services for PII detection. For a detailed example, see this part of the documentation.
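As an illustration, here is a minimal sketch of a recognizer which calls a hypothetical REST service; the endpoint URL, request payload and response schema are assumptions, and a real integration could also build on Presidio's `RemoteRecognizer` base class:
# A sketch of a recognizer calling an external PII detection service.
# The endpoint, request payload and response format are hypothetical.
import requests


class ExternalServiceRecognizer(EntityRecognizer):
    def __init__(self, service_url: str):
        self.service_url = service_url  # hypothetical REST endpoint
        super().__init__(
            supported_entities=["PERSON"], name="ExternalServiceRecognizer"
        )

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """Send the text to the external service and map its response to results."""
        response = requests.post(self.service_url, json={"text": text})
        response.raise_for_status()

        results = []
        # Assume the service returns a list of
        # {"entity": ..., "start": ..., "end": ..., "score": ...} objects.
        for detection in response.json():
            if detection["entity"] in entities:
                results.append(
                    RecognizerResult(
                        entity_type=detection["entity"],
                        start=detection["start"],
                        end=detection["end"],
                        score=detection["score"],
                    )
                )
        return results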
Example 5: Supporting new languages
Two main parts in Presidio handle the text, and should be adapted if a new language is required:
- The `NlpEngine` containing the NLP model, which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
- The different PII recognizers (`EntityRecognizer` objects), which should be adapted or created for the new language (see the sketch at the end of this example).
Adapting the NLP engine
As its internal NLP engine, Presidio supports both spaCy and Stanza. Make sure you download the required models from spaCy/Stanza prior to using them. More details here. For example, to download the Spanish medium spaCy model: `python -m spacy download es_core_news_md`
In this example we will configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:
from presidio_analyzer.nlp_engine import NlpEngineProvider
# import spacy
# spacy.cli.download("es_core_news_md")
# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()
# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)
# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)
results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)
Results from Spanish request:
[type: PERSON, start: 13, end: 19, score: 0.85]
Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]
- See this documentation for more details on how to configure Presidio to support additional NLP models and languages.
- See this sample for more implementation examples of various NLP engines and NER models.
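To cover the second part, recognizers are language specific, so PII recognizers should also be created for the new language. A minimal sketch, reusing the deny-list approach from Example 1 with an illustrative list of Spanish titles:
# A deny-list recognizer for Spanish; the titles list is illustrative
titles_recognizer_es = PatternRecognizer(
    supported_entity="TITLE",
    deny_list=["Señor", "Señora", "Doctor"],
    supported_language="es",
)
analyzer.registry.add_recognizer(titles_recognizer_es)

results = analyzer.analyze(text="Mi nombre es Señora Morris", language="es")
print(results)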
Example 6: Using context words
Presidio has an internal mechanism for leveraging context words. This mechanism increases the detection confidence of a PII entity in case a specific word appears before or after it.
In this example we will first implement a zip code recognizer without context, and then add context words to see how the confidence changes. Zip code regex patterns (essentially 5 digits) are very weak, so we want the initial confidence to be low and to increase when context words are found.
# Define the regex pattern
regex = r"(\b\d{5}(?:\-\d{4})?\b)" # very weak regex pattern
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)
# Define the recognizer with the defined pattern
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE", patterns=[zipcode_pattern]
)
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
text = "My zip code is 90210"
results = analyzer.analyze(text=text, language="en")
print_analyzer_results(results, text=text)
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.01, text: 90210
So this works, but it would catch any 5-digit string, which is why we set the score to 0.01. Let's use context words to increase the score:
# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)
When creating an `AnalyzerEngine` we can provide our own context enhancement logic by passing it to the `context_aware_enhancer` parameter. If none is passed, the `AnalyzerEngine` will create a `LemmaContextAwareEnhancer` by default, which enhances the score of each matched result if its recognizer holds context words and those words are found in the context of the matched entity.
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print_analyzer_results(results, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.4, text: 90210
The confidence score is now 0.4 instead of 0.01, because the `LemmaContextAwareEnhancer`'s default context similarity factor is 0.35 and its default minimum score with context similarity is 0.4. We can change these values by passing the `context_similarity_factor` and `min_score_with_context_similarity` parameters of the `LemmaContextAwareEnhancer`, for example:
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=LemmaContextAwareEnhancer(
        context_similarity_factor=0.45, min_score_with_context_similarity=0.4
    ),
)
# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print_analyzer_results(results, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.46, text: 90210
The confidence score is now 0.46: the original score of 0.01 was enhanced by the context similarity factor of 0.45, and the result is kept since it is above the minimum of 0.4.
Presidio also supports passing a list of outer context words at the analyzer level. This is useful if the text comes from a specific column, a specific user input, etc. Notice how the "zip" context word doesn't appear in the text, but still enhances the confidence score from 0.01 to 0.4:
# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
text = "My code is 90210"
result = analyzer.analyze(text=text, language="en", context=["zip"])
print("Result:")
print_analyzer_results(result, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 11, end: 16, score: 0.4, text: 90210
Example 7: Tracing the decision process
Presidio-analyzer's decision process exposes information on why a specific PII entity was detected. Such information could contain:
- Which recognizer detected the entity
- Which regex pattern was used
- Interpretability mechanisms in ML models
- Which context words improved the score
- Confidence scores before and after each step
And more.
For more information, refer to the decision process documentation.
Let's use the decision process output to understand how the zip code value was detected:
results = analyzer.analyze(
    text="My zip code is 90210", language="en", return_decision_process=True
)
decision_process = results[0].analysis_explanation
pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)
Decision process output:

{'original_score': 0.01,
 'pattern': '(\\b\\d{5}(?:\\-\\d{4})?\\b)',
 'pattern_name': 'zip code (weak)',
 'recognizer': 'PatternRecognizer',
 'regex_flags': regex.I|regex.M|regex.S,
 'score': 0.4,
 'score_context_improvement': 0.39,
 'supportive_context_word': 'zip',
 'textual_explanation': 'Detected by `PatternRecognizer` using pattern `zip '
                        'code (weak)`',
 'validation_result': None}
When developing new recognizers, one can add information to this explanation and extend it with additional findings.
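For instance, here is a minimal sketch which extends the `NumbersRecognizer` from Example 3 and attaches a custom textual explanation to each result using the `AnalysisExplanation` class; the explanation wording is illustrative:
from presidio_analyzer import AnalysisExplanation


class ExplainedNumbersRecognizer(NumbersRecognizer):
    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        results = super().analyze(text, entities, nlp_artifacts)
        for result in results:
            # Attach an explanation describing why the token was flagged
            result.analysis_explanation = AnalysisExplanation(
                recognizer=self.__class__.__name__,
                original_score=result.score,
                textual_explanation="Token looks like a number (spaCy `like_num`)",
            )
        return results


explained_recognizer = ExplainedNumbersRecognizer(supported_entities=["NUMBER"])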
Example 8: Passing a list of words to keep
We will use the built-in recognizers, which include the `UrlRecognizer` and the NLP model's `EntityRecognizer`, and look at the default behavior when we don't specify an allow list of words to keep in the text.
websites_list = ["bing.com", "microsoft.com"]
text1 = "Bill's favorite website is bing.com, David's is microsoft.com"
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text1, language="en", return_decision_process=True)
print_analyzer_results(results, text=text1)
Result 0:
 type: PERSON, start: 0, end: 4, score: 0.85, text: Bill
 Identified as PERSON by Spacy's Named Entity Recognition
Result 1:
 type: URL, start: 27, end: 35, score: 0.85, text: bing.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`
Result 2:
 type: PERSON, start: 37, end: 42, score: 0.85, text: David
 Identified as PERSON by Spacy's Named Entity Recognition
Result 3:
 type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`
To specify an allow list, we simply pass the list of values we want to keep as a parameter to the call to `analyze`. In the results below, `bing.com` is no longer recognized as a PII item, since we included it in the allow list, while `microsoft.com` and the named entities are still recognized.
results = analyzer.analyze(
    text=text1,
    language="en",
    allow_list=["bing.com", "google.com"],
    return_decision_process=True,
)
print_analyzer_results(results, text=text1)
Result 0:
 type: PERSON, start: 0, end: 4, score: 0.85, text: Bill
 Identified as PERSON by Spacy's Named Entity Recognition
Result 1:
 type: PERSON, start: 37, end: 42, score: 0.85, text: David
 Identified as PERSON by Spacy's Named Entity Recognition
Result 2:
 type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`