Customizing the PII analysis process in Microsoft Presidio
This notebook covers different customization use cases to:
- Adapt Presidio to detect new types of PII entities
- Adapt Presidio to detect PII entities in a new language
- Embed new types of detection modules into Presidio, to improve the coverage of the service.
Installation
First, let's install Presidio using pip. For detailed documentation, see the installation docs.
Install from PyPI:
# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Getting started
The high-level process in Presidio-Analyzer is the following:
Load the `presidio-analyzer` modules. For more information, see the analyzer docs.
from typing import List
import pprint
from presidio_analyzer import (
    AnalyzerEngine,
    PatternRecognizer,
    EntityRecognizer,
    Pattern,
    RecognizerResult,
)
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
# Helper method to print results nicely
def print_analyzer_results(results: List[RecognizerResult], text: str):
    """Print the results in a human readable way."""
    for i, result in enumerate(results):
        print(f"Result {i}:")
        print(f" {result}, text: {text[result.start:result.end]}")
        if result.analysis_explanation is not None:
            print(f" {result.analysis_explanation.textual_explanation}")
Example 1: Deny-list based PII recognition
In this example, we will pass a short list of tokens which should be marked as PII if detected. First, let's define the tokens we want to treat as PII. In this case it would be a list of titles:
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]
Second, let's create a `PatternRecognizer` which would scan for those titles, by passing a `deny_list`:
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)
At this point we can call our recognizer directly:
text1 = "I suspect Professor Plum, in the Dining Room, with the candlestick"
result = titles_recognizer.analyze(text1, entities=["TITLE"])
print(f"Result:\n {result}")
Result:
 [type: TITLE, start: 10, end: 19, score: 1.0]
Finally, let's add this new recognizer to the list of recognizers used by the Presidio `AnalyzerEngine`:
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)
When initializing the `AnalyzerEngine`, Presidio loads all available recognizers, including the `NlpEngine` used to detect entities and to extract tokens, lemmas and other linguistic features.
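As a quick check, we can list the recognizers that were loaded for English and verify that the new `TITLE` recognizer was registered (a minimal sketch using the engine's `get_recognizers` method):
# Print the names of all recognizers loaded for English,
# which should now include the custom TITLE recognizer
for recognizer in analyzer.get_recognizers(language="en"):
    print(recognizer.name)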
Let's run the analyzer with the new recognizer in place:
results = analyzer.analyze(text=text1, language="en")
print_analyzer_results(results, text=text1)
Result 0:
 type: TITLE, start: 10, end: 19, score: 1.0, text: Professor
Result 1:
 type: PERSON, start: 20, end: 24, score: 0.85, text: Plum
Result 2:
 type: LOCATION, start: 29, end: 44, score: 0.85, text: the Dining Room
As expected, the title, the name "Plum" and the location were all identified as PII:
print("Identified these PII entities:")
for result in results:
print(f"- {text1[result.start:result.end]} as {result.entity_type}")
Identified these PII entities:
- Professor as TITLE
- Plum as PERSON
- the Dining Room as LOCATION
Example 2: Regex based PII recognition
Another simple recognizer we can add is based on regular expressions. Let's assume we want to be extremely conservative and treat any token which contains a number as PII.
# Define the regex pattern in a Presidio `Pattern` object:
numbers_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)
# Define the recognizer with one or more patterns
number_recognizer = PatternRecognizer(
    supported_entity="NUMBER", patterns=[numbers_pattern]
)
Testing the recognizer itself:
text2 = "I live in 510 Broad st."
numbers_result = number_recognizer.analyze(text=text2, entities=["NUMBER"])
print("Result:")
print(numbers_result)
Result:
[type: NUMBER, start: 10, end: 13, score: 0.5]
It's important to mention that each recognizer is likely to have errors, both false positives and false negatives, which would impact the overall performance of Presidio. Consider testing each recognizer on a representative dataset prior to integrating it into Presidio. For more info, see the best practices for developing recognizers documentation.
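For instance, a minimal sketch of such a test could run the recognizer over a small labeled dataset and report the mismatches; the sample texts and expected spans below are hypothetical:
# A minimal evaluation sketch; the labeled samples are hypothetical.
# Each entry holds a text and the (start, end) spans we expect to detect.
labeled_samples = [
    ("I live in 510 Broad st.", [(10, 13)]),
    ("No numbers here", []),
]

for sample_text, expected_spans in labeled_samples:
    detected = number_recognizer.analyze(text=sample_text, entities=["NUMBER"])
    detected_spans = [(res.start, res.end) for res in detected]
    false_negatives = set(expected_spans) - set(detected_spans)
    false_positives = set(detected_spans) - set(expected_spans)
    print(f"{sample_text!r}: missed={false_negatives}, extra={false_positives}")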
Example 3: Rule based logic recognizer
Taking the numbers recognizer one step further, let's say we also would like to detect numbers within words, e.g. "Number One". We can leverage the underlying spaCy token attributes, or write our own logic to detect such entities.
Notes:
- In this example we create a new class which implements `EntityRecognizer`, the base recognizer in Presidio. This abstract class requires us to implement the `load` method and the `analyze` method.
- Each recognizer accepts an object of type `NlpArtifacts`, which holds pre-computed attributes on the input text.
A new recognizer should have this structure:
class MyRecognizer(EntityRecognizer):
    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Logic for detecting a specific PII
        """
        pass
For example, detecting numbers in either numerical or alphabetic (e.g. Forty five) form:
class NumbersRecognizer(EntityRecognizer):

    expected_confidence_level = 0.7  # expected confidence level for this recognizer

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """
        Analyzes text to find tokens which represent numbers (either 123 or One Two Three).
        """
        results = []

        # iterate over the spaCy tokens, and call `token.like_num`
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                    entity_type="NUMBER",
                    start=token.idx,
                    end=token.idx + len(token),
                    score=self.expected_confidence_level,
                )
                results.append(result)
        return results
new_numbers_recognizer = NumbersRecognizer(supported_entities=["NUMBER"])
Since this recognizer requires the `NlpArtifacts`, we would have to call it as part of the `AnalyzerEngine` flow:
text3 = "Roberto lives in Five 10 Broad st."
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(new_numbers_recognizer)
numbers_results2 = analyzer.analyze(text=text3, language="en")
print_analyzer_results(numbers_results2, text=text3)
Result 0:
 type: PERSON, start: 0, end: 7, score: 0.85, text: Roberto
Result 1:
 type: LOCATION, start: 25, end: 34, score: 0.85, text: Broad st.
Result 2:
 type: NUMBER, start: 17, end: 21, score: 0.7, text: Five
Result 3:
 type: NUMBER, start: 22, end: 24, score: 0.7, text: 10
The analyzer was able to pick up both numeric and alphabetic numbers, along with PII entities from other recognizers (PERSON in this case).
Example 4: Calling an external service for PII detection
In a similar way to example 3, we can write logic to call external services for PII detection. For a detailed example, see this part of the documentation.
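As an illustration, here is a minimal sketch of a recognizer which calls a hypothetical REST service; the endpoint URL, request payload and response schema are assumptions, and a real integration could also build on Presidio's `RemoteRecognizer` base class:
# A sketch of a recognizer calling an external PII detection service.
# The endpoint, request payload and response format are hypothetical.
import requests


class ExternalServiceRecognizer(EntityRecognizer):
    def __init__(self, service_url: str):
        self.service_url = service_url  # hypothetical REST endpoint
        super().__init__(
            supported_entities=["PERSON"], name="ExternalServiceRecognizer"
        )

    def load(self) -> None:
        """No loading is required."""
        pass

    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        """Send the text to the external service and map its response to results."""
        response = requests.post(self.service_url, json={"text": text})
        response.raise_for_status()

        results = []
        # Assume the service returns a list of
        # {"entity": ..., "start": ..., "end": ..., "score": ...} objects.
        for detection in response.json():
            if detection["entity"] in entities:
                results.append(
                    RecognizerResult(
                        entity_type=detection["entity"],
                        start=detection["start"],
                        end=detection["end"],
                        score=detection["score"],
                    )
                )
        return results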
Example 5: Supporting new languages
Two main parts in Presidio handle the text, and should be adapted if a new language is required:
- The `NlpEngine` containing the NLP model, which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
- The different PII recognizers (`EntityRecognizer` objects), which should be adapted or created for the new language (see the sketch at the end of this example).
Adapting the NLP engine
As its internal NLP engine, Presidio supports both spaCy and Stanza. Make sure you download the required models from spaCy/Stanza prior to using them. More details here. For example, to download the Spanish medium spaCy model: `python -m spacy download es_core_news_md`
In this example we will configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:
from presidio_analyzer.nlp_engine import NlpEngineProvider
# import spacy
# spacy.cli.download("es_core_news_md")
# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()
# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)
# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)
results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)
Results from Spanish request:
[type: PERSON, start: 13, end: 19, score: 0.85]
Results from English request:
[type: PERSON, start: 11, end: 17, score: 0.85]
- See this documentation for more details on how to configure Presidio to support additional NLP models and languages.
- See this sample for more implementation examples of various NLP engines and NER models.
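To cover the second part, recognizers are language specific, so PII recognizers should also be created for the new language. A minimal sketch, reusing the deny-list approach from Example 1 with an illustrative list of Spanish titles:
# A deny-list recognizer for Spanish; the titles list is illustrative
titles_recognizer_es = PatternRecognizer(
    supported_entity="TITLE",
    deny_list=["Señor", "Señora", "Doctor"],
    supported_language="es",
)
analyzer.registry.add_recognizer(titles_recognizer_es)

results = analyzer.analyze(text="Mi nombre es Señora Morris", language="es")
print(results)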
Example 6: Using context words
Presidio has an internal mechanism for leveraging context words. This mechanism increases the detection confidence of a PII entity in case a specific word appears before or after it.
In this example we will first implement a zip code recognizer without context, and then add context words to see how the confidence changes. Zip code regex patterns (essentially 5 digits) are very weak, so we want the initial confidence to be low and to increase when context words are found.
# Define the regex pattern
regex = r"(\b\d{5}(?:\-\d{4})?\b)" # very weak regex pattern
zipcode_pattern = Pattern(name="zip code (weak)", regex=regex, score=0.01)
# Define the recognizer with the defined pattern
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE", patterns=[zipcode_pattern]
)
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
text = "My zip code is 90210"
results = analyzer.analyze(text=text, language="en")
print_analyzer_results(results, text=text)
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.01, text: 90210
So this works, but it would catch any 5-digit string, which is why we set the score to 0.01. Let's use context words to increase the score:
# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)
When creating an `AnalyzerEngine` we can provide our own context enhancement logic by passing it to the `context_aware_enhancer` parameter. If none is passed, the `AnalyzerEngine` will create a `LemmaContextAwareEnhancer` by default, which enhances the score of each matched result if its recognizer holds context words and those words are found in the context of the matched entity.
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print_analyzer_results(results, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.4, text: 90210
The confidence score is now 0.4 instead of 0.01, because the `LemmaContextAwareEnhancer`'s default context similarity factor is 0.35 and its default minimum score with context similarity is 0.4. We can change these values by passing the `context_similarity_factor` and `min_score_with_context_similarity` parameters of the `LemmaContextAwareEnhancer`, for example:
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=LemmaContextAwareEnhancer(
        context_similarity_factor=0.45, min_score_with_context_similarity=0.4
    ),
)
# Test
results = analyzer.analyze(text="My zip code is 90210", language="en")
print("Result:")
print_analyzer_results(results, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 15, end: 20, score: 0.46, text: 90210
The confidence score is now 0.46: the original score of 0.01 was enhanced by the context similarity factor of 0.45, and the result is kept since it is above the minimum of 0.4.
Presidio also supports passing a list of outer context words at the analyzer level. This is useful if the text comes from a specific column, a specific user input, etc. Notice how the "zip" context word doesn't appear in the text, but still enhances the confidence score from 0.01 to 0.4:
# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(
    supported_entity="US_ZIP_CODE",
    patterns=[zipcode_pattern],
    context=["zip", "zipcode"],
)
registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)
# Test
text = "My code is 90210"
result = analyzer.analyze(text=text, language="en", context=["zip"])
print("Result:")
print_analyzer_results(result, text=text)
Result:
Result 0:
 type: US_ZIP_CODE, start: 11, end: 16, score: 0.4, text: 90210
Example 7: Tracing the decision process
Presidio-analyzer's decision process exposes information on why a specific PII entity was detected. Such information could contain:
- Which recognizer detected the entity
- Which regex pattern was used
- Interpretability mechanisms in ML models
- Which context words improved the score
- Confidence scores before and after each step
And more.
For more information, refer to the decision process documentation.
Let's use the decision process output to understand how the zip code value was detected:
results = analyzer.analyze(
    text="My zip code is 90210", language="en", return_decision_process=True
)
decision_process = results[0].analysis_explanation
pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)
Decision process output:

{'original_score': 0.01,
 'pattern': '(\\b\\d{5}(?:\\-\\d{4})?\\b)',
 'pattern_name': 'zip code (weak)',
 'recognizer': 'PatternRecognizer',
 'regex_flags': regex.I|regex.M|regex.S,
 'score': 0.4,
 'score_context_improvement': 0.39,
 'supportive_context_word': 'zip',
 'textual_explanation': 'Detected by `PatternRecognizer` using pattern `zip '
                        'code (weak)`',
 'validation_result': None}
When developing new recognizers, one can add information to this explanation and extend it with additional findings.
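For instance, here is a minimal sketch which extends the `NumbersRecognizer` from Example 3 and attaches a custom textual explanation to each result using the `AnalysisExplanation` class; the explanation wording is illustrative:
from presidio_analyzer import AnalysisExplanation


class ExplainedNumbersRecognizer(NumbersRecognizer):
    def analyze(
        self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        results = super().analyze(text, entities, nlp_artifacts)
        for result in results:
            # Attach an explanation describing why the token was flagged
            result.analysis_explanation = AnalysisExplanation(
                recognizer=self.__class__.__name__,
                original_score=result.score,
                textual_explanation="Token looks like a number (spaCy `like_num`)",
            )
        return results


explained_recognizer = ExplainedNumbersRecognizer(supported_entities=["NUMBER"])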
Example 8: Passing a list of words to keep
We will use the built-in recognizers, which include the `UrlRecognizer` and the NLP model's `EntityRecognizer`, and look at the default behavior when we don't specify an allow list of words to keep in the text.
websites_list = ["bing.com", "microsoft.com"]
text1 = "Bill's favorite website is bing.com, David's is microsoft.com"
analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text1, language="en", return_decision_process=True)
print_analyzer_results(results, text=text1)
Result 0:
 type: PERSON, start: 0, end: 4, score: 0.85, text: Bill
 Identified as PERSON by Spacy's Named Entity Recognition
Result 1:
 type: URL, start: 27, end: 35, score: 0.85, text: bing.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`
Result 2:
 type: PERSON, start: 37, end: 42, score: 0.85, text: David
 Identified as PERSON by Spacy's Named Entity Recognition
Result 3:
 type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`
To specify an allow list, we simply pass the list of values we want to keep as a parameter to the call to `analyze`. In the results below, `bing.com` is no longer recognized as a PII item, since we included it in the allow list, while `microsoft.com` and the named entities are still recognized.
results = analyzer.analyze(
    text=text1,
    language="en",
    allow_list=["bing.com", "google.com"],
    return_decision_process=True,
)
print_analyzer_results(results, text=text1)
Result 0:
 type: PERSON, start: 0, end: 4, score: 0.85, text: Bill
 Identified as PERSON by Spacy's Named Entity Recognition
Result 1:
 type: PERSON, start: 37, end: 42, score: 0.85, text: David
 Identified as PERSON by Spacy's Named Entity Recognition
Result 2:
 type: URL, start: 48, end: 61, score: 0.85, text: microsoft.com
 Detected by `UrlRecognizer` using pattern `Non schema URL`