# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Anonymizing known values
In addition to statistical and pattern-based approaches, Presidio also supports identifying and anonymizing known values through the deny-list mechanism. In this example we'll cover two cases:
- The values are known a priori (e.g., we have a list of names)
- The values are only known in the context of a request (e.g., the name of a person appears as the filename)
Example 1: values are known a priori
If you have a list of potential PII values, you can create a recognizer that detects them every time they appear in the text. For this case, we create a deny-list based recognizer and add it to Presidio's RecognizerRegistry:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
# Get known values as a deny-list
known_names_list = ["George", "Abraham", "Theodore", "Bill", "Barack", "Donald", "Joe"]
# Create a PatternRecognizer for the deny list
deny_list_recognizer = PatternRecognizer(supported_entity="PRESIDENT_FIRST_NAME", deny_list=known_names_list)
registry = RecognizerRegistry()
registry.add_recognizer(deny_list_recognizer)
analyzer = AnalyzerEngine(registry=registry)
anonymizer = AnonymizerEngine()
text="George Washington was the first US president"
results = analyzer.analyze(text=text, language="en")
print("Identified entities:")
print(results)
print("")
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(f"Anonymized text:\n{anonymized.text}")
Identified entities:
[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0]

Anonymized text:
<PRESIDENT_FIRST_NAME> Washington was the first US president
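To make the `start: 0, end: 6` result more concrete, a deny list can be thought of as a whole-word search for any of the listed values. The snippet below is a minimal standard-library sketch of that idea, not Presidio's actual implementation; the helper name `find_deny_list_matches` is hypothetical and exists only for illustration.

```python
import re

def find_deny_list_matches(text, deny_list):
    """Return (start, end, value) for each whole-word occurrence of a deny-list value.

    Illustrative helper only; this is not part of Presidio.
    """
    # Escape each value so characters such as "." are matched literally.
    alternation = "|".join(re.escape(value) for value in deny_list)
    # \b keeps "Bill" from matching inside a longer word such as "Billing".
    pattern = re.compile(rf"\b(?:{alternation})\b")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

known_names_list = ["George", "Abraham", "Theodore", "Bill", "Barack", "Donald", "Joe"]
matches = find_deny_list_matches(
    "George Washington was the first US president", known_names_list
)
print(matches)  # [(0, 6, 'George')]
```

The offsets agree with the span reported by the analyzer: characters 0 to 6 cover "George".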
Example 2: values are only known in the context of the request
In some cases, we know the potential PII values only in the context of a specific text. Examples could be:
- Detect PII entities in free text columns in tabular databases, where other columns have entity values we can leverage
- Detect PII in a file having the filename or other metadata holding potential PII values (e.g. Smith.csv)
- Anonymize medical images which contain metadata
- Anonymize financial forms when the actual PII data is known
In this case we can use a feature called ad-hoc recognizers, which are passed to the analyzer for a single request. Here's a simple example:
person1 = {"name": "Martin Smith",
"special_value":"145A",
"free_text": "Martin Smith, id 145A, likes playing basketball"}
person2 = {"name":"Deb Schmidt",
"special_value":"256B",
"free_text": "Deb Schmidt, id 256B likes playing soccer"}
person3 = {"name":"R2D2",
"special_value":"X1T2",
"free_text": "X1T2 is R2D2's special value"}
dataset = [person1, person2, person3]
dataset
[{'name': 'Martin Smith', 'special_value': '145A', 'free_text': 'Martin Smith, id 145A, likes playing basketball'}, {'name': 'Deb Schmidt', 'special_value': '256B', 'free_text': 'Deb Schmidt, id 256B likes playing soccer'}, {'name': 'R2D2', 'special_value': 'X1T2', 'free_text': "X1T2 is R2D2's special value"}]
We're interested in anonymizing the free text using the values contained in name and special_value. Since these values are only available in the context of one record, we use the ad-hoc recognizer capability in Presidio instead of a generic deny-list PatternRecognizer added to Presidio's RecognizerRegistry.
# Go over dataset
for person in dataset:
    # Get the different known values
    name = person["name"]
    special_val = person["special_value"]

    # Get the free text to anonymize
    free_text = person["free_text"]

    # Create ad-hoc recognizers that exist only for this record
    ad_hoc_name_recognizer = PatternRecognizer(supported_entity="name", deny_list=[name])
    ad_hoc_id_recognizer = PatternRecognizer(supported_entity="special_value", deny_list=[special_val])

    # Run the analyze method with ad_hoc_recognizers
    analyzer_results = analyzer.analyze(
        text=free_text,
        language="en",
        ad_hoc_recognizers=[ad_hoc_name_recognizer, ad_hoc_id_recognizer],
    )

    # Anonymize results
    anonymized = anonymizer.anonymize(text=free_text, analyzer_results=analyzer_results)
    print(anonymized.text)

    # Store output in original dataset
    person["anonymized_free_text"] = anonymized.text
<name>, id <special_value>, likes playing basketball
<name>, id <special_value> likes playing soccer
<special_value> is <name>'s special value
# Dataset now contains the anonymized version as well
dataset
[{'name': 'Martin Smith', 'special_value': '145A', 'free_text': 'Martin Smith, id 145A, likes playing basketball', 'anonymized_free_text': '<name>, id <special_value>, likes playing basketball'}, {'name': 'Deb Schmidt', 'special_value': '256B', 'free_text': 'Deb Schmidt, id 256B likes playing soccer', 'anonymized_free_text': '<name>, id <special_value> likes playing soccer'}, {'name': 'R2D2', 'special_value': 'X1T2', 'free_text': "X1T2 is R2D2's special value", 'anonymized_free_text': "<special_value> is <name>'s special value"}]
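The replacement step in this example can be pictured as substituting each detected span with a placeholder named after its entity type. The following is a standard-library sketch under simplifying assumptions (spans do not overlap, and only the first occurrence of each value is located); the helper `replace_spans` is hypothetical and does not reproduce how Presidio's AnonymizerEngine works internally.

```python
def replace_spans(text, spans):
    """Replace each (start, end, entity) span with an <entity> placeholder.

    Illustrative helper only; assumes the spans do not overlap.
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, entity in sorted(spans, reverse=True):
        text = text[:start] + f"<{entity}>" + text[end:]
    return text

record = {
    "name": "Martin Smith",
    "special_value": "145A",
    "free_text": "Martin Smith, id 145A, likes playing basketball",
}

# Locate the first occurrence of each known value in the free text.
spans = []
for field in ("name", "special_value"):
    start = record["free_text"].find(record[field])
    if start != -1:
        spans.append((start, start + len(record[field]), field))

anonymized_text = replace_spans(record["free_text"], spans)
print(anonymized_text)  # <name>, id <special_value>, likes playing basketball
```

Replacing from right to left matters because each substitution changes the length of the text; going left to right would shift the offsets of the spans that follow.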
Note that in these examples we're only using the custom recognizers we created. To combine our custom recognizers with Presidio's predefined ones, call registry.load_predefined_recognizers() before adding them:
registry = RecognizerRegistry()
# Load the predefined recognizers
registry.load_predefined_recognizers()
# Add our custom one
registry.add_recognizer(deny_list_recognizer)
# Initialize AnalyzerEngine
analyzer = AnalyzerEngine(registry=registry)
analyzer.analyze("George Washington was the first president of the United States", language="en")
[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0, type: PERSON, start: 0, end: 17, score: 0.85, type: LOCATION, start: 45, end: 62, score: 0.85]
Since George is also part of a person's name, it was detected twice: once by the predefined PERSON recognizer, whose span covers "George Washington", and once as our custom PRESIDENT_FIRST_NAME entity.
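When predefined and custom recognizers run together, it can help to check which detections cover the same text. The sketch below uses only the standard library and works on (entity_type, start, end) triples copied from the analyzer output above; the helper `overlapping_pairs` is hypothetical and is not a Presidio function.

```python
from itertools import combinations

# (entity_type, start, end) triples copied from the analyzer output above.
results = [
    ("PRESIDENT_FIRST_NAME", 0, 6),
    ("PERSON", 0, 17),
    ("LOCATION", 45, 62),
]

def overlapping_pairs(results):
    """Return pairs of entity types whose character spans overlap.

    Illustrative helper only; two half-open spans overlap when each one
    starts before the other ends.
    """
    return [
        (a[0], b[0])
        for a, b in combinations(results, 2)
        if a[1] < b[2] and b[1] < a[2]
    ]

pairs = overlapping_pairs(results)
print(pairs)  # [('PRESIDENT_FIRST_NAME', 'PERSON')]
```

Here only the first two results overlap, which matches the double detection of "George" described above.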
Read more:
- For more info on Presidio Analyzer, see this documentation
- For more info on Presidio Anonymizer, see this documentation
- To further customize the anonymization type, see this tutorial