In [ ]:

# Install Presidio
!pip install presidio_analyzer presidio_anonymizer

# Download the spaCy English model used by Presidio's default NLP engine
!python -m spacy download en_core_web_lg

Path to notebook: https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/Anonymizing%20known%20values.ipynb

Anonymizing known values

In addition to statistical and pattern-based approaches, Presidio also supports identifying and anonymizing known values through the deny-list mechanism. In this example we'll cover two cases:

The values are known a priori (e.g., we have a list of names)
The values are only known in the context of a request (e.g., we have the name of a person as the filename)

Example 1: values are known a priori

If you have a list of potential PII values, you can create a recognizer that detects them every time they appear in the text. For this case, we create a deny-list based recognizer and add it to Presidio's RecognizerRegistry:

In [2]:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine

In [3]:

# Get known values as a deny-list
known_names_list = ["George", "Abraham", "Theodore", "Bill", "Barack", "Donald", "Joe"]

In [4]:

# Create a PatternRecognizer for the deny list
deny_list_recognizer = PatternRecognizer(supported_entity="PRESIDENT_FIRST_NAME", deny_list=known_names_list)

In [5]:

registry = RecognizerRegistry()
registry.add_recognizer(deny_list_recognizer)

analyzer = AnalyzerEngine(registry=registry)

anonymizer = AnonymizerEngine()

In [6]:

text="George Washington was the first US president"

results = analyzer.analyze(text=text, language="en")

print("Identified entities:")
print(results)
print("")
anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
print(f"Anonymized text:\n{anonymized.text}")

Identified entities:
[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0]

Anonymized text:
<PRESIDENT_FIRST_NAME> Washington was the first US president
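
To build intuition for what a deny-list recognizer does, the following standard-library sketch finds whole-word occurrences of known values with a regular expression and reports them as spans. It is an illustration only, not Presidio's implementation; `find_deny_list_matches` and its tuple return format are hypothetical names chosen for this example.

```python
import re


def find_deny_list_matches(text, deny_list, entity_type):
    """Return (entity_type, start, end) for each whole-word deny-list hit.

    Hypothetical helper for illustration; not part of Presidio.
    """
    if not deny_list:
        return []
    # Escape each value so characters like "." are matched literally,
    # and try longer values first so a longer entry wins over its prefix.
    alternatives = "|".join(
        re.escape(value) for value in sorted(deny_list, key=len, reverse=True)
    )
    # The lookarounds require a non-word character (or the text boundary)
    # on both sides, so "George" does not match inside "Georgetown".
    pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return [(entity_type, m.start(), m.end()) for m in pattern.finditer(text)]


known_names = ["George", "Abraham", "Theodore", "Bill", "Barack", "Donald", "Joe"]
matches = find_deny_list_matches(
    "George Washington was the first US president",
    known_names,
    "PRESIDENT_FIRST_NAME",
)
print(matches)
```

The sketch reports the same span as the analyzer output above (start 0, end 6), but it has no notion of score, language, or context, which the real recognizer handles.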

Example 2: values are only known in the context of the request

In some cases, we know the potential PII values only in the context of a specific text. Examples could be:

Detect PII entities in free text columns in tabular databases, where other columns have entity values we can leverage
Detect PII in a file having the filename or other metadata holding potential PII values (e.g. Smith.csv)
Anonymize medical images which contain metadata
Anonymize financial forms when the actual PII data is known

In these cases, we can use Presidio's ad-hoc recognizers, which are passed to the analyzer for a single request. Here's a simple example:

In [7]:

person1 = {"name": "Martin Smith", 
           "special_value":"145A", 
           "free_text": "Martin Smith, id 145A, likes playing basketball"}
person2 = {"name":"Deb Schmidt", 
           "special_value":"256B", 
           "free_text": "Deb Schmidt, id 256B likes playing soccer"}
person3 = {"name":"R2D2", 
           "special_value":"X1T2", 
           "free_text": "X1T2 is R2D2's special value"}

dataset = [person1, person2, person3]
dataset

Out[7]:

[{'name': 'Martin Smith',
  'special_value': '145A',
  'free_text': 'Martin Smith, id 145A, likes playing basketball'},
 {'name': 'Deb Schmidt',
  'special_value': '256B',
  'free_text': 'Deb Schmidt, id 256B likes playing soccer'},
 {'name': 'R2D2',
  'special_value': 'X1T2',
  'free_text': "X1T2 is R2D2's special value"}]

We're interested in anonymizing the free text using the values contained in name and special_value. Since these values are only available in the context of one record, we use the ad-hoc recognizer capability in Presidio, instead of a generic deny-list PatternRecognizer added to Presidio's RecognizerRegistry.

In [8]:

# Go over dataset
for person in dataset:
    
    # Get the different known values
    name = person['name']
    special_val = person['special_value']
    
    # Get the free text to anonymize
    free_text = person['free_text']
    
    # Create ad-hoc recognizers
    ad_hoc_name_recognizer = PatternRecognizer(supported_entity="name", deny_list = [name])
    ad_hoc_id_recognizer = PatternRecognizer(supported_entity="special_value", deny_list = [special_val])
    
    # Run the analyze method with ad_hoc_recognizers:
    analyzer_results = analyzer.analyze(text=free_text, 
                                        language="en", 
                                        ad_hoc_recognizers=[ad_hoc_name_recognizer, ad_hoc_id_recognizer])
    
    # Anonymize results
    anonymized = anonymizer.anonymize(text=free_text, analyzer_results=analyzer_results)
    print(anonymized.text)
    
    # Store output in original dataset
    person["anonymized_free_text"] = anonymized.text

<name>, id <special_value>, likes playing basketball
<name>, id <special_value> likes playing soccer
<special_value> is <name>'s special value

In [9]:

# Dataset now contains the anonymized version as well
dataset

Out[9]:

[{'name': 'Martin Smith',
  'special_value': '145A',
  'free_text': 'Martin Smith, id 145A, likes playing basketball',
  'anonymized_free_text': '<name>, id <special_value>, likes playing basketball'},
 {'name': 'Deb Schmidt',
  'special_value': '256B',
  'free_text': 'Deb Schmidt, id 256B likes playing soccer',
  'anonymized_free_text': '<name>, id <special_value> likes playing soccer'},
 {'name': 'R2D2',
  'special_value': 'X1T2',
  'free_text': "X1T2 is R2D2's special value",
  'anonymized_free_text': "<special_value> is <name>'s special value"}]
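
The per-record idea can also be sketched without Presidio, which may help when reasoning about what the ad-hoc recognizers contribute. The helper below builds one regular expression per known field value and replaces each hit with the field name in angle brackets. `anonymize_with_record_values` is a hypothetical name, and this simplified version is an assumption-laden illustration rather than a substitute for the analyzer and anonymizer engines.

```python
import re


def anonymize_with_record_values(record, known_fields, text_field):
    """Replace each known field value found in record[text_field] with <field>.

    Hypothetical helper for illustration; not part of Presidio.
    """
    text = record[text_field]
    for field in known_fields:
        value = record[field]
        # Match the value literally, and only when it is not glued to
        # other word characters on either side.
        pattern = re.compile(rf"(?<!\w){re.escape(value)}(?!\w)")
        # A callable replacement avoids backslash interpretation in the placeholder.
        text = pattern.sub(lambda _match, f=field: f"<{f}>", text)
    return text


records = [
    {"name": "Martin Smith", "special_value": "145A",
     "free_text": "Martin Smith, id 145A, likes playing basketball"},
    {"name": "R2D2", "special_value": "X1T2",
     "free_text": "X1T2 is R2D2's special value"},
]
for record in records:
    print(anonymize_with_record_values(record, ["name", "special_value"], "free_text"))
```

Because the replacements run one field after another, a value that happens to equal a placeholder's field name could be replaced inside an earlier placeholder. Collecting spans first and replacing them in a single pass avoids that edge case.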

Note that in these examples we're only using the custom recognizers we created. We can also combine our custom recognizers with Presidio's existing recognizers by calling registry.load_predefined_recognizers():

In [10]:

registry = RecognizerRegistry()

# Load Presidio's predefined recognizers
registry.load_predefined_recognizers()

# Add our custom one
registry.add_recognizer(deny_list_recognizer)

# Initialize AnalyzerEngine
analyzer = AnalyzerEngine(registry=registry)

In [11]:

analyzer.analyze("George Washington was the first president of the United States", language="en")

Out[11]:

[type: PRESIDENT_FIRST_NAME, start: 0, end: 6, score: 1.0,
 type: PERSON, start: 0, end: 17, score: 0.85,
 type: LOCATION, start: 45, end: 62, score: 0.85]

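Since George is also a name, it was detected twice: once as the custom PRESIDENT_FIRST_NAME entity (characters 0 to 6), and once as part of the PERSON entity that spans "George Washington" (characters 0 to 17).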
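
When the same text is covered by more than one entity, downstream code may want a single label per span. The sketch below shows one possible policy: keep the longest spans and drop anything fully contained in a span that was already kept. The `Span` tuple and `drop_contained_spans` function are hypothetical and use only the standard library; this is not a description of how Presidio itself resolves overlaps.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int
    score: float


def drop_contained_spans(spans: List[Span]) -> List[Span]:
    """Keep longer spans first; drop spans fully inside a kept span."""
    kept: List[Span] = []
    # Visit the longest spans first; ties are broken by higher score.
    for span in sorted(spans, key=lambda s: (-(s.end - s.start), -s.score)):
        contained = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not contained:
            kept.append(span)
    # Return the surviving spans in reading order.
    return sorted(kept, key=lambda s: s.start)


# Values copied from the analyzer output shown above.
spans = [
    Span("PRESIDENT_FIRST_NAME", 0, 6, 1.0),
    Span("PERSON", 0, 17, 0.85),
    Span("LOCATION", 45, 62, 0.85),
]
resolved = drop_contained_spans(spans)
print([s.entity_type for s in resolved])
```

With this policy the custom PRESIDENT_FIRST_NAME span is dropped because the PERSON span contains it. A policy that prefers the highest score would keep the custom entity instead, so the right choice depends on which label matters more for your use case.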
