# Install Presidio and download the spaCy English model
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
import json
from pprint import pprint
Analyze Text for PII Entities
Use Presidio Analyzer to identify PII entities in a text. The analyzer ships with predefined entity recognizers and also lets you create custom recognizers.
The following code sample will:
- Set up the Analyzer engine: load the NLP module (spaCy model by default) and other PII recognizers
- Call the analyzer to get results for the "PHONE_NUMBER" entity type
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER"], language='en')
print(analyzer_results)
Create Custom PII Entity Recognizers
Presidio Analyzer comes with a pre-defined set of entity recognizers. It also allows adding new recognizers without changing the analyzer base code, by creating custom recognizers.
In the following example, we will create two new recognizers of type PatternRecognizer to identify titles and pronouns in the analyzed text.
A PatternRecognizer is a PII entity recognizer which uses regular expressions or deny-lists.
The following code sample will:
- Create custom recognizers
- Add the new custom recognizers to the analyzer
- Call analyzer to get results from the new recognizers
titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.", "Mrs.", "Miss"])
pronoun_recognizer = PatternRecognizer(supported_entity="PRONOUN",
                                       deny_list=["he", "He", "his", "His", "she", "She", "hers", "Hers"])
analyzer.registry.add_recognizer(titles_recognizer)
analyzer.registry.add_recognizer(pronoun_recognizer)
analyzer_results = analyzer.analyze(text=text_to_anonymize,
                                    entities=["TITLE", "PRONOUN"],
                                    language="en")
print(analyzer_results)
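To make the deny-list idea concrete, here is a minimal standard-library sketch of how a list of terms can be turned into a single regular expression and matched against the text. It is an illustration only, not Presidio's actual implementation; the helper name `deny_list_spans` and the exact boundary rule (a non-word character or the string edge on both sides of the term) are assumptions made for this example.

```python
import re


def deny_list_spans(text, deny_list):
    """Return (start, end, matched_text) for each deny-list term found in text.

    Hypothetical helper for illustration; not part of Presidio.
    """
    # Escape each term so characters such as "." are matched literally,
    # and try longer terms first so "Mrs." is preferred over "Mr.".
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Require a non-word character (or the start/end of the string) on both sides,
    # so that "he" does not match inside a longer word.
    pattern = re.compile(r"(?:^|(?<=\W))(?:" + alternation + r")(?:(?=\W)|$)")
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


text = "His name is Mr. Jones and his phone number is 212-555-5555"
titles = deny_list_spans(text, ["Mr.", "Mrs.", "Miss"])
pronouns = deny_list_spans(
    text, ["he", "He", "his", "His", "she", "She", "hers", "Hers"]
)
print(titles)
print(pronouns)
```

Because matching is case-sensitive in this sketch, both "his" and "His" have to appear in the list, which mirrors the deny-list used for the PRONOUN recognizer above.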
Call Presidio Analyzer to get results from all configured recognizers, both the default ones and the new custom ones.
analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')
analyzer_results
Anonymize Text with Identified PII Entities
Presidio Anonymizer iterates over the Presidio Analyzer result, and provides anonymization capabilities for the identified text.
The anonymizer provides five types of anonymizers: replace, redact, mask, hash, and encrypt. The default is replace.
The following code sample will:
- Set up the anonymizer engine
- Create an anonymizer request: the text to anonymize, the list of anonymizers to apply, and the results from the analyzer request
- Anonymize the text
anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
               "PHONE_NUMBER": OperatorConfig("mask", {"type": "mask", "masking_char": "*", "chars_to_mask": 12, "from_end": True}),
               "TITLE": OperatorConfig("redact", {})}
)
print(f"text: {anonymized_results.text}")
print("detailed response:")
pprint(json.loads(anonymized_results.to_json()))
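The per-entity operator mapping used above can be illustrated with a small standard-library sketch. It is not Presidio's implementation: the helpers `mask_from_end` and `apply_operators` are hypothetical, and the spans are written by hand for this example rather than produced by the analyzer (a real run would also report other entities, such as the pronouns). The sketch shows one common way to apply span-based substitutions, working from right to left so that earlier offsets remain valid.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type)


def mask_from_end(value: str, masking_char: str = "*", chars_to_mask: int = 4) -> str:
    """Replace the last `chars_to_mask` characters of value with masking_char."""
    n = max(0, min(chars_to_mask, len(value)))
    return value[: len(value) - n] + masking_char * n


def apply_operators(
    text: str,
    spans: List[Span],
    operators: Dict[str, Callable[[str], str]],
    default: Callable[[str], str],
) -> str:
    """Apply one operator per span, falling back to `default` for unknown types."""
    # Process spans from the end of the text so that replacing one span
    # does not shift the offsets of the spans still to be processed.
    for start, end, entity_type in sorted(spans, key=lambda s: s[0], reverse=True):
        operator = operators.get(entity_type, default)
        text = text[:start] + operator(text[start:end]) + text[end:]
    return text


text = "His name is Mr. Jones and his phone number is 212-555-5555"
phone = "212-555-5555"

# Hand-written spans standing in for analyzer output (an assumption for this sketch).
spans = [
    (text.index("Mr."), text.index("Mr.") + len("Mr."), "TITLE"),
    (text.index("Jones"), text.index("Jones") + len("Jones"), "PERSON"),
    (text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER"),
]

operators = {
    "TITLE": lambda value: "",  # redact: remove the text entirely
    "PHONE_NUMBER": lambda value: mask_from_end(value, "*", 12),  # mask
}

result = apply_operators(
    text, spans, operators, default=lambda value: "<ANONYMIZED>"  # replace
)
print(result)
```

Note that redacting the title leaves the surrounding whitespace in place, so the output contains two consecutive spaces where "Mr." used to be.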