Getting started with text de-identification with Presidio
Presidio provides a simple way to de-identify text data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with text de-identification using Presidio's Python packages.
Note that Presidio can leverage different NLP packages to analyze text data. The default engine is based on spaCy, but you can also use others. This guide shows two examples: one using spaCy and the other using transformers.
Simple flow - Python package
Using Presidio's modules as Python packages to get started:
-
Install Presidio
pip install presidio-analyzer pip install presidio-anonymizer python -m spacy download en_core_web_lg -
Analyze + Anonymize
from presidio_analyzer import AnalyzerEngine from presidio_anonymizer import AnonymizerEngine text="My phone number is 212-555-5555" # Set up the engine, loads the NLP module (spaCy model by default) # and other PII recognizers analyzer = AnalyzerEngine() # Call analyzer to get results results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language='en') print(results) # Analyzer results are passed to the AnonymizerEngine for anonymization anonymizer = AnonymizerEngine() anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results) print(anonymized_text)
-
Install Presidio
pip install "presidio-analyzer[transformers]" pip install presidio-anonymizer python -m spacy download en_core_web_sm -
Analyze + Anonymize
from presidio_analyzer import AnalyzerEngine from presidio_analyzer.nlp_engine import TransformersNlpEngine from presidio_anonymizer import AnonymizerEngine text = "My name is Don and my phone number is 212-555-5555" # Define which transformers model to use model_config = [{"lang_code": "en", "model_name": { "spacy": "en_core_web_sm", # use a small spaCy model for lemmas, tokens etc. "transformers": "dslim/bert-base-NER" } }] nlp_engine = TransformersNlpEngine(models=model_config) # Set up the engine, loads the NLP module (spaCy model by default) # and other PII recognizers analyzer = AnalyzerEngine(nlp_engine=nlp_engine) # Call analyzer to get results results = analyzer.analyze(text=text, language='en') print(results) # Analyzer results are passed to the AnonymizerEngine for anonymization anonymizer = AnonymizerEngine() anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results) print(anonymized_text)Tip: Downloading models
If not available, the transformers model and the spacy model would be downloaded on the first call to the
AnalyzerEngine. To pre-download, see this doc.
Simple flow - Docker container
Presidio provides Docker containers that you can use to de-identify text data. Each module, analyzer, and anonymizer, has its own Docker container. The containers are available on Docker Hub.
- Download Docker images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer
- Run containers
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
- Use the API
curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{
"text": "My phone number is 555-123-4567.",
"language": "en"
}'
curl -X POST http://localhost:5001/anonymize -H "Content-Type: application/json" -d '
{
"text": "My phone number is 555-123-4567",
"anonymizers": {
"PHONE_NUMBER": {
"type": "replace",
"new_value": "--Redacted phone number--"
}
},
"analyzer_results": [
{
"start": 19,
"end": 31,
"score": 0.95,
"entity_type": "PHONE_NUMBER"
}
]}'