Getting started with text de-identification with Presidio
Presidio provides a simple way to de-identify text data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with text de-identification using Presidio's Python packages.
Note that Presidio can leverage different NLP packages to analyze text data. The default engine is based on spaCy, but you can also use others. This guide shows two examples: one using spaCy and the other using transformers.
Simple flow - Python package
To get started, use Presidio's modules as Python packages:
- Install Presidio (spaCy example)
pip install presidio-analyzer
pip install presidio-anonymizer
python -m spacy download en_core_web_lg
- Analyze + Anonymize
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text, entities=["PHONE_NUMBER"], language='en')
print(results)

# Analyzer results are passed to the AnonymizerEngine for anonymization
anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)
- Install Presidio (transformers example)
pip install "presidio-analyzer[transformers]"
pip install presidio-anonymizer
python -m spacy download en_core_web_sm
- Analyze + Anonymize
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Don and my phone number is 212-555-5555"

# Define which transformers model to use
model_config = [{
    "lang_code": "en",
    "model_name": {
        "spacy": "en_core_web_sm",  # use a small spaCy model for lemmas, tokens etc.
        "transformers": "dslim/bert-base-NER"
    }
}]

nlp_engine = TransformersNlpEngine(models=model_config)

# Set up the engine with the transformers-based NLP engine
# and other PII recognizers
analyzer = AnalyzerEngine(nlp_engine=nlp_engine)

# Call analyzer to get results
results = analyzer.analyze(text=text, language='en')
print(results)

# Analyzer results are passed to the AnonymizerEngine for anonymization
anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
print(anonymized_text)
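To make the hand-off between the analyzer and the anonymizer more concrete, here is a minimal standard-library sketch of offset-based replacement: each detected span is described by an entity type and start/end character offsets, and the text in that range is swapped for a placeholder. This is not Presidio's implementation. The `Span` class, the `span_for` and `replace_spans` helpers, the scores, and the `<ENTITY_TYPE>` placeholder format are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int  # index of the first character of the entity
    end: int    # index one past the last character of the entity
    score: float


def span_for(text: str, value: str, entity_type: str, score: float) -> Span:
    """Build a Span by locating `value` in `text` (illustration only)."""
    start = text.find(value)
    if start < 0:
        raise ValueError(f"{value!r} not found in text")
    return Span(entity_type, start, start + len(value), score)


def replace_spans(text: str, spans: Iterable[Span]) -> str:
    """Replace every span with an <ENTITY_TYPE> placeholder."""
    out = text
    # Go from the rightmost span to the leftmost so that replacing one span
    # does not shift the offsets of the spans still to be processed.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"<{span.entity_type}>" + out[span.end:]
    return out


text = "My name is Don and my phone number is 212-555-5555"
spans = [
    span_for(text, "Don", "PERSON", 0.85),                 # score is made up
    span_for(text, "212-555-5555", "PHONE_NUMBER", 0.75),  # score is made up
]
redacted = replace_spans(text, spans)
print(redacted)
```

Processing spans from right to left is the key design choice in this sketch: placeholders are usually a different length than the text they replace, so working left to right would invalidate the remaining offsets.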
Tip: Downloading models
If they are not already available, the transformers model and the spaCy model are downloaded on the first call to the AnalyzerEngine. To pre-download, see this doc.
Simple flow - Docker container
Presidio provides Docker containers that you can use to de-identify text data. Each module (the analyzer and the anonymizer) has its own Docker container. The containers are available on Docker Hub.
- Download Docker images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer
- Run containers
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
- Use the API
curl -X POST http://localhost:5002/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My phone number is 555-123-4567.",
    "language": "en"
  }'
curl -X POST http://localhost:5001/anonymize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "My phone number is 555-123-4567",
    "anonymizers": {
      "PHONE_NUMBER": {
        "type": "replace",
        "new_value": "--Redacted phone number--"
      }
    },
    "analyzer_results": [
      {
        "start": 19,
        "end": 31,
        "score": 0.95,
        "entity_type": "PHONE_NUMBER"
      }
    ]
  }'
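The anonymize request above can also be assembled programmatically. The following is a rough sketch using only the Python standard library: it computes the `start` and `end` offsets instead of hard-coding them, so they stay consistent with the text. The helper names `build_anonymize_payload` and `post_json` are hypothetical, and `post_json` assumes that the container from the previous step is running and that the service replies with a JSON body, so it is defined here but not called.

```python
import json
import urllib.request


def build_anonymize_payload(text: str, phone: str, new_value: str) -> dict:
    """Build the request body for /anonymize (hypothetical helper)."""
    start = text.find(phone)
    if start < 0:
        raise ValueError("phone number not found in text")
    return {
        "text": text,
        "anonymizers": {
            "PHONE_NUMBER": {"type": "replace", "new_value": new_value}
        },
        "analyzer_results": [
            {
                "start": start,            # offset of the first character
                "end": start + len(phone),  # offset one past the last character
                "score": 0.95,             # same value as in the curl example
                "entity_type": "PHONE_NUMBER",
            }
        ],
    }


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and parse a JSON reply.

    Assumes the service is reachable and answers with JSON.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_anonymize_payload(
    text="My phone number is 555-123-4567",
    phone="555-123-4567",
    new_value="--Redacted phone number--",
)
print(json.dumps(payload, indent=2))
```

With the anonymizer container running on port 5001 as shown earlier, the request could then be sent with `post_json("http://localhost:5001/anonymize", payload)`.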