Getting started with text de-identification with Presidio

Presidio provides a simple way to de-identify text data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with text de-identification using Presidio's Python packages.

Note that Presidio can leverage different NLP packages to analyze text data. The default engine is based on spaCy, but you can also use others. This guide shows two examples: one using spaCy and the other using transformers.

Simple flow - Python package

To get started, use Presidio's modules as Python packages. Two variants are shown:

Anonymize PII in text (Default spaCy model)
Anonymize PII in text (transformers)

Install Presidio (default spaCy model)

pip install presidio-analyzer
pip install presidio-anonymizer
python -m spacy download en_core_web_lg

Analyze + Anonymize (default spaCy model)

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My phone number is 212-555-5555"

# Set up the engine, loads the NLP module (spaCy model by default) 
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text=text,
                           entities=["PHONE_NUMBER"],
                           language='en')
print(results)

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)
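
To make the hand-off between the two engines concrete, here is a minimal sketch in plain Python of what a replace-style anonymization does with the analyzer's findings. It is not Presidio's implementation: the `Span` type and the `replace_spans` function are hypothetical names introduced for illustration, and the sketch assumes that each finding carries an entity type plus start and end character offsets and that findings do not overlap.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one analyzer finding."""
    entity_type: str
    start: int  # offset of the first character of the match
    end: int    # offset one past the last character of the match


def replace_spans(text: str, spans: list[Span]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    # Work from the end of the text backwards so that earlier offsets
    # stay valid after each replacement changes the string length.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "My phone number is 212-555-5555"
spans = [Span("PHONE_NUMBER", 19, 31)]
print(replace_spans(text, spans))
```

Processing the spans from right to left is the key design choice: replacing a span usually changes the length of the text, which would shift the offsets of every span after it.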

Install Presidio (transformers)

pip install "presidio-analyzer[transformers]"
pip install presidio-anonymizer
python -m spacy download en_core_web_sm

Analyze + Anonymize (transformers)

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import TransformersNlpEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Don and my phone number is 212-555-5555"

# Define which transformers model to use
model_config = [{"lang_code": "en", "model_name": {
    "spacy": "en_core_web_sm",  # use a small spaCy model for lemmas, tokens etc.
    "transformers": "dslim/bert-base-NER"
    }
}]

nlp_engine = TransformersNlpEngine(models=model_config)

# Set up the engine with the transformers-based NLP engine defined above
# and other PII recognizers
analyzer = AnalyzerEngine(nlp_engine=nlp_engine)

# Call analyzer to get results
results = analyzer.analyze(text=text, language='en')
print(results)

# Analyzer results are passed to the AnonymizerEngine for anonymization

anonymizer = AnonymizerEngine()

anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)

print(anonymized_text)

Tip: Downloading models

If they are not already available locally, the transformers model and the spaCy model are downloaded on the first call to the AnalyzerEngine. To pre-download, see this doc.

Simple flow - Docker container

Presidio provides Docker containers that you can use to de-identify text data. Each module (analyzer and anonymizer) has its own Docker image. The commands below pull the images from the Microsoft Container Registry (mcr.microsoft.com).

Download Docker images

docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer

Run containers

docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest

docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
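
Because both containers start in the background (`-d`), the services may need a moment before they accept requests. The following is a minimal sketch of a readiness check using only the Python standard library. It assumes that each service answers a GET request on a `/health` path with HTTP 200; treat that path as an assumption and verify it against your image version. The names `http_ok` and `wait_until_ready` are hypothetical helpers, not part of Presidio.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 1.0,
                     probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll url until probe succeeds or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the port mappings above, `wait_until_ready("http://localhost:5002/health")` would check the analyzer and `wait_until_ready("http://localhost:5001/health")` the anonymizer. The `probe` parameter exists only so the polling loop can be exercised without a running container.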

Use the API

curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{
  "text": "My phone number is 555-123-4567.",
  "language": "en"
}'


curl -X POST http://localhost:5001/anonymize \
-H "Content-Type: application/json" \
-d '{
  "text": "My phone number is 555-123-4567",
  "anonymizers": {
    "PHONE_NUMBER": {
      "type": "replace",
      "new_value": "--Redacted phone number--"
    }
  },
  "analyzer_results": [
    {
      "start": 19,
      "end": 31,
      "score": 0.95,
      "entity_type": "PHONE_NUMBER"
    }
  ]
}'
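
The two requests above can also be chained from Python, so that the analyzer's findings feed directly into the anonymizer. The following is a minimal sketch using only the standard library. It assumes the containers are running on the ports mapped above and that the `/analyze` response is a JSON list whose items include `start`, `end`, `score`, and `entity_type`. The helper names (`post_json`, `keep_core_fields`, `build_anonymize_payload`, `deidentify`) are hypothetical.

```python
import json
import urllib.request

ANALYZER_URL = "http://localhost:5002/analyze"
ANONYMIZER_URL = "http://localhost:5001/anonymize"
CORE_FIELDS = ("start", "end", "score", "entity_type")


def post_json(url: str, payload: dict):
    """POST payload as JSON to url and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def keep_core_fields(result: dict) -> dict:
    """Keep only the fields used in the /anonymize example request."""
    return {key: result[key] for key in CORE_FIELDS}


def build_anonymize_payload(text: str, analyzer_results: list, new_value: str) -> dict:
    """Build an /anonymize request body that replaces phone numbers."""
    return {
        "text": text,
        "anonymizers": {
            "PHONE_NUMBER": {"type": "replace", "new_value": new_value},
        },
        "analyzer_results": [keep_core_fields(r) for r in analyzer_results],
    }


def deidentify(text: str):
    """Analyze text, then anonymize it using the analyzer's findings."""
    results = post_json(ANALYZER_URL, {"text": text, "language": "en"})
    payload = build_anonymize_payload(text, results, "--Redacted phone number--")
    return post_json(ANONYMIZER_URL, payload)
```

Calling `deidentify("My phone number is 555-123-4567")` sends both requests in sequence. Only `build_anonymize_payload` and `keep_core_fields` can be checked without the containers running.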
