Skip to content

Getting started with text de-identification with Presidio

Presidio provides a simple way to de-identify text data by detecting and anonymizing personally identifiable information (PII). This guide shows you how to get started with text de-identification using Presidio's Python packages.

Note that Presidio can leverage different NLP packages to analyze text data. The default engine is based on spaCy, but you can also use others. This guide shows two examples: one using spaCy and the other using transformers.

Simple flow - Python package

Using Presidio's modules as Python packages to get started:

  1. Install Presidio

    pip install presidio-analyzer
    pip install presidio-anonymizer
    python -m spacy download en_core_web_lg
    
  2. Analyze + Anonymize

    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    
    text="My phone number is 212-555-5555"
    
    # Set up the engine, loads the NLP module (spaCy model by default) 
    # and other PII recognizers
    analyzer = AnalyzerEngine()
    
    # Call analyzer to get results
    results = analyzer.analyze(text=text,
                               entities=["PHONE_NUMBER"],
                               language='en')
    print(results)
    
    # Analyzer results are passed to the AnonymizerEngine for anonymization
    
    anonymizer = AnonymizerEngine()
    
    anonymized_text = anonymizer.anonymize(text=text,analyzer_results=results)
    
    print(anonymized_text)
    
  1. Install Presidio

    pip install "presidio-analyzer[transformers]"
    pip install presidio-anonymizer
    python -m spacy download en_core_web_sm
    
  2. Analyze + Anonymize

    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import TransformersNlpEngine
    from presidio_anonymizer import AnonymizerEngine
    
    text = "My name is Don and my phone number is 212-555-5555"
    
    # Define which transformers model to use
    model_config = [{"lang_code": "en", "model_name": {
        "spacy": "en_core_web_sm",  # use a small spaCy model for lemmas, tokens etc.
        "transformers": "dslim/bert-base-NER"
        }
    }]
    
    nlp_engine = TransformersNlpEngine(models=model_config)
    
    # Set up the engine, loads the NLP module (spaCy model by default) 
    # and other PII recognizers
    analyzer = AnalyzerEngine(nlp_engine=nlp_engine)
    
    # Call analyzer to get results
    results = analyzer.analyze(text=text, language='en')
    print(results)
    
    # Analyzer results are passed to the AnonymizerEngine for anonymization
    
    anonymizer = AnonymizerEngine()
    
    anonymized_text = anonymizer.anonymize(text=text, analyzer_results=results)
    
    print(anonymized_text)
    

    Tip: Downloading models

    If not available, the transformers model and the spacy model would be downloaded on the first call to the AnalyzerEngine. To pre-download, see this doc.

Simple flow - Docker container

Presidio provides Docker containers that you can use to de-identify text data. Each module, analyzer, and anonymizer, has its own Docker container. The containers are available on Docker Hub.

  1. Download Docker images
docker pull mcr.microsoft.com/presidio-analyzer
docker pull mcr.microsoft.com/presidio-anonymizer
  1. Run containers
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest

docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
  1. Use the API
curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{
  "text": "My phone number is 555-123-4567.",
  "language": "en"
}'


curl -X POST http://localhost:5001/anonymize -H "Content-Type: application/json"  -d '
    {
        "text": "My phone number is 555-123-4567",
        "anonymizers": {
            "PHONE_NUMBER": {
            "type": "replace",
            "new_value": "--Redacted phone number--"
            }
        },
        "analyzer_results": [
        {
            "start": 19,
            "end": 31,
            "score": 0.95,
            "entity_type": "PHONE_NUMBER"
        }
    ]}'

Read more