Presidio Analyzer
The Presidio analyzer is a Python-based service for detecting PII entities in text.
During analysis, it runs a set of different PII Recognizers, each one in charge of detecting one or more PII entities using different mechanisms.
Presidio analyzer comes with a set of predefined recognizers, but can easily be extended with other types of custom recognizers. Predefined and custom recognizers leverage regex, Named Entity Recognition and other types of logic to detect PII in unstructured text.
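To make the idea of a pattern-based recognizer concrete, here is a minimal, standalone sketch of regex detection. It does not use Presidio's classes: `Finding`, `PHONE_PATTERN`, and `find_phone_numbers` are hypothetical names, and the pattern and score are illustrative assumptions rather than Presidio's actual values.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """A detected span (hypothetical stand-in, not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


# Illustrative pattern for numbers formatted like 212-555-5555.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phone_numbers(text: str, score: float = 0.5) -> List[Finding]:
    """Return one Finding per regex match, with character offsets."""
    return [
        Finding("PHONE_NUMBER", m.start(), m.end(), score)
        for m in PHONE_PATTERN.finditer(text)
    ]


text = "My phone number is 212-555-5555"
for finding in find_phone_numbers(text):
    print(finding.entity_type, text[finding.start:finding.end])
```

A real recognizer would typically combine several patterns with context words or an NER model; this sketch only shows the span-plus-score shape of a detection.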
Installation
Note
Consider installing the Presidio Python packages in a virtual environment such as venv or conda.
To get started with presidio-analyzer, download the package and the en_core_web_lg spaCy model:
pip install presidio-analyzer
python -m spacy download en_core_web_lg
Note
This option requires Docker to be installed.
# Download the image from the Microsoft Container Registry
docker pull mcr.microsoft.com/presidio-analyzer
# Run the container, mapping host port 5002 to container port 3000
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest
First, clone the Presidio repository.
Then, build the presidio-analyzer container:
cd presidio-analyzer
docker build . -t presidio/presidio-analyzer
Getting started
Once the presidio-analyzer package is installed, run this simple analysis script:
from presidio_analyzer import AnalyzerEngine
# Set up the engine; this loads the NLP module (spaCy model by default) and the other PII recognizers
analyzer = AnalyzerEngine()
# Call the analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language="en")
print(results)
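Each item in the returned list describes one detection. As a sketch of how the results might be consumed, the helper below (a hypothetical `summarize` function, not part of Presidio) relies only on the `entity_type`, `start`, `end`, and `score` attributes of each result. It is demonstrated with a stand-in object rather than a live engine, and the score shown is made up for illustration.

```python
from types import SimpleNamespace


def summarize(text, results):
    """Map each result to (entity type, matched substring, score)."""
    ordered = sorted(results, key=lambda r: r.start)
    return [(r.entity_type, text[r.start:r.end], r.score) for r in ordered]


# Stand-in for one analyzer result; the score value is an assumption.
text = "My phone number is 212-555-5555"
fake_results = [
    SimpleNamespace(entity_type="PHONE_NUMBER", start=19, end=31, score=0.75)
]
print(summarize(text, fake_results))
```

With a real engine, pass the list returned by `analyzer.analyze(...)` as `results`.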
You can run Presidio analyzer as an HTTP server using either the Python runtime or a Docker container.
Using a Docker container
cd presidio-analyzer
docker run -p 5002:3000 presidio/presidio-analyzer
Using the Python runtime
Note
This requires the Presidio GitHub repository to be cloned.
cd presidio-analyzer
python app.py
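Once the server is running, send a request to the analyze endpoint. The example below assumes the server is reachable on port 3000; with the Docker port mapping shown above (5002:3000), use port 5002 instead:
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze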
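The same request can be issued from Python using only the standard library. This is a sketch under the same assumptions as the curl call (server on localhost, port 3000, `/analyze` endpoint); `build_analyze_request` and `analyze_over_http` are hypothetical helper names, and the response is assumed to be JSON.

```python
import json
import urllib.request

ANALYZE_URL = "http://localhost:3000/analyze"  # assumes the server started above


def build_analyze_request(text, language="en", url=ANALYZE_URL):
    """Build a POST request carrying the same JSON body as the curl call."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_over_http(text, language="en", url=ANALYZE_URL):
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_analyze_request(text, language, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network, so it can be inspected first.
request = build_analyze_request("John Smith drivers license is AC432223")
print(request.get_method(), request.full_url)
```

Calling `analyze_over_http("John Smith drivers license is AC432223")` sends the request; it raises a `URLError` if no server is listening.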
Creating PII recognizers
Presidio analyzer can be easily extended to support additional PII entities. See this tutorial on adding new PII recognizers for more information.
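As a rough illustration, a pattern-based recognizer might be registered as follows. This sketch assumes the `Pattern` and `PatternRecognizer` classes and the `registry.add_recognizer` method of `presidio_analyzer`; the `ZIP_CODE` entity name, the regex, and the score are illustrative choices, not values taken from Presidio. The imports are deferred so the module loads even where the package is absent. Refer to the tutorial for the authoritative steps.

```python
import re

ZIP_REGEX = r"\b\d{5}\b"  # illustrative: any standalone five-digit number


def build_zip_recognizer():
    """Create a regex-based recognizer for a hypothetical ZIP_CODE entity."""
    from presidio_analyzer import Pattern, PatternRecognizer

    zip_pattern = Pattern(name="zip code (weak)", regex=ZIP_REGEX, score=0.3)
    return PatternRecognizer(supported_entity="ZIP_CODE", patterns=[zip_pattern])


def analyze_with_zip(text):
    """Register the custom recognizer, then analyze the text with it."""
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    analyzer.registry.add_recognizer(build_zip_recognizer())
    return analyzer.analyze(text=text, entities=["ZIP_CODE"], language="en")


# The regex can be checked on its own before wiring it into Presidio.
print(re.findall(ZIP_REGEX, "Ship to 90210 or 10001"))
```

The low score reflects that a bare five-digit match is weak evidence on its own; a production recognizer would usually add context words to raise confidence.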
Multi-language support
Presidio can be used to detect PII entities in multiple languages. Refer to the multi-language support documentation for more information.
Outputting the analyzer decision process
Presidio analyzer has a built-in mechanism for tracing each decision it makes. This can be useful when trying to understand why a specific PII entity was detected. For more information, see the decision process documentation.
Supported entities
For a list of the currently supported entities, see Supported entities.
API reference
See the API Spec for the Analyzer REST API reference, and the Analyzer Python API for the Python API reference.
Samples
Samples illustrating the usage of the Presidio Analyzer can be found in the Python samples.