The Presidio-analyzer decision process

Background

Presidio-analyzer's decision process exposes information on why a specific PII was detected. Such information could contain:

Which recognizer detected the entity
Which regex pattern was used
Interpretability mechanisms in ML models
Which context words improved the score
Confidence scores before and after each step

And more.

Usage

The decision process can be leveraged in two ways:

Presidio-analyzer can log its decision process into a designated logger, which allows you to investigate a specific api request, by exposing a correlation-id as part of the api response headers.
The decision process can be returned as part of the /analyze response.

Getting the decision process as part of the response

The decision process result can be added to the response. To enable it, call the analyze method with return_decision_process set as True.

For example:

HTTPPython

curl -d '{
    "text": "John Smith drivers license is AC432223", 
    "language": "en", 
    "return_decision_process": true}' -H "Content-Type: application/json" -X POST http://localhost:3000/analyze

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text='My phone number is 212-555-5555', 
                        entities=['PHONE_NUMBER'], 
                        language='en', 
                        return_decision_process=True)

# Get the decision process results for the first result
print(results[0].analysis_explanation)

Logging the decision process

Logging of the decision process is turned off by default. To turn it on, create the AnalyzerEngine object with log_decision_process=True.

For example:

from presidio_analyzer import AnalyzerEngine

# Set up the engine, loads the NLP module (spaCy model by default)
# and other PII recognizers
analyzer = AnalyzerEngine(log_decision_process=True)

# Call analyzer to get results
results = analyzer.analyze(text='My phone number is 212-555-5555', 
                           entities=['PHONE_NUMBER'], 
                           language='en', 
                           correlation_id="xyz")

The decision process logs will be written to standard output. Note that it is possible to define a correlation-id which is the trace identification. It will help you to query the stdout logs. The id can be retrieved from each API response header: x-correlation-id.

By having the traces written into the stdout it's very easy to configure a monitoring solution to ease the process of reading processing the tracing logs in a distributed system.

Examples

For the a request with the following text:

My name is Bart Simpson, my Credit card is: 4095-2609-9393-4932,  my phone is 425 8829090

The following traces will be written to log, with this format:

[Date Time][decision_process][Log Level][Unique Correlation ID][Trace Message]

[2019-07-14 14:22:32,409][decision_process][INFO][00000000-0000-0000-0000-000000000000][nlp artifacts:{'entities': (Bart Simpson, 4095, 425), 'tokens': ['My', 'name', 'is', 'Bart', 'Simpson', ',', 'my', 'Credit', 'card', 'is', ':', '4095', '-', '2609', '-', '9393', '-', '4932', ',', ' ', 'my', 'phone', 'is', '425', '8829090'], 'lemmas': ['My', 'name', 'be', 'Bart', 'Simpson', ',', 'my', 'Credit', 'card', 'be', ':', '4095', '-', '2609', '-', '9393', '-', '4932', ',', ' ', 'my', 'phone', 'be', '425', '8829090'], 'tokens_indices': [0, 3, 8, 11, 16, 23, 25, 28, 35, 40, 42, 44, 48, 49, 53, 54, 58, 59, 63, 65, 66, 69, 75, 78, 82], 'keywords': ['bart', 'simpson', 'credit', 'card', '4095', '2609', '9393', '4932', ' ', 'phone', '425', '8829090']}]

[2019-07-14 14:22:32,417][decision_process][INFO][00000000-0000-0000-0000-000000000000][["{'entity_type': 'CREDIT_CARD', 'start': 44, 'end': 63, 'score': 1.0, 'analysis_explanation': {'recognizer': 'CreditCardRecognizer', 'pattern_name': 'All Credit Cards (weak)', 'pattern': '\\\\b((4\\\\d{3})|(5[0-5]\\\\d{2})|(6\\\\d{3})|(1\\\\d{3})|(3\\\\d{3}))[- ]?(\\\\d{3,4})[- ]?(\\\\d{3,4})[- ]?(\\\\d{3,5})\\\\b', 'original_score': 0.3, 'score': 1.0, 'textual_explanation': None, 'score_context_improvement': 0.7, 'supportive_context_word': 'credit', 'validation_result': True}}", "{'entity_type': 'PERSON', 'start': 11, 'end': 23, 'score': 0.85, 'analysis_explanation': {'recognizer': 'SpacyRecognizer', 'pattern_name': None, 'pattern': None, 'original_score': 0.85, 'score': 0.85, 'textual_explanation': \"Identified as PERSON by Spacy's Named Entity Recognition\", 'score_context_improvement': 0, 'supportive_context_word': '', 'validation_result': None}}", "{'entity_type': 'PHONE_NUMBER', 'start': 78, 'end': 89, 'score': 0.85, 'analysis_explanation': {'recognizer': 'UsPhoneRecognizer', 'pattern_name': 'Phone (medium)', 'pattern': '\\\\b(\\\\d{3}[-\\\\.\\\\s]\\\\d{3}[-\\\\.\\\\s]??\\\\d{4})\\\\b', 'original_score': 0.5, 'score': 0.85, 'textual_explanation': None, 'score_context_improvement': 0.35, 'supportive_context_word': 'phone', 'validation_result': None}}"]]

Writing custom decision process for a recognizer

When creating new PII recognizers, it is possible to add information about the recognizer's decision process. This information will be traced or returned to the user, depending on the configuration.

For example, the spacy_recognizer.py implements a custom trace as follows:

SPACY_DEFAULT_EXPLANATION = "Identified as {} by Spacy's Named Entity Recognition"

def build_spacy_explanation(recognizer_name, original_score, entity):
    explanation = AnalysisExplanation(
        recognizer=recognizer_name,
        original_score=original_score,
        textual_explanation=SPACY_DEFAULT_EXPLANATION.format(entity))
    return explanation

The textual_explanation field in AnalysisExplanation class allows you to add your own custom text into the final trace which will be written.

Note

These traces leverage the Python logging mechanisms. In the default configuration, A StreamHandler is used to write these logs to sys.stdout.

Warning

Decision-process traces explain why PIIs were detected, but not why they were not detected!