# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg

Getting a list of all identified texts

This sample illustrates how to get a list of all the identified PII entities using Presidio Analyzer for detection and a custom Presidio Anonymizer operator.

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text_to_analyze = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
analyzer_results = analyzer.analyze(text_to_analyze, language="en")

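A naive approach is to slice the original text with each result's start and end offsets. Note that this returns every detection, including overlapping ones: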

[(text_to_analyze[res.start:res.end], res.start, res.end) for res in analyzer_results]
[('cdarwin@hmsbeagle.org', 45, 66),
 ('Charles Darwin', 14, 28),
 ('hmsbeagle.org', 53, 66)]

Another option is to set up a custom operator* that runs the identity function (lambda x: x). This operator doesn't actually anonymize anything; it replaces each identified value with itself. The benefit is that the Anonymizer resolves overlapping results automatically.

> In this example, the URL (hmsbeagle.org) is contained in the email address, so it's omitted from the final result.

* An Operator is usually either an Anonymizer or a Deanonymizer in the presidio-anonymizer library.

anonymized_results = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("custom", {"lambda": lambda x: x})},
)

The operator is registered under the DEFAULT key, meaning it will be used for all entity types. The OperatorConfig selects the custom operator, and the lambda passed to it is the identity function.

Output text, start and end locations for each detected entity

[(item.text, item.start, item.end) for item in anonymized_results.items]
[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
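To make the difference from the naive list concrete, here is a minimal pure-Python sketch of one way a span contained in a longer span could be dropped. The helper name drop_contained_spans is hypothetical, and this is a simplified illustration, not Presidio's actual conflict-resolution logic (which also takes entity types and scores into account).

```python
from typing import List, Tuple

# (text, start, end), the same shape as the tuples printed above
Span = Tuple[str, int, int]


def drop_contained_spans(spans: List[Span]) -> List[Span]:
    """Hypothetical helper: keep spans that are not fully inside a longer span.

    Simplified illustration only; it is not Presidio's implementation.
    """
    kept = []
    for text, start, end in spans:
        contained = any(
            o_start <= start
            and end <= o_end
            and (o_end - o_start) > (end - start)
            for _, o_start, o_end in spans
        )
        if not contained:
            kept.append((text, start, end))
    return kept


# The naive result from earlier in this sample
naive = [
    ("cdarwin@hmsbeagle.org", 45, 66),
    ("Charles Darwin", 14, 28),
    ("hmsbeagle.org", 53, 66),
]

filtered = drop_contained_spans(naive)
print(filtered)
# [('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
```

The URL span (53 to 66) lies entirely inside the email span (45 to 66), so only the email and the name remain, matching the Anonymizer output above.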

A third option is to use the built-in keep operator, which leaves the detected text unchanged:

anonymized_results_with_keep = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("keep")},
)
[(item.text, item.start, item.end) for item in anonymized_results_with_keep.items]
[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
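In the outputs shown above, the later span (the email) is listed before the earlier one (the name). If the entities are needed in document order, sorting by start offset is enough. A small sketch on plain tuples copied from the output above; the variable names are illustrative:

```python
# Tuples copied from the output above: (text, start, end)
extracted = [
    ("cdarwin@hmsbeagle.org", 45, 66),
    ("Charles Darwin", 14, 28),
]

# Sort by the start offset (index 1) to get left-to-right document order
in_document_order = sorted(extracted, key=lambda item: item[1])
print(in_document_order)
# [('Charles Darwin', 14, 28), ('cdarwin@hmsbeagle.org', 45, 66)]
```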