# Install Presidio and download the spaCy English model
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Getting a list of all identified texts
This sample illustrates how to get a list of all the identified PII entities using Presidio Analyzer for detection and a custom Presidio Anonymizer operator.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text_to_analyze = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
analyzer_results = analyzer.analyze(text_to_analyze, language="en")
A naive approach for getting the text values:
[(text_to_analyze[res.start:res.end], res.start, res.end) for res in analyzer_results]
Another option is to set up a custom operator* that runs an identity function (lambda x: x). This operator doesn't really anonymize; it replaces each identified value with itself. This is useful because the Anonymizer handles overlapping results automatically.
> In this example, the URL (hmsbeagle.org) is contained in the email address, so it's omitted from the final result.
* An Operator is usually either an Anonymizer or a Deanonymizer in the presidio-anonymizer library.
anonymized_results = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("custom", {"lambda": lambda x: x})},
)
The operator is registered under the DEFAULT key, meaning it is applied to all entity types. The OperatorConfig is a custom one, and its lambda is the identity function.
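To make the role of the DEFAULT key concrete, here is a minimal pure-Python sketch of a lookup with a fallback. It does not call Presidio; the `operators` table of plain callables and the `resolve_operator` helper are hypothetical names used only for illustration, and the `<EMAIL>` override is an assumed example rather than something configured in this sample.

```python
# Hypothetical stand-in for an operator table: entity type -> callable.
# This is NOT Presidio's API, only an illustration of the DEFAULT fallback.
operators = {
    "DEFAULT": lambda value: value,            # identity: return the text unchanged
    "EMAIL_ADDRESS": lambda value: "<EMAIL>",  # assumed entity-specific override
}


def resolve_operator(operators, entity_type):
    """Return the entity-specific operator, or fall back to DEFAULT."""
    return operators.get(entity_type, operators["DEFAULT"])


# PERSON has no entry of its own, so the DEFAULT identity function is used.
person_value = resolve_operator(operators, "PERSON")("Charles Darwin")
# EMAIL_ADDRESS has its own entry, which takes precedence over DEFAULT.
email_value = resolve_operator(operators, "EMAIL_ADDRESS")("cdarwin@hmsbeagle.org")
print(person_value, email_value)
```

With only a DEFAULT entry, as in this sample, every entity type resolves to the same identity function.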
Output text, start and end locations for each detected entity
[(item.text, item.start, item.end) for item in anonymized_results.items]
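The reason the URL disappears from this list can be illustrated with a small pure-Python sketch. It is a simplified illustration of dropping spans that are fully contained in a longer span, not Presidio's actual conflict-resolution code; the `Span` class and the `span_for` and `drop_contained` helpers are hypothetical, and the three spans are assumed stand-ins for what the analyzer might return for this sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detected entity (not a Presidio class)."""
    entity_type: str
    start: int
    end: int


def span_for(text, entity_type, value):
    """Build a Span by locating `value` inside `text`."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


def drop_contained(spans):
    """Keep only spans that are not fully contained in a strictly longer span."""
    kept = []
    for span in spans:
        contained = any(
            other.start <= span.start
            and span.end <= other.end
            and (other.end - other.start) > (span.end - span.start)
            for other in spans
        )
        if not contained:
            kept.append(span)
    return sorted(kept, key=lambda s: s.start)


text = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
# Assumed detections: the URL lies entirely inside the email address.
spans = [
    span_for(text, "PERSON", "Charles Darwin"),
    span_for(text, "EMAIL_ADDRESS", "cdarwin@hmsbeagle.org"),
    span_for(text, "URL", "hmsbeagle.org"),
]
kept = drop_contained(spans)
print([(text[s.start:s.end], s.start, s.end) for s in kept])
```

The naive list comprehension shown earlier has no such step, so it would report the URL alongside the email address.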
A third option would be to use the keep operator:
anonymized_results_with_keep = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("keep")},
)
[(item.text, item.start, item.end) for item in anonymized_results_with_keep.items]