# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Getting a list of all identified texts
This sample illustrates how to get a list of all the identified PII entities using Presidio Analyzer for detection and a custom Presidio Anonymizer operator.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text_to_analyze = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
analyzer_results = analyzer.analyze(text_to_analyze, language="en")
A naive approach for getting the text values:
[(text_to_analyze[res.start:res.end], res.start, res.end) for res in analyzer_results]
[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28), ('hmsbeagle.org', 53, 66)]
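The naive list above is not ordered by position in the text and carries no entity types. A minimal sketch of a reusable helper could look like the following. `extract_spans` is a hypothetical name, not part of Presidio, and the `SimpleNamespace` objects are stand-ins that only mimic the `start`, `end`, and `entity_type` attributes of the analyzer's results, so the snippet runs without Presidio installed.

```python
from types import SimpleNamespace


def extract_spans(text, results):
    """Return (entity_type, text, start, end) tuples sorted by start offset."""
    ordered = sorted(results, key=lambda res: (res.start, res.end))
    return [
        (res.entity_type, text[res.start:res.end], res.start, res.end)
        for res in ordered
    ]


sample_text = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"

# Stand-ins for analyzer results (assumption: offsets copied from the output above).
stand_in_results = [
    SimpleNamespace(entity_type="EMAIL_ADDRESS", start=45, end=66),
    SimpleNamespace(entity_type="PERSON", start=14, end=28),
    SimpleNamespace(entity_type="URL", start=53, end=66),
]

spans = extract_spans(sample_text, stand_in_results)
```

With real analyzer output, the call would be `extract_spans(text_to_analyze, analyzer_results)`.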
Another option is to set up a custom operator* which runs an identity function (lambda x: x). This operator doesn't really anonymize; it replaces each identified value with itself. This is useful because the Anonymizer handles overlaps automatically.
In this example, the URL (hmsbeagle.org) is contained in the email address, so it's omitted from the final result.
* An Operator is usually either an Anonymizer or a Deanonymizer in the presidio-anonymizer library.
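To make the overlap handling concrete, here is a minimal pure-Python sketch that drops any span fully contained in a longer one. It only illustrates the idea on the tuples from the naive approach. It is not Presidio's actual conflict-resolution logic, and `drop_contained_spans` is a hypothetical helper name.

```python
def drop_contained_spans(spans):
    """Keep only spans that are not fully contained in another, longer span.

    Each span is a (text, start, end) tuple.
    """
    kept = []
    for text, start, end in spans:
        contained = any(
            o_start <= start
            and end <= o_end
            and (o_end - o_start) > (end - start)
            for _, o_start, o_end in spans
        )
        if not contained:
            kept.append((text, start, end))
    return kept


naive_results = [
    ("cdarwin@hmsbeagle.org", 45, 66),
    ("Charles Darwin", 14, 28),
    ("hmsbeagle.org", 53, 66),
]

filtered = drop_contained_spans(naive_results)
```

The URL span (53 to 66) lies inside the email span (45 to 66), so only the email and the name remain in `filtered`.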
anonymized_results = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("custom", {"lambda": lambda x: x})}
)
The operator defined here is DEFAULT, meaning it will be used for all entities. The OperatorConfig is a custom one, and the lambda is the identity function.
Output the text, start, and end locations for each detected entity:
[(item.text, item.start, item.end) for item in anonymized_results.items]
[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
A third option would be to use the keep operator:
anonymized_results_with_keep = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("keep")}
)
[(item.text, item.start, item.end) for item in anonymized_results_with_keep.items]
[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]