In [ ]:

# Install Presidio and download the spaCy English model
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg

Path to notebook: https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/getting_entity_values.ipynb

Getting a list of all identified texts

This sample illustrates how to get a list of all the identified PII entities using Presidio Analyzer for detection and a custom Presidio Anonymizer operator.

In [1]:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

In [2]:

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

In [3]:

text_to_analyze = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
analyzer_results = analyzer.analyze(text_to_analyze, language="en")

A naive approach for getting the text values:

In [4]:

[(text_to_analyze[res.start:res.end], res.start, res.end) for res in analyzer_results]

Out[4]:

[('cdarwin@hmsbeagle.org', 45, 66),
 ('Charles Darwin', 14, 28),
 ('hmsbeagle.org', 53, 66)]

Another option is to set up a custom operator* that runs an identity function (lambda x: x). This operator doesn't actually anonymize anything; it replaces each identified value with itself. This is useful because the Anonymizer handles overlapping results automatically.

In this example, the URL (hmsbeagle.org) is contained in the email address, so it's omitted from the final result.

* An operator is usually either an Anonymizer or a Deanonymizer in the presidio-anonymizer library.
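
To make the overlap handling described above more concrete, here is a minimal pure-Python sketch of the idea. It is not Presidio's actual implementation: the `Span` type and the `drop_contained_spans` helper are hypothetical names introduced only for illustration, and the rule shown (drop any span fully contained in a longer one) is a simplifying assumption. The offsets are the ones from the naive output above.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for an analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int


def drop_contained_spans(spans: List[Span]) -> List[Span]:
    """Keep only spans that are not fully contained in an already kept span."""
    # Visit longer spans first so a containing span is kept before its contents.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), s.start))
    kept: List[Span] = []
    for span in ordered:
        contained = any(k.start <= span.start and span.end <= k.end for k in kept)
        if not contained:
            kept.append(span)
    return kept


text = "Hi my name is Charles Darwin and my email is cdarwin@hmsbeagle.org"
spans = [
    Span("EMAIL_ADDRESS", 45, 66),
    Span("PERSON", 14, 28),
    Span("URL", 53, 66),  # lies inside the email span, so it is dropped
]

values = [(text[s.start:s.end], s.start, s.end) for s in drop_contained_spans(spans)]
print(values)
```

Under these assumptions the URL span disappears and only the email address and the name remain, which is the behavior the identity operator relies on.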

In [7]:

anonymized_results = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("custom", {"lambda": lambda x: x})},
)

The operator is registered under the DEFAULT key, meaning it is used for all entity types. The OperatorConfig selects the custom operator, and its lambda is the identity function.

Output the text, start, and end locations for each detected entity:

In [8]:

[(item.text, item.start, item.end) for item in anonymized_results.items]

Out[8]:

[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
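
Once the values are in hand, it can be convenient to group them by entity type. The sketch below uses a hypothetical `Item` tuple as a stand-in for the result items, so it runs without Presidio installed. It assumes each item exposes `entity_type` and `text` attributes, and `group_by_entity_type` is an illustrative helper, not part of the library.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Item(NamedTuple):
    """Hypothetical stand-in for one anonymizer result item."""
    entity_type: str
    text: str
    start: int
    end: int


def group_by_entity_type(items: Iterable[Item]) -> Dict[str, List[str]]:
    """Collect the detected text values under their entity type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for item in items:
        grouped[item.entity_type].append(item.text)
    return dict(grouped)


# Values taken from the output above; the entity type labels are assumed.
items = [
    Item("EMAIL_ADDRESS", "cdarwin@hmsbeagle.org", 45, 66),
    Item("PERSON", "Charles Darwin", 14, 28),
]

values_by_type = group_by_entity_type(items)
print(values_by_type)
```

With real results, the same helper could be called on `anonymized_results.items`, provided the items carry the attributes assumed here.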

A third option is to use the built-in keep operator, which leaves the detected text unchanged:

In [9]:

anonymized_results_with_keep = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=analyzer_results,
    operators={"DEFAULT": OperatorConfig("keep")},
)
[(item.text, item.start, item.end) for item in anonymized_results_with_keep.items]

Out[9]:

[('cdarwin@hmsbeagle.org', 45, 66), ('Charles Darwin', 14, 28)]
