# download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg
Integrating external models/services with Presidio
Presidio Analyzer is composed of a set of PII recognizers, which can run either locally or remotely. This notebook gives an example of integrating an external service into Presidio Analyzer.
Azure Text Analytics
Azure Text Analytics is a cloud-based service that provides advanced natural language processing over raw text. One of its main functions is Named Entity Recognition (NER), which identifies entities in text and categorizes them into pre-defined classes or types.
Supported entity categories in the Text Analytics API
Text Analytics supports multiple PII entity categories. The Text Analytics service runs a predictive model to identify and categorize named entities from an input document. The service's latest version includes the ability to detect personal (PII) and health (PHI) information. A list of all supported entities can be found in the official documentation.
Prerequisites
To use Text Analytics with Presidio, first create an Azure Text Analytics resource under an Azure subscription. Follow the official documentation for instructions. Once the resource is created, a key and an endpoint are generated; use them to replace the placeholders <YOUR_TEXT_ANALYTICS_KEY> and <YOUR_TEXT_ANALYTICS_ENDPOINT> in this notebook, respectively.
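As an alternative to pasting the key and endpoint into the notebook, the two values can be read from environment variables. The following is a minimal sketch using only the standard library. The variable names TEXT_ANALYTICS_KEY and TEXT_ANALYTICS_ENDPOINT and the helper load_text_analytics_credentials are assumptions made for illustration and are not part of Presidio or the Azure SDK.

```python
import os


def load_text_analytics_credentials(env=os.environ):
    """Return (key, endpoint) read from a mapping of environment variables.

    The variable names used here are hypothetical; choose whatever fits
    your own setup.
    """
    names = ("TEXT_ANALYTICS_KEY", "TEXT_ANALYTICS_ENDPOINT")
    values = [env.get(name) for name in names]

    # Fail early with a readable message instead of sending empty credentials.
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    key, endpoint = values
    return key, endpoint
```

The returned values could then be passed wherever the notebook uses the <YOUR_TEXT_ANALYTICS_KEY> and <YOUR_TEXT_ANALYTICS_ENDPOINT> placeholders.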
Text Analytics Recognizer
In this example we will use the TextAnalyticsRecognizer sample implementation. This class extends Presidio's RemoteRecognizer to call the Text Analytics service REST API. For additional information on remote recognizers, see the ExampleRemoteRecognizer sample.
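To give a rough idea of what a remote recognizer has to do with a service reply, here is a small standard-library sketch that converts a PII response into spans with an entity type, character offsets, and a score. It is not the TextAnalyticsRecognizer code. The response shape (a "documents" list whose "entities" carry "category", "offset", "length", and "confidenceScore"), the Span type, and the CATEGORY_TO_ENTITY mapping are all assumptions for illustration, and the payload is hand-written rather than real service output.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Illustrative stand-in for an analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical mapping from service category names to Presidio entity types.
CATEGORY_TO_ENTITY = {"Person": "NAME", "Age": "AGE"}


def spans_from_response(response: dict) -> List[Span]:
    """Convert an assumed PII response payload into a list of Span objects."""
    spans = []
    for document in response.get("documents", []):
        for entity in document.get("entities", []):
            entity_type = CATEGORY_TO_ENTITY.get(entity["category"])
            if entity_type is None:
                continue  # skip categories that were not requested
            start = entity["offset"]
            end = start + entity["length"]
            spans.append(Span(entity_type, start, end, entity["confidenceScore"]))
    return spans


# Hand-written payload for "David is 30 years old."; scores are made up.
sample_response = {
    "documents": [
        {
            "id": "1",
            "entities": [
                {"category": "Person", "offset": 0, "length": 5, "confidenceScore": 0.9},
                {"category": "Age", "offset": 9, "length": 12, "confidenceScore": 0.8},
            ],
        }
    ]
}

for span in spans_from_response(sample_response):
    print(span)
```

Keeping the category mapping in one dictionary means that supporting another entity only requires a new entry, which mirrors how the notebook declares the categories it wants from the service.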
from presidio_analyzer import AnalyzerEngine
from text_analytics.example_text_analytics_recognizer import TextAnalyticsEntityCategory, TextAnalyticsRecognizer
- Define which entities to get from Text Analytics
ta_entities = [
    TextAnalyticsEntityCategory(name="Person",
                                entity_type="NAME",
                                supported_languages=["en"]),
    TextAnalyticsEntityCategory(name="Age",
                                entity_type="AGE",
                                subcategory="Age",
                                supported_languages=["en"]),
    TextAnalyticsEntityCategory(name="InternationalBankingAccountNumber",
                                entity_type="IBAN",
                                supported_languages=["en"])]
For a full list of entities: https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/named-entity-types?tabs=personal
- Instantiate the remote recognizer object (in this case, TextAnalyticsRecognizer)
text_analytics_recognizer = TextAnalyticsRecognizer(
    text_analytics_key="<YOUR_TEXT_ANALYTICS_KEY>",
    text_analytics_endpoint="<YOUR_TEXT_ANALYTICS_ENDPOINT>",
    text_analytics_categories=ta_entities)
- Add the new recognizer to the list of recognizers and run the PresidioAnalyzer
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(text_analytics_recognizer)
results = analyzer.analyze(
    text="David is 30 years old. His IBAN: IL150120690000003111111", language="en"
)
print(results)
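Printing the result list shows offsets rather than the matched text. The sketch below shows one way to pair each result with the substring it covers. So that it runs without an Azure resource, it uses a namedtuple stand-in that has the same entity_type, start, end, and score fields as the analyzer results. The Result type, the describe helper, and the scores are illustrative assumptions, not actual analyzer output.

```python
from collections import namedtuple

# Stand-in with the same field names as an analyzer result (hypothetical type).
Result = namedtuple("Result", ["entity_type", "start", "end", "score"])


def describe(text, results):
    """Return (entity_type, matched_text, score) tuples ordered by position."""
    ordered = sorted(results, key=lambda result: result.start)
    return [(r.entity_type, text[r.start:r.end], r.score) for r in ordered]


text = "David is 30 years old. His IBAN: IL150120690000003111111"
iban = "IL150120690000003111111"
iban_start = text.find(iban)

# Hand-written results for illustration; the scores are made up.
example_results = [
    Result("IBAN", iban_start, iban_start + len(iban), 1.0),
    Result("NAME", 0, len("David"), 0.85),
]

for row in describe(text, example_results):
    print(row)
```

Since describe only reads the four attributes, the same function should work on the list returned by analyzer.analyze.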