PII detection in different languages
Presidio supports PII detection in multiple languages. In its default configuration, it contains recognizers and models for English.
To extend Presidio to detect PII in an additional language, these modules require modification:
- The `NlpEngine`, containing the NLP model which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
- PII recognizers (different `EntityRecognizer` objects), which should be adapted or created.
While detection mechanisms such as regular expressions are language agnostic, the context words used to increase PII detection confidence are not. Consider updating the list of context words for each recognizer to leverage context words in additional languages.
Table of contents
- Configuring the NLP Engine
- Set up language specific recognizers
- Automatically install NLP models into the Docker container
Configuring the NLP Engine
Presidio's NLP engine can be adapted to support multiple languages and frameworks (such as spaCy, Stanza and transformers). Configuring the NLP engine for a new language or NLP framework is done by downloading or using a model trained on a different language, and providing a configuration. See the NLP model customization documentation for details on how to configure models for new languages.
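As an illustration, a configuration file for a spaCy-based engine supporting English and Spanish could look like the following sketch. The schema follows Presidio's NLP engine configuration format; the specific model names are examples, so substitute the models appropriate for your deployment:

```yaml
# Example NLP engine configuration (e.g. languages-config.yml)
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg   # example English spaCy model
  - lang_code: es
    model_name: es_core_news_md  # example Spanish spaCy model
```

Passing this file to `NlpEngineProvider(conf_file=...)` creates an engine that can process both languages.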
Set up language specific recognizers
Recognizers are language dependent, either through their logic or through the context words used when scanning the surroundings of a detected entity. Since these context words are used to increase the score, they should be in the expected input language.
Consider updating the context words of existing recognizers, or adding new recognizers, to support new languages. Each recognizer instance supports one language. For example:
```python
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.predefined_recognizers import EmailRecognizer
from presidio_analyzer.nlp_engine import NlpEngineProvider

LANGUAGES_CONFIG_FILE = "./docs/analyzer/languages-config.yml"

# Create NLP engine based on configuration file
provider = NlpEngineProvider(conf_file=LANGUAGES_CONFIG_FILE)
nlp_engine_with_spanish = provider.create_engine()

# Setting up an English Email recognizer:
email_recognizer_en = EmailRecognizer(supported_language="en", context=["email", "mail"])

# Setting up a Spanish Email recognizer
email_recognizer_es = EmailRecognizer(supported_language="es", context=["correo", "electrónico"])

registry = RecognizerRegistry()

# Add recognizers to registry
registry.add_recognizer(email_recognizer_en)
registry.add_recognizer(email_recognizer_es)

# Set up analyzer with our updated recognizer registry
analyzer = AnalyzerEngine(
    registry=registry,
    supported_languages=["en", "es"],
    nlp_engine=nlp_engine_with_spanish)

analyzer.analyze(text="My name is David", language="en")
```
Automatically install NLP models into the Docker container
When packaging the code into a Docker container, NLP models are installed automatically. To define which models should be installed, update the conf/default.yaml file. This file is read during the docker build phase, and the models defined in it are installed automatically. For transformers-based models, the configuration can be found here.
In addition, make sure the Dockerfile contains the relevant packages for transformers, which are not installed automatically with Presidio.
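As a sketch, the extra packages could be added to the Dockerfile along these lines. The exact package set is an assumption (transformers typically also needs a backend such as torch); adjust to your model and environment:

```dockerfile
# Hypothetical additions to the presidio-analyzer Dockerfile:
# install transformers plus a backend (torch) so that
# transformers-based NLP engines can load their models.
RUN pip install --no-cache-dir transformers torch
```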