Example 5: Supporting new models and languages
Two main parts of Presidio handle the text and should be adapted if a new language is required:
- The NlpEngine, containing the NLP model which performs tokenization, lemmatization, Named Entity Recognition and other NLP tasks.
- The different PII recognizers (EntityRecognizer objects), which should be adapted or created.
Adapting the NLP engine
As its internal NLP engine, Presidio supports both spaCy and Stanza. Make sure you download the required spaCy or Stanza models before using them. More details here. For example, to download the Spanish medium spaCy model:
python -m spacy download es_core_news_md
In this example we will configure Presidio to use spaCy as its underlying NLP framework, with NLP models in English and Spanish:
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
# import spacy
# spacy.cli.download("es_core_news_md")
# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "es", "model_name": "es_core_news_md"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_spanish = provider.create_engine()
# Pass the created NLP engine and supported_languages to the AnalyzerEngine
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_spanish, supported_languages=["en", "es"]
)
# Analyze in different languages
results_spanish = analyzer.analyze(text="Mi nombre es Morris", language="es")
print("Results from Spanish request:")
print(results_spanish)
results_english = analyzer.analyze(text="My name is Morris", language="en")
print("Results from English request:")
print(results_english)
See this documentation for more details on setting up additional NLP models and languages.
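Since Stanza is also supported, the same configuration shape can be pointed at it instead of spaCy. The fragment below is a sketch only: it assumes Stanza is installed, that its Spanish and English models have already been downloaded, and that the model names are Stanza's language codes. The name stanza_configuration is illustrative.

```python
# Assumption: Stanza is installed and its "es" and "en" models are downloaded.
# Only the engine name and model names differ from the spaCy configuration.
stanza_configuration = {
    "nlp_engine_name": "stanza",
    "models": [
        {"lang_code": "es", "model_name": "es"},
        {"lang_code": "en", "model_name": "en"},
    ],
}

# The remaining steps mirror the spaCy example:
# provider = NlpEngineProvider(nlp_configuration=stanza_configuration)
# nlp_engine = provider.create_engine()
```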
Using external models/frameworks
Some languages are not supported by spaCy, Stanza or huggingface, or have very limited support in them. In this case, other frameworks can be leveraged (see example 4 for more information).
Since Presidio requires a spaCy model to be passed, we propose to use a simple spaCy pipeline such as en_core_web_sm as the NLP engine's model, and a recognizer calling an external framework/service as the Named Entity Recognition (NER) model.