
spaCy/Stanza NLP engine

Presidio can be loaded with pre-trained or custom models coming from spaCy or Stanza.

Using a public pre-trained spaCy/Stanza model

Download the pre-trained model

To replace the default model with a different public model, first download the desired spaCy/Stanza NER models.

  • To download a new model with spaCy:

    python -m spacy download es_core_news_md
    

    In this example, we download the medium-sized model for Spanish.

  • To download a new model with Stanza:

    import stanza
    stanza.download("en") # where en is the language code of the model.
    

For the available models, follow these links: spaCy, Stanza.
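The spaCy download step above can also be scripted so that a model is fetched only when it is missing. The following is a minimal sketch, not part of Presidio: the helper names `model_is_installed` and `ensure_spacy_model` are hypothetical, and the sketch assumes the model is installed as a Python package named after the model, which is how `python -m spacy download` installs public models.

```python
import importlib.util


def model_is_installed(package_name: str) -> bool:
    """Return True if a top-level Python package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def ensure_spacy_model(model_name: str) -> None:
    """Download a public spaCy model only if it is not installed yet.

    Hypothetical helper: spaCy is imported lazily, so the check above
    works even in an environment where spaCy itself is missing.
    """
    if model_is_installed(model_name):
        return
    from spacy.cli import download  # same entry point as `python -m spacy download`

    download(model_name)


# Example (needs spaCy and network access, so it is left commented out):
# ensure_spacy_model("es_core_news_md")
```

A freshly downloaded model may only become importable after the Python process restarts, so running this as a separate setup step is the safer choice.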

Tip

For Person, Location and Organization detection, it can be useful to try the transformer-based models (e.g. en_core_web_trf), which use a more modern deep-learning architecture but are generally slower than the default en_core_web_lg model.

Configure Presidio to use the pre-trained model

Once the model is downloaded, configure Presidio to use it. See the NLP configuration documentation for more information.
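As a rough illustration of what such a configuration can look like in code, the sketch below builds a configuration dictionary for the Spanish model downloaded earlier and passes it to `NlpEngineProvider`. The helper names `build_nlp_configuration` and `create_analyzer` are hypothetical and introduced only for this example. The NLP configuration documentation remains the authoritative reference for the available keys.

```python
from typing import Any, Dict


def build_nlp_configuration(
    lang_code: str, model_name: str, engine_name: str = "spacy"
) -> Dict[str, Any]:
    """Build an NLP engine configuration for a single language/model pair."""
    return {
        "nlp_engine_name": engine_name,  # "spacy" or "stanza"
        "models": [{"lang_code": lang_code, "model_name": model_name}],
    }


def create_analyzer(configuration: Dict[str, Any]):
    """Create an AnalyzerEngine backed by the configured NLP engine.

    Presidio is imported lazily, so building the configuration alone
    does not require Presidio or the model to be installed.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
    languages = [model["lang_code"] for model in configuration["models"]]
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=languages)


configuration = build_nlp_configuration("es", "es_core_news_md")

# Requires Presidio and the es_core_news_md model to be installed:
# analyzer = create_analyzer(configuration)
# results = analyzer.analyze(text="Mi nombre es David", language="es")
```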

How NER results flow within Presidio

This diagram describes the flow of NER results within Presidio, and the relationship between the SpacyNlpEngine component and the SpacyRecognizer component:

sequenceDiagram
    AnalyzerEngine->>SpacyNlpEngine: Call engine.process_text(text) <br>to get model results
    SpacyNlpEngine->>spaCy: Call spaCy pipeline
    spaCy->>SpacyNlpEngine: return entities and other attributes
    Note over SpacyNlpEngine: Map entity names to Presidio's, <BR>update scores, <BR>remove unwanted entities <BR> based on NerModelConfiguration
    SpacyNlpEngine->>AnalyzerEngine: Pass NlpArtifacts<BR>(Entities, lemmas, tokens, scores etc.)
    Note over AnalyzerEngine: Call all recognizers
    AnalyzerEngine->>SpacyRecognizer: Pass NlpArtifacts
    Note over SpacyRecognizer: Extract PII entities out of NlpArtifacts
    SpacyRecognizer->>AnalyzerEngine: Return List[RecognizerResult]
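
The mapping step in the note over SpacyNlpEngine (map entity names, update scores, remove unwanted entities) can be pictured with a small standalone sketch. This is a toy illustration in plain Python, not Presidio's implementation: `ToyNerConfig`, `map_model_entities`, and all field names and default values are hypothetical stand-ins for the settings that NerModelConfiguration carries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ToyNerConfig:
    """Hypothetical stand-in for the settings held by NerModelConfiguration."""

    label_mapping: Dict[str, str] = field(
        default_factory=lambda: {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}
    )
    labels_to_ignore: Set[str] = field(default_factory=lambda: {"CARDINAL"})
    low_score_entities: Set[str] = field(default_factory=lambda: {"ORGANIZATION"})
    low_score_multiplier: float = 0.4
    default_score: float = 0.85


def map_model_entities(
    entities: List[Tuple[str, str]], config: ToyNerConfig
) -> List[Tuple[str, str, float]]:
    """Turn (text, model_label) pairs into (text, presidio_entity, score) triples."""
    results = []
    for text, label in entities:
        # Remove unwanted entities
        if label in config.labels_to_ignore:
            continue
        # Map the model's label to the Presidio entity name (unknown labels pass through)
        presidio_name = config.label_mapping.get(label, label)
        # Update scores: entity types considered less reliable get a reduced score
        score = config.default_score
        if presidio_name in config.low_score_entities:
            score *= config.low_score_multiplier
        results.append((text, presidio_name, round(score, 2)))
    return results


model_output = [("Bob", "PER"), ("Acme", "ORG"), ("three", "CARDINAL")]
mapped = map_model_entities(model_output, ToyNerConfig())
```

In this toy run, the CARDINAL entity is dropped, PER becomes PERSON, and the ORGANIZATION entity gets a reduced score.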

Training your own model

Note

A labeled dataset containing text and labeled PII entities is required for training a new model.

For more information on model training and evaluation for Presidio, see the Presidio-Research GitHub repository.

To train your own model, see the spaCy and Stanza training documentation.

Once models are trained, they should be installed locally in the same environment as Presidio Analyzer.

Using a previously loaded spaCy pipeline

If the application already loads a spaCy NLP pipeline, that pipeline can be reused by extending the relevant engine, which prevents Presidio from loading it a second time.

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import SpacyNlpEngine
import spacy

# Create a class inheriting from SpacyNlpEngine
class LoadedSpacyNlpEngine(SpacyNlpEngine):
    def __init__(self, loaded_spacy_model):
        super().__init__()
        self.nlp = {"en": loaded_spacy_model}

# Load a model in advance
nlp = spacy.load("en_core_web_sm")

# Pass the loaded model to the new LoadedSpacyNlpEngine
loaded_nlp_engine = LoadedSpacyNlpEngine(loaded_spacy_model=nlp)

# Pass the engine to the analyzer
analyzer = AnalyzerEngine(nlp_engine=loaded_nlp_engine)

# Analyze text
analyzer.analyze(text="My name is Bob", language="en")