Configuring the Analyzer Engine from file

Presidio uses AnalyzerEngineProvider to load the AnalyzerEngine configuration from file.
Configuration can be loaded in three different ways:

Using a single file

Create an AnalyzerEngineProvider with a single configuration file by passing its path as analyzer_engine_conf_file, then create an AnalyzerEngine based on it:
from presidio_analyzer import AnalyzerEngine, AnalyzerEngineProvider
analyzer_conf_file = "./analyzer/analyzer-config-all.yml"
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file=analyzer_conf_file
)
analyzer = provider.create_engine()
results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
An example configuration file:
supported_languages:
- en
default_score_threshold: 0
nlp_configuration:
  nlp_engine_name: spacy
  models:
  -
    lang_code: en
    model_name: en_core_web_lg
  -
    lang_code: es
    model_name: es_core_news_md
  ner_model_configuration:
    model_to_presidio_entity_mapping:
      PER: PERSON
      PERSON: PERSON
      LOC: LOCATION
      LOCATION: LOCATION
      GPE: LOCATION
      ORG: ORGANIZATION
      DATE: DATE_TIME
      TIME: DATE_TIME
      NORP: NRP
    low_confidence_score_multiplier: 0.4
    low_score_entity_names:
    - ORGANIZATION
    - ORG
    default_score: 0.85
recognizer_registry:
  global_regex_flags: 26
  recognizers:
  - name: CreditCardRecognizer
    supported_languages:
    - en
    type: predefined
  - name: ItFiscalCodeRecognizer
    type: predefined
The configuration file contains the following parameters:
- supported_languages: A list of languages the analyzer supports.
- default_score_threshold: The minimum score a detection must have in order to be returned.
- nlp_configuration: Configuration passed to the NLP engine, which detects PII entities and extracts features for the downstream logic.
- recognizer_registry: The recognizers to be used by the analyzer.

Note: supported_languages must be identical to the same field in recognizer_registry.
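To verify that the file was picked up, you can inspect a few of these values on the resulting engine. A minimal sketch, assuming the single-file example above; supported_languages and registry.recognizers are public attributes of the created AnalyzerEngine and its RecognizerRegistry:

from presidio_analyzer import AnalyzerEngineProvider

# Load the engine from the configuration file shown above
# (the path is the same illustrative path used in the example).
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file="./analyzer/analyzer-config-all.yml"
)
analyzer = provider.create_engine()

# supported_languages comes straight from the YAML.
print(analyzer.supported_languages)

# The registry holds only the recognizers listed under recognizer_registry.
print([r.name for r in analyzer.registry.recognizers])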
Using multiple files

Create an AnalyzerEngineProvider using a separate configuration file for each of the following components:
- Analyzer
- NLP Engine
- Recognizer Registry

Note: Each of these parameters is optional; if one is not set, the default configuration is used for that component (see the sketch at the end of this subsection).
from presidio_analyzer import AnalyzerEngine, AnalyzerEngineProvider
analyzer_conf_file = "./analyzer/analyzer-config.yml"
nlp_engine_conf_file = "./analyzer/nlp-config.yml"
recognizer_registry_conf_file = "./analyzer/recognizers-config.yml"
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file=analyzer_conf_file,
    nlp_engine_conf_file=nlp_engine_conf_file,
    recognizer_registry_conf_file=recognizer_registry_conf_file,
)
analyzer = provider.create_engine()
results = analyzer.analyze(text="My name is Morris", language="en")
print(results)
The structure of the configuration files is as follows:
- Analyzer engine configuration file:

  supported_languages:
  - en
  default_score_threshold: 0

- The NLP engine configuration file structure is examined thoroughly in the Customizing the NLP model section.
- The recognizer registry configuration file structure is examined thoroughly in the Customizing recognizer registry from file section.
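Because each of the three paths is optional, they can be combined freely: any component whose file is omitted falls back to its default configuration. A minimal sketch, reusing the illustrative recognizers-config.yml path from the example above:

from presidio_analyzer import AnalyzerEngineProvider

# Only the recognizer registry file is provided; the analyzer and the
# NLP engine fall back to their default configurations.
provider = AnalyzerEngineProvider(
    recognizer_registry_conf_file="./analyzer/recognizers-config.yml"
)
analyzer = provider.create_engine()

results = analyzer.analyze(text="My name is Morris", language="en")
print(results)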
Using the default configuration

Create an AnalyzerEngineProvider without any parameters. This loads the default configuration:

from presidio_analyzer import AnalyzerEngine, AnalyzerEngineProvider

analyzer = AnalyzerEngineProvider().create_engine()

results = analyzer.analyze(text="My name is Morris", language="en")
print(results)

The default configuration of AnalyzerEngine is defined in the default configuration files shipped with the presidio-analyzer package.
Enabling and disabling recognizers

In general, recognizers that are not added to the configuration are not created, with one exception.

Enabling/Disabling the NLP recognizer

The one exception is the recognizer that extracts the NlpEngine entities (e.g. SpacyRecognizer when the NlpEngine is SpacyNlpEngine, TransformersRecognizer when the engine is TransformersNlpEngine, and StanzaRecognizer when the engine is StanzaNlpEngine).
Recognizers (including the NLP recognizer) can be disabled by setting enabled: false in the YAML configuration. For example:
recognizer_registry:
  global_regex_flags: 26
  recognizers:
  - name: SpacyRecognizer
    type: predefined
    enabled: false
  - name: CreditCardRecognizer
    type: predefined
    enabled: true
supported_languages:
- en
default_score_threshold: 0.7
nlp_configuration:
  nlp_engine_name: spacy
  models:
  -
    lang_code: en
    model_name: en_core_web_lg
In this example, the SpacyRecognizer is disabled and the CreditCardRecognizer is enabled, so only the CREDIT_CARD PII entity is returned if detected.
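To see the effect, you can save the YAML above to a file and load it with AnalyzerEngineProvider. A minimal sketch; the file name disable-spacy-config.yml is illustrative, and it assumes the en_core_web_lg model referenced above is installed:

from presidio_analyzer import AnalyzerEngineProvider

# Load the configuration in which SpacyRecognizer is disabled and
# CreditCardRecognizer is enabled (file name is illustrative).
provider = AnalyzerEngineProvider(
    analyzer_engine_conf_file="./analyzer/disable-spacy-config.yml"
)
analyzer = provider.create_engine()

# The name "Morris" is not reported because the NLP recognizer is disabled;
# only the credit card number is expected to be detected.
results = analyzer.analyze(
    text="My name is Morris and my credit card is 4111111111111111",
    language="en",
)
print(results)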
Adding context words in YAML recognizers

Recognizers defined in YAML can also include a context field.
When used with AnalyzerEngine and a context enhancer, these words boost the score if they appear near the detected entity.
Example:
recognizers:
- name: "Date of Birth Recognizer"
  supported_entity: "DATE_TIME"
  supported_language: "en"
  patterns:
  - name: "DOB without slashes"
    regex: "((19|20)\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01]))"
    score: 0.8
  context:
  - DOB
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.context_aware_enhancers.lemma_context_aware_enhancer import LemmaContextAwareEnhancer

# Save the DOB recognizer YAML to disk.
# A raw string keeps the double backslashes intact, so the YAML parser
# receives \\d and the resulting regex contains \d.
dob_yaml = r"""
recognizers:
- name: "Date of Birth Recognizer"
  supported_entity: "DATE_TIME"
  supported_language: "en"
  patterns:
  - name: "DOB without slashes"
    regex: "((19|20)\\d{2}(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01]))"
    score: 0.8
  context:
  - DOB
"""

with open("dob_recognizer.yml", "w") as f:
    f.write(dob_yaml)

# Configure the NLP engine
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# Load the recognizer from YAML into a fresh registry
registry = RecognizerRegistry()
registry.add_recognizers_from_yaml("dob_recognizer.yml")

# Analyzer with the custom registry
analyzer = AnalyzerEngine(
    registry=registry,
    nlp_engine=nlp_engine,
    supported_languages=["en"]
)

text = "DOB: 19571012"

# Run the base analysis
results = analyzer.analyze(text=text, language="en")
print("Base results:", results)

# Apply the context enhancer explicitly, passing extra context words
enhancer = LemmaContextAwareEnhancer()
nlp_artifacts = analyzer.nlp_engine.process_text(text, language="en")
boosted = enhancer.enhance_using_context(
    text=text,
    raw_results=results,
    nlp_artifacts=nlp_artifacts,
    recognizers=registry.recognizers,
    context=["DOB"]
)
print("Boosted results:", boosted)