No-code configuration

No-code configuration can be helpful in three scenarios:

1. An existing set of regular expressions or deny-lists should be leveraged within Presidio.
2. A simple way is needed to enable or disable recognizers and to configure the NLP engine.
3. Team members want to change the configuration without writing code.

In this example, we'll show how to create a no-code configuration in Presidio. We start by creating YAML configuration files based on the default ones. Presidio ships a default configuration file for each of its three components: the analyzer, the recognizer registry, and the NLP engine.

Alternatively, one can create a single configuration file for all three components. In this example, we'll tweak the configuration to reduce the number of predefined recognizers to only a few, and add a new custom one. We'll also adjust the context words to support detection in a second language (Spanish).

import yaml
import json
import tempfile
from pprint import pprint
from presidio_analyzer import AnalyzerEngineProvider

In this example we create the YAML as a string for illustration purposes, but the more common scenario is to keep these YAML files on disk and load them with `AnalyzerEngineProvider`.

Defining the configuration in YAML format

General Analyzer parameters

(default file)

analyzer_config_yaml = """
supported_languages: 
  - en
  - es
default_score_threshold: 0.4
"""

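As a rough illustration of what `default_score_threshold` controls, the sketch below filters a list of detections by score. `Detection` and `apply_threshold` are hypothetical names used only here, the scores are made up, and the sketch assumes that a detection scoring exactly at the threshold is kept.

```python
from collections import namedtuple

# Hypothetical stand-in for a detection; this is not a Presidio class.
Detection = namedtuple("Detection", ["entity_type", "score"])


def apply_threshold(detections, threshold):
    """Keep only detections whose score is at or above the threshold."""
    return [d for d in detections if d.score >= threshold]


# Made-up scores, chosen to show one detection on each side of the threshold.
candidates = [
    Detection("CREDIT_CARD", 1.0),
    Detection("DATE_TIME", 0.3),
    Detection("PHONE_NUMBER", 0.4),
]

kept = apply_threshold(candidates, threshold=0.4)
for detection in kept:
    print(detection.entity_type, detection.score)
```

With a threshold of 0.4, the low-scoring `DATE_TIME` candidate is dropped while the other two remain.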
Recognizer Registry parameters

(default file)

recognizer_registry_config_yaml = """
recognizer_registry:
  supported_languages: 
  - en
  - es
  global_regex_flags: 26

  recognizers:
  - name: CreditCardRecognizer
    supported_languages:
    - language: en
      context: [credit, card, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]
    - language: es
      context: [tarjeta, credito, visa, mastercard, cc, amex, discover, jcb, diners, maestro, instapayment]
    type: predefined

  - name: DateRecognizer
    supported_languages:
    - language: en
      context: [date, time, birthday, birthdate, dob]
    - language: es
      context: [fecha, tiempo, hora, nacimiento, dob]
    type: predefined

  - name: EmailRecognizer
    supported_languages:
    - language: en
      context: [email, mail, address]
    - language: es
      context: [correo, electrónico, email]
    type: predefined

  - name: PhoneRecognizer
    type: predefined
    supported_languages:
    - language: en
      context: [phone, number, telephone, fax]
    - language: es
      context: [teléfono, número, fax]

  - name: "Titles recognizer (en)"
    supported_language: "en"
    supported_entity: "TITLE"
    deny_list:
      - Mr.
      - Mrs.
      - Ms.
      - Miss
      - Dr.
      - Prof.
      - Doctor
      - Professor
  - name: "Titles recognizer (es)"
    supported_language: "es"
    supported_entity: "TITLE"
    deny_list:
      - Sr.
      - Señor
      - Sra.
      - Señora
      - Srta.
      - Señorita
      - Dr.
      - Doctor
      - Doctora
      - Prof.
      - Profesor
      - Profesora
"""

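Two details of the registry configuration above can be illustrated with the standard `re` module alone. First, `global_regex_flags: 26` is the integer value of `re.DOTALL | re.MULTILINE | re.IGNORECASE`. Second, a deny-list can be thought of as one alternation regex that matches whole tokens. The helper `deny_list_to_regex` below is a hypothetical sketch of that idea and may differ from the pattern Presidio builds internally.

```python
import re

# 26 is the combined integer value of three standard regex flags (16 + 8 + 2).
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE
print(int(flags))


def deny_list_to_regex(deny_list):
    """Build one alternation that matches any deny-list term as a whole token."""
    escaped = "|".join(re.escape(term) for term in deny_list)
    # A term must start at the beginning of the text or after a non-word
    # character, and end before a non-word character or the end of the text.
    return rf"(?:^|(?<=\W))({escaped})(?:(?=\W)|$)"


titles = ["Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Prof.", "Doctor", "Professor"]
pattern = re.compile(deny_list_to_regex(titles), flags)

sample = "dr. Johnson met Professor Smith and Mrs. Lee."
matches = [m.group(1) for m in pattern.finditer(sample)]
print(matches)
```

Because the flags include `re.IGNORECASE`, the lowercase `dr.` in the sample is matched even though the deny-list entry is `Dr.`.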
NLP Engine parameters

(default file)

nlp_engine_yaml = """
nlp_configuration:
    nlp_engine_name: transformers
    models:
      -
        lang_code: en
        model_name:
          spacy: en_core_web_sm
          transformers: StanfordAIMI/stanford-deidentifier-base
      -
        lang_code: es
        model_name:
          spacy: es_core_news_sm
          transformers: MMG/xlm-roberta-large-ner-spanish  
    ner_model_configuration:
      labels_to_ignore:
      - O
      aggregation_strategy: first # "simple", "first", "average", "max"
      stride: 16
      alignment_mode: expand # "strict", "contract", "expand"
      model_to_presidio_entity_mapping:
        PER: PERSON
        PERSON: PERSON
        LOC: LOCATION
        LOCATION: LOCATION
        GPE: LOCATION
        ORG: ORGANIZATION
        ORGANIZATION: ORGANIZATION
        NORP: NRP
        AGE: AGE
        ID: ID
        EMAIL: EMAIL
        PATIENT: PERSON
        STAFF: PERSON
        HOSP: ORGANIZATION
        PATORG: ORGANIZATION
        DATE: DATE_TIME
        TIME: DATE_TIME
        PHONE: PHONE_NUMBER
        HCW: PERSON
        HOSPITAL: LOCATION
        FACILITY: LOCATION
        VENDOR: ORGANIZATION
        MISC: ID

      low_confidence_score_multiplier: 0.4
      low_score_entity_names:
      - ID
"""

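The interplay of `labels_to_ignore`, `model_to_presidio_entity_mapping`, `low_score_entity_names`, and `low_confidence_score_multiplier` can be sketched in plain Python. The function `map_prediction` is hypothetical, and the sketch makes two assumptions: unmapped labels pass through unchanged, and the multiplier is applied to the entity name after mapping. Presidio's actual handling may differ in both respects.

```python
# A subset of the mapping from the configuration above.
MAPPING = {
    "PER": "PERSON",
    "LOC": "LOCATION",
    "HOSPITAL": "LOCATION",
    "MISC": "ID",
    "ID": "ID",
}
LABELS_TO_IGNORE = {"O"}
LOW_SCORE_ENTITY_NAMES = {"ID"}
LOW_CONFIDENCE_SCORE_MULTIPLIER = 0.4


def map_prediction(label, score):
    """Translate a model label into a Presidio entity name and adjust its score.

    Returns None for ignored labels, otherwise an (entity, score) tuple.
    """
    if label in LABELS_TO_IGNORE:
        return None
    # Assumption: labels missing from the mapping are kept as they are.
    entity = MAPPING.get(label, label)
    # Assumption: the multiplier applies to the mapped entity name.
    if entity in LOW_SCORE_ENTITY_NAMES:
        score = score * LOW_CONFIDENCE_SCORE_MULTIPLIER
    return entity, score


for label, score in [("O", 0.99), ("PER", 0.9), ("HOSPITAL", 0.8), ("ID", 0.5)]:
    print(label, "->", map_prediction(label, score))
```

Scaling down `ID` detections makes it more likely that they fall below `default_score_threshold`, which is useful when a model label is known to be noisy.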
Creating the analyzer engine and running it

Create a unified YAML file and save it as a temp file

full_config = f"{analyzer_config_yaml}\n{recognizer_registry_config_yaml}\n{nlp_engine_yaml}"

with tempfile.NamedTemporaryFile(mode='w+', delete=False, suffix='.yaml') as temp_file:
    # Write the YAML string to the temp file
    temp_file.write(full_config)
    temp_file_path = temp_file.name

Pass the YAML file to `AnalyzerEngineProvider` to create an `AnalyzerEngine` instance

analyzer_engine = AnalyzerEngineProvider(analyzer_engine_conf_file=temp_file_path).create_engine()
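Because the temp file is created with `delete=False`, it stays on disk after the engine has been built. A minimal sketch of one way to clean it up follows. The helper `with_temp_config` and the callback `read_back` are hypothetical. In this notebook the callback would instead be a function that builds the engine from the path.

```python
import os
import tempfile


def with_temp_config(config_text, consume):
    """Write config_text to a temp YAML file, hand its path to `consume`,
    and delete the file afterwards, even if `consume` raises."""
    with tempfile.NamedTemporaryFile(mode="w", delete=False, suffix=".yaml") as temp_file:
        temp_file.write(config_text)
        path = temp_file.name
    try:
        return consume(path), path
    finally:
        os.remove(path)


def read_back(path):
    """Stand-in consumer that simply reads the file content."""
    with open(path) as handle:
        return handle.read()


content, used_path = with_temp_config("supported_languages:\n  - en\n  - es\n", read_back)
print(content)
print(os.path.exists(used_path))
```

Closing the file before passing its path on avoids problems on platforms that do not allow a second open of a file that is still held open.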

Print the loaded configuration for both languages

for lang in ("en", "es"):
    # Entities the engine can detect for this language
    print(f"Supported entities for {lang}:")
    print("\n")
    pprint(analyzer_engine.get_supported_entities(lang), compact=True)

    # Names of all recognizers loaded for this language
    print(f"\nLoaded recognizers for {lang}:")
    pprint([rec.name for rec in analyzer_engine.registry.get_recognizers(lang, all_fields=True)], compact=True)
    print("\n")

print("\nLoaded NER models:")
pprint(analyzer_engine.nlp_engine.models)

Run two requests, one in English and one in Spanish

es_text = "Hola, me llamo David Johnson y soy originalmente de Liverpool. Mi número de tarjeta de crédito es 4095260993934932"
analyzer_engine.analyze(es_text, language="es")

en_text = "Hi, my name is David Johnson and I'm originally from Liverpool. My credit card number is 4095260993934932"
analyzer_engine.analyze(en_text, language="en")
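Each result returned by `analyze` carries an `entity_type`, `start` and `end` character offsets, and a `score`. The sketch below shows how those offsets map back to the input text. It uses a hypothetical `Span` tuple with the same attribute names instead of real analyzer output, and the spans and scores are illustrative values, not the output of the engine configured above.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the same attribute names as an analyzer result.
Span = namedtuple("Span", ["entity_type", "start", "end", "score"])


def span_for(text, value, entity_type, score):
    """Build a Span for the first occurrence of `value` in `text`."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value), score)


def describe(text, results):
    """Return (entity_type, matched_text, score) tuples in order of appearance."""
    ordered = sorted(results, key=lambda r: r.start)
    return [(r.entity_type, text[r.start:r.end], r.score) for r in ordered]


en_text = (
    "Hi, my name is David Johnson and I'm originally from Liverpool. "
    "My credit card number is 4095260993934932"
)

# Illustrative spans and scores only.
illustrative_results = [
    span_for(en_text, "4095260993934932", "CREDIT_CARD", 1.0),
    span_for(en_text, "David Johnson", "PERSON", 0.85),
    span_for(en_text, "Liverpool", "LOCATION", 0.85),
]

for entity_type, matched_text, score in describe(en_text, illustrative_results):
    print(f"{entity_type:12} {score:.2f} {matched_text}")
```

The same `describe` helper should work on the list returned by `analyzer_engine.analyze`, since it only relies on those four attributes.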
