# Language Model-based PII/PHI Detection (Experimental Feature)

## Introduction

Presidio supports language model-based PII/PHI detection for flexible entity recognition using language models (LLMs, SLMs, etc.). This approach enables detection of both:

- PII (Personally Identifiable Information): names, emails, phone numbers, SSNs, credit cards, etc.
- PHI (Protected Health Information): medical records, health identifiers, etc.
(The default approach uses LangExtract under the hood to integrate with language model providers.)
## Entity Detection Capabilities
Unlike pattern-based recognizers, language model-based detection is flexible and depends on:
- The language model being used
- The prompt description provided
- The few-shot examples configured
The default configuration includes examples for common PII/PHI entities such as PERSON, EMAIL_ADDRESS, PHONE_NUMBER, US_SSN, CREDIT_CARD, MEDICAL_LICENSE, and more. You can customize the prompts and examples to detect any entity types relevant to your use case.
For the default entity mappings and examples, see the default configuration.
## Supported Language Model Providers
Presidio supports the following language model providers through LangExtract:
- Azure OpenAI - Cloud-based Azure OpenAI Service (GPT-4o, GPT-4, GPT-3.5-turbo, etc.)
- Ollama - Local language model deployment (open-source models like Gemma, Llama, etc.)
### Choosing Between Azure OpenAI and Ollama
| Feature | Azure OpenAI | Ollama |
|---|---|---|
| Deployment | Cloud (Azure) | Local (on-premises) |
| Cost | Pay-per-use (tokens) | Free (hardware required) |
| Models | GPT-4o, GPT-4, GPT-3.5-turbo | Open-source (Gemma, Llama, etc.) |
| Privacy | Microsoft Azure compliance | Complete data control |
| Setup | Azure Portal + API key/Managed Identity | Docker/local installation |
| Authentication | API Key or Managed Identity (RBAC) | None (local) |
| Best For | Production, enterprise compliance | Development, on-premises requirements |
Recommendations:
- Use Azure OpenAI for production workloads requiring enterprise security, compliance (HIPAA, SOC 2, etc.), and managed infrastructure
- Use Ollama for local development, testing, or when data must stay on-premises
## Language Model-based Recognizer Implementation
Presidio provides a hierarchy of recognizers for language model-based PII/PHI detection:
- `LMRecognizer`: abstract base class for all language model recognizers (LLMs, SLMs, etc.)
    - `LangExtractRecognizer`: abstract base class for the LangExtract library integration (model-agnostic)
        - `AzureOpenAILangExtractRecognizer`: concrete implementation for Azure OpenAI Service
        - `BasicLangExtractRecognizer`: concrete implementation whose `ModelConfig` is configured from YAML (supporting Ollama, OpenAI, Gemini, and other providers)
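Because both concrete classes share the same LangExtract base, they can be swapped without changing the rest of the pipeline. A minimal sketch (the constructor arguments follow the usage examples later on this page):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer
from presidio_analyzer.predefined_recognizers.third_party.basic_langextract_recognizer import (
    BasicLangExtractRecognizer,
)

use_azure = False  # flip depending on your deployment target

# Both recognizers expose the same recognizer interface, so the analyzer
# does not care which one is registered.
recognizer = (
    AzureOpenAILangExtractRecognizer(model_id="gpt-4")
    if use_azure
    else BasicLangExtractRecognizer()
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(recognizer)
```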
## Using Ollama (Local Models)

Ollama provides local deployment of open-source language models (Gemma, Llama, Qwen, etc.), keeping data fully on-premises.
### Prerequisites
- Install Presidio with LangExtract support:

    ```sh
    pip install presidio-analyzer[langextract]
    ```

- Set up Ollama. You have two options:

    **Option 1: Docker Compose (recommended for CPU)**

    This option requires Docker to be installed on your system. Run the following from the root presidio directory (where `docker-compose.yml` is located):

    ```sh
    docker compose up -d ollama
    docker exec presidio-ollama-1 ollama pull qwen2.5:1.5b
    docker exec presidio-ollama-1 ollama list
    ```

    Platform differences:

    - Linux/Mac: the commands above work as-is
    - Windows: use PowerShell or CMD; the commands are the same

    If you don't have Docker installed:

    - Linux: follow the Docker installation guide
    - Mac: install Docker Desktop for Mac
    - Windows: install Docker Desktop for Windows

    **Option 2: Native installation (recommended for GPU acceleration)**

    Follow the official LangExtract Ollama guide. After installation, pull and run the model:

    ```sh
    ollama pull qwen2.5:1.5b
    ollama run qwen2.5:1.5b
    ```

    This option provides better performance with GPU acceleration (e.g., on a Mac with Metal Performance Shaders or on systems with NVIDIA GPUs). The model must be pulled and running before using the recognizer. The default model is `qwen2.5:1.5b`.

- Configuration (optional): create your own `ollama_config.yaml` or use the default configuration.
### Usage
#### Option 1: Enable in configuration file

Enable the recognizer in `default_recognizers.yaml`:

```yaml
- name: BasicLangExtractRecognizer
  enabled: true  # Change from false to true
```
Then load the analyzer using this modified configuration file:
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.recognizer_registry import RecognizerRegistryProvider

# Point to your modified default_recognizers.yaml with Ollama enabled
provider = RecognizerRegistryProvider(
    conf_file="/path/to/your/modified/default_recognizers.yaml"
)
registry = provider.create_recognizer_registry()

# Create an analyzer with the registry that includes the Ollama recognizer
analyzer = AnalyzerEngine(registry=registry, supported_languages=["en"])

# Analyze text - the Ollama recognizer will participate in detection
results = analyzer.analyze(text="My email is john.doe@example.com", language="en")
```
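Each match comes back as a standard Presidio `RecognizerResult`, so it can be inspected like the output of any other recognizer:

```python
# Print the detected entity type, character span, and confidence score
for result in results:
    print(result.entity_type, result.start, result.end, result.score)
```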
#### Option 2: Add programmatically

```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers.third_party.basic_langextract_recognizer import (
    BasicLangExtractRecognizer,
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(BasicLangExtractRecognizer())

results = analyzer.analyze(text="My email is john.doe@example.com", language="en")
```
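As with any Presidio recognizer, detection can also be limited to specific entity types through the analyzer's standard `entities` parameter:

```python
# Restrict detection to email addresses only
results = analyzer.analyze(
    text="My email is john.doe@example.com",
    entities=["EMAIL_ADDRESS"],
    language="en",
)
```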
!!! note
    The recognizer is disabled by default in `default_recognizers.yaml` to avoid requiring Ollama for basic Presidio usage. Enable it once you have Ollama set up and running.
### Custom Configuration
To use a custom configuration file:
```python
analyzer.registry.add_recognizer(
    BasicLangExtractRecognizer(config_path="/path/to/custom_config.yaml")
)
```
### Configuration Options
The `langextract_config_ollama.yaml` file supports the following options (a sketch of a complete file follows the list):

- `model_id`: the model to use
- `provider.name`: the model provider (e.g., `ollama`, `openai`)
- `provider.kwargs`: kwargs to pass to the model provider (e.g., `model_url` for Ollama, `base_url` for OpenAI)
- `provider.extract_params`: extraction parameters (e.g., `use_schema_constraints`, `fence_output`, `temperature`)
- `provider.language_model_params`: parameters for the model itself (e.g., `timeout`, `num_ctx`)
- `supported_entities`: PII/PHI entity types to detect
- `entity_mappings`: map LangExtract entity classes to Presidio entity names
- `min_score`: minimum confidence score (default: `0.5`)
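A minimal sketch of what such a file could look like, assuming the nesting implied by the dotted key names above (the values are illustrative, not shipped defaults):

```yaml
model_id: qwen2.5:1.5b
provider:
  name: ollama
  kwargs:
    model_url: http://localhost:11434  # where your Ollama server listens
  extract_params:
    use_schema_constraints: true
  language_model_params:
    timeout: 120  # seconds to wait for a model response
supported_entities:
  - PERSON
  - EMAIL_ADDRESS
  - PHONE_NUMBER
entity_mappings:
  # LangExtract entity class -> Presidio entity name (illustrative)
  person: PERSON
  email: EMAIL_ADDRESS
min_score: 0.5
```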
See the configuration file for all options.
### Troubleshooting
**ConnectionError: "Ollama server not reachable"**

- Ensure Ollama is running: `docker ps`, or check http://localhost:11434
- Verify the `model_url` in your configuration matches your Ollama server address
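A quick way to check reachability from Python, assuming the default Ollama port (the `/api/tags` endpoint lists locally available models):

```python
import json
import urllib.request

# Query the Ollama REST API; a successful response means the server is up.
with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=5) as resp:
    tags = json.load(resp)

print([model["name"] for model in tags.get("models", [])])
```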
**RuntimeError: "Model 'qwen2.5:1.5b' not found"**

- Pull the model: `docker exec -it presidio-ollama-1 ollama pull qwen2.5:1.5b`
- Or, for a native installation: `ollama pull qwen2.5:1.5b`
- Verify the model name matches the `model_id` in your configuration
## Using Azure OpenAI (Cloud Models)
Azure OpenAI provides cloud-based access to OpenAI models (GPT-4o, GPT-4, GPT-3.5-turbo) with enterprise security and compliance features.
### Prerequisites
- Install Presidio with LangExtract support:

    ```sh
    pip install presidio-analyzer[langextract]
    ```

    This installs langextract with OpenAI support, including the OpenAI Python SDK and the Azure Identity libraries.

- Azure subscription: create one at azure.microsoft.com.

- Azure OpenAI resource:

    - Create an Azure OpenAI resource in the Azure Portal
    - Request access if needed (some regions require approval)

- Deploy a model and note the deployment name you choose (e.g., "gpt-4", "my-gpt-deployment").

- Optional: download the config file (only if customizing entities/prompts):

    ```sh
    # On macOS/Linux/PowerShell:
    wget https://raw.githubusercontent.com/microsoft/presidio/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml

    # Or download manually from:
    # https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml
    ```
### Authentication Options
Azure OpenAI supports multiple authentication methods with flexible configuration:
#### Option 1: Direct Parameters (Recommended for Most Users)

The simplest approach: pass the credentials and deployment name as parameters.
```python
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer

# Initialize with the deployment name and credentials
azure_openai = AzureOpenAILangExtractRecognizer(
    model_id="gpt-4",  # Your Azure deployment name
    azure_endpoint="https://your-resource.openai.azure.com/",
    api_key="your-api-key-here",
    api_version="2024-02-15-preview",  # Optional
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(azure_openai)

results = analyzer.analyze(
    text="My email is john.doe@example.com and my phone is 555-123-4567",
    language="en",
)
```
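To confirm that a match actually came from the Azure OpenAI recognizer (rather than, say, a pattern-based one), you can request Presidio's standard decision-process output:

```python
# return_decision_process attaches an explanation to each result,
# including the name of the recognizer that produced it
results = analyzer.analyze(
    text="My email is john.doe@example.com",
    language="en",
    return_decision_process=True,
)

for result in results:
    explanation = result.analysis_explanation
    print(result.entity_type, explanation.recognizer if explanation else None)
```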
#### Option 2: Environment Variables

Use environment variables for credentials:

```python
import os

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer

# Set environment variables
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key-here"

# Initialize with just the deployment name
azure_openai = AzureOpenAILangExtractRecognizer(
    model_id="gpt-4"  # Your Azure deployment name
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(azure_openai)

results = analyzer.analyze(
    text="My email is john.doe@example.com and my phone is 555-123-4567",
    language="en",
)
```
#### Option 3: Managed Identity (Production)

More secure: no API keys in code; authentication uses Azure RBAC.

```python
import os

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer

# Set the endpoint only (no API key = managed identity is used)
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"

# Initialize without an API key (uses managed identity)
azure_openai = AzureOpenAILangExtractRecognizer(
    model_id="gpt-4"  # Your Azure deployment name
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(azure_openai)

results = analyzer.analyze(
    text="Patient John Smith has SSN 123-45-6789",
    language="en",
)
```
**Managed identity authentication flow (production):**

When `api_key` is not provided, the provider automatically uses `ChainedTokenCredential`, which tries credentials in this order (sketched below):

- `EnvironmentCredential`: service principal from environment variables
- `WorkloadIdentityCredential`: Azure Kubernetes Service workload identity
- `ManagedIdentityCredential`: Azure VM/App Service managed identity
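For reference, the equivalent chain built directly with the `azure-identity` package looks roughly like this (a sketch of what the provider does internally, not code you need to write yourself):

```python
from azure.identity import (
    ChainedTokenCredential,
    EnvironmentCredential,
    ManagedIdentityCredential,
    WorkloadIdentityCredential,
)

# Each credential is tried in order until one can provide a token.
# Note: WorkloadIdentityCredential expects the AKS workload-identity
# environment variables to be present when it is constructed.
credential = ChainedTokenCredential(
    EnvironmentCredential(),
    WorkloadIdentityCredential(),
    ManagedIdentityCredential(),
)
```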
For local development, set `ENV=development` to use `DefaultAzureCredential` instead:

```python
import os

from presidio_analyzer.predefined_recognizers import AzureOpenAILangExtractRecognizer

os.environ["ENV"] = "development"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-resource.openai.azure.com/"

# AZURE_OPENAI_API_KEY is not set - uses DefaultAzureCredential in development
# mode (includes Azure CLI, VS Code, etc.)
azure_openai = AzureOpenAILangExtractRecognizer()
```
**Set up managed identity:**

- Enable managed identity on your Azure resource (VM, App Service, Container Instance, etc.)
- Grant the managed identity the "Cognitive Services OpenAI User" role on your Azure OpenAI resource (a CLI sketch follows this list)
- No API keys are needed - authentication is automatic
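If you manage roles from the command line, the role grant can be done with the Azure CLI. A sketch with placeholder values (the principal ID, subscription, resource group, and resource name are all assumptions to replace with your own):

```sh
# Grant the managed identity access to the Azure OpenAI resource
az role assignment create \
  --assignee "<managed-identity-principal-id>" \
  --role "Cognitive Services OpenAI User" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.CognitiveServices/accounts/<azure-openai-resource>"
```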
See Azure Managed Identity documentation for details.
### Configuration File (Optional)
The configuration file is optional for basic usage. You only need it to customize:
- Supported entity types
- Entity mappings (LangExtract → Presidio)
- Prompts and examples
- Detection parameters
For basic usage, just pass `model_id` as a parameter (see the examples above).

When you need a custom config:

- Download the default config:

    ```sh
    wget https://raw.githubusercontent.com/microsoft/presidio/main/presidio-analyzer/presidio_analyzer/conf/langextract_config_azureopenai.yaml
    ```

- Customize entities, prompts, or other settings in the file.

- Use the customized config:

    ```python
    recognizer = AzureOpenAILangExtractRecognizer(
        model_id="gpt-4",  # Can override the config's model_id
        config_path="./custom_config.yaml",
        azure_endpoint="...",
        api_key="...",
    )
    ```
**Configuration reference:**

The config file contains two main sections, sketched together after the lists below.

`lm_recognizer` section (LLM recognizer settings):

- `supported_entities`: list of PII/PHI entity types to detect
- `labels_to_ignore`: entity labels to skip during processing
- `enable_generic_consolidation`: whether to consolidate unknown entities to `GENERIC_PII_ENTITY`
- `min_score`: minimum confidence score threshold (0.0-1.0)

`langextract` section (LangExtract-specific settings):

- `model.model_id`: Azure OpenAI deployment name (e.g., "gpt-4o", "gpt-4", "gpt-35-turbo")
- `model.temperature`: model temperature for generation (`null` = use the model default)
- `prompt_file`: path to a custom prompt template file
- `examples_file`: path to a few-shot examples file
- `entity_mappings`: map LangExtract entity classes to Presidio entity names
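A rough sketch of the two sections together (illustrative values and hypothetical file paths; consult the shipped config for the authoritative layout):

```yaml
lm_recognizer:
  supported_entities:
    - PERSON
    - EMAIL_ADDRESS
    - US_SSN
  labels_to_ignore: []
  enable_generic_consolidation: true
  min_score: 0.5

langextract:
  model:
    model_id: gpt-4o   # your Azure deployment name
    temperature: null  # null = use the model default
  prompt_file: prompts/pii_prompt.txt       # hypothetical path
  examples_file: prompts/pii_examples.yaml  # hypothetical path
  entity_mappings:
    person: PERSON
    email: EMAIL_ADDRESS
```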
See the full config file for details.