Azure Health Data Services De-identification Service Integration
Presidio integrates with the Azure Health Data Services (AHDS) de-identification service for both entity recognition and anonymization with realistic surrogate generation.
Overview
The AHDS de-identification service integration provides two main capabilities:
- AHDS Recognizer (in presidio-analyzer): Detects PHI entities using the Azure Health Data Services de-identification service
- AHDS Surrogate Operator (in presidio-anonymizer): Replaces PHI entities with realistic surrogates using the de-identification service
Benefits of Surrogation
- Maintains Data Utility: Preserves structure and format for downstream analytics
- Realistic Healthcare Context: Generates medically plausible names, dates, and identifiers
- Consistent Cross-References: The same entity gets the same surrogate throughout a document
- Format Preservation: Maintains original formatting and linguistic patterns
Installation
For AHDS Recognizer
pip install presidio-analyzer[ahds]
For AHDS Surrogate Operator
pip install presidio-anonymizer[ahds]
Prerequisites
- Azure Health Data Services de-identification service endpoint
- Azure role-based access control configured
- Environment variables:
  - AHDS_ENDPOINT: Your AHDS de-identification service endpoint
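Since everything else depends on AHDS_ENDPOINT being set, a small preflight check can fail early with a clear message. The sketch below is illustrative only: `read_ahds_endpoint` is a hypothetical helper, not part of Presidio, and the requirement that the value be an https URL is an assumption of this sketch.

```python
import os
from urllib.parse import urlparse


def read_ahds_endpoint(environ=os.environ):
    """Return the AHDS endpoint from the environment, or raise with a clear message.

    Hypothetical helper for illustration; not part of Presidio.
    """
    endpoint = environ.get("AHDS_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AHDS_ENDPOINT is not set")
    parsed = urlparse(endpoint)
    # Requiring https is an assumption of this sketch, not a documented Presidio check.
    if parsed.scheme != "https" or not parsed.netloc:
        raise RuntimeError(f"AHDS_ENDPOINT does not look like an https URL: {endpoint!r}")
    return endpoint
```

Calling `read_ahds_endpoint()` before creating the engines surfaces a missing or malformed endpoint before any request is made.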
Complete Workflow Example
import os
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Set your AHDS endpoint
os.environ["AHDS_ENDPOINT"] = "https://your-ahds-endpoint.api.eus001.deid.azure.com"
# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Medical text with PHI
text = "Patient John Doe was seen by Dr. Smith on 2024-01-15 for diabetes treatment."
# Step 1: Detect entities using AHDS recognizer
analyzer_results = analyzer.analyze(
    text=text,
    entities=["PATIENT", "DOCTOR", "DATE"],
    language="en",
)
# Step 2: Anonymize using AHDS surrogate generation
result = anonymizer.anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={
        "DEFAULT": OperatorConfig("surrogate", {
            "entities": analyzer_results,
            "input_locale": "en-US",
            "surrogate_locale": "en-US",
        })
    },
)
print(f"Original: {text}")
print(f"Anonymized: {result.text}")
# Example output (actual surrogate values will differ):
# "Patient Michael Johnson was seen by Dr. Brown on 1987-06-23 for diabetes treatment."
Note: This example uses the surrogation feature of the Azure Health Data Services de-identification service, which generates realistic, medically appropriate surrogates while maintaining document structure and relationships.
Configuration Options
AHDS Surrogate Operator Parameters
- endpoint: AHDS de-identification service endpoint (optional, uses AHDS_ENDPOINT env var)
- entities: List of entities detected by analyzer
- input_locale: Input locale (default: "en-US")
- surrogate_locale: Surrogate locale (default: "en-US")
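To make the defaults concrete, the parameter dictionary can be assembled in one place. This is a minimal sketch: `build_surrogate_params` is a hypothetical helper (not a Presidio function), and the keys simply mirror the parameters listed above.

```python
def build_surrogate_params(entities, endpoint=None,
                           input_locale="en-US", surrogate_locale="en-US"):
    """Assemble the parameter dict for the "surrogate" operator.

    Hypothetical helper; the keys mirror the parameters listed above.
    """
    params = {
        "entities": entities,
        "input_locale": input_locale,
        "surrogate_locale": surrogate_locale,
    }
    # Leave "endpoint" out when it is not given, so the operator can use AHDS_ENDPOINT.
    if endpoint is not None:
        params["endpoint"] = endpoint
    return params


# An empty list stands in for the analyzer results here.
params = build_surrogate_params(entities=[])
print(sorted(params))
```

The resulting dictionary can be passed as the second argument of `OperatorConfig("surrogate", params)`, with the analyzer results supplied as `entities`, as in the workflow example.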
Authentication
The AHDS de-identification service integration uses Azure authentication with a secure-by-default approach:
Production Mode (Default)
By default, the integration uses a restricted credential chain that only includes:
- EnvironmentCredential: Service Principal via environment variables
- WorkloadIdentityCredential: Workload Identity (Kubernetes)
- ManagedIdentityCredential: Managed Identity (Azure services)
This credential chain is more secure as it excludes interactive browser logins and developer credentials, making it suitable for production deployments.
Development Mode
For local development, you can enable DefaultAzureCredential by setting the environment variable:
export ENV=development
In development mode, the integration uses DefaultAzureCredential, which supports additional authentication methods:
- Environment variables (Service Principal)
- Workload Identity
- Managed Identity
- Azure CLI (az login)
- Azure PowerShell
- Visual Studio/VS Code credentials
- Interactive browser login
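The switch between the two modes can be pictured as a small decision on the ENV variable. The sketch below only returns credential class names from the azure-identity package and does not construct real credentials. `credential_plan` is a hypothetical name, and the exact comparison Presidio applies to ENV is an assumption here.

```python
import os

# Credential class names as listed in this section (azure-identity package).
PRODUCTION_CHAIN = (
    "EnvironmentCredential",
    "WorkloadIdentityCredential",
    "ManagedIdentityCredential",
)


def credential_plan(environ=os.environ):
    """Return the credential class names this sketch would use for the current mode.

    Illustrative only: the exact comparison Presidio applies to ENV is an assumption.
    """
    if environ.get("ENV", "") == "development":
        return ("DefaultAzureCredential",)
    return PRODUCTION_CHAIN


print(credential_plan({"ENV": "development"}))
print(credential_plan({}))
```

One way to build such a restricted chain with azure-identity is its `ChainedTokenCredential` class, which tries each credential it is given in order.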
Local Development Setup
For local development with Azure CLI authentication:
# Set development mode
export ENV=development
# Login with Azure CLI
az login
# Set your AHDS endpoint
export AHDS_ENDPOINT="https://your-ahds-endpoint.api.eus001.deid.azure.com"
# Now you can run Presidio with AHDS integration
python your_script.py
Production Deployment Recommendations
For production deployments, we recommend:
- Do not set ENV=development (use default production mode)
- Use Managed Identity when running on Azure services (AKS, App Service, Functions, etc.)
- Use Workload Identity for Kubernetes deployments
- Use Service Principal with environment variables for other scenarios:
export AZURE_CLIENT_ID="<client-id>"
export AZURE_CLIENT_SECRET="<client-secret>"
export AZURE_TENANT_ID="<tenant-id>"
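When the Service Principal route is used, a missing variable is a common cause of authentication failures. As a rough sketch, the hypothetical helper below (not part of Presidio or azure-identity) reports which of the three variables are unset or empty.

```python
import os

SERVICE_PRINCIPAL_VARS = ("AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET", "AZURE_TENANT_ID")


def missing_service_principal_vars(environ=os.environ):
    """Return the service principal variables that are unset or empty."""
    return [name for name in SERVICE_PRINCIPAL_VARS if not environ.get(name)]


missing = missing_service_principal_vars()
if missing:
    print("Missing service principal variables:", ", ".join(missing))
```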
Environment Variables Summary
- AHDS_ENDPOINT: Your AHDS de-identification service endpoint (required)
- ENV: Set to development for local dev, omit for production (optional, default: production mode)
Troubleshooting
Common Issues
- ModuleNotFoundError: Install the AHDS optional dependencies
  pip install presidio-analyzer[ahds] presidio-anonymizer[ahds]
- Authentication errors: Ensure Azure credentials are properly configured
  az login # For local development
- Endpoint not found: Verify the AHDS_ENDPOINT environment variable is set correctly
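To narrow down a ModuleNotFoundError, it can help to check which modules are importable before running the full pipeline. The following is a minimal sketch using only the standard library. `missing_modules` is a hypothetical helper, and the module names passed to it at the end are assumptions about what the AHDS extras install.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, when a parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Assumed module names for the AHDS extras; adjust to what your install provides.
print(missing_modules(["azure.identity", "azure.health.deidentification"]))
```

An empty list means every listed module was found. Any name that is printed points at a dependency that still needs to be installed.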
Testing without AHDS
The AHDS operators gracefully handle missing dependencies and will be skipped if not available. Tests will be automatically skipped if the AHDS_ENDPOINT environment variable is not set.
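The same skip-when-unconfigured behaviour can be reproduced in your own test suite. The sketch below uses the standard library's `unittest` as a stand-in and is not the mechanism Presidio's own tests use. The class name and test body are placeholders.

```python
import os
import unittest


@unittest.skipUnless(os.environ.get("AHDS_ENDPOINT"), "AHDS_ENDPOINT is not set")
class AhdsIntegrationTest(unittest.TestCase):
    def test_endpoint_is_configured(self):
        # Placeholder body; a real test would call the AHDS recognizer or operator.
        self.assertTrue(os.environ["AHDS_ENDPOINT"].startswith("https://"))
```

Running this with `python -m unittest` reports the test as skipped when AHDS_ENDPOINT is unset, rather than as a failure.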