Azure Health Data Services De-identification Service Integration
Presidio integrates with the Azure Health Data Services (AHDS) de-identification service for both entity recognition and anonymization with realistic surrogate generation.
Overview
The AHDS de-identification service integration provides two main capabilities:
- AHDS Recognizer (in presidio-analyzer): Detects PHI entities using the Azure Health Data Services de-identification service
- AHDS Surrogate Operator (in presidio-anonymizer): Replaces PHI entities with realistic surrogates using the de-identification service
Benefits of Surrogation
- Maintains Data Utility: Preserves structure and format for downstream analytics
- Realistic Healthcare Context: Generates medically plausible names, dates, and identifiers
- Consistent Cross-References: The same entity gets the same surrogate throughout a document
- Format Preservation: Maintains original formatting and linguistic patterns
Installation
For AHDS Recognizer
pip install presidio-analyzer[ahds]
For AHDS Surrogate Operator
pip install presidio-anonymizer[ahds]
Prerequisites
- Azure Health Data Services de-identification service endpoint
- Azure role-based access control configured
- Environment variables:
  - AHDS_ENDPOINT: Your AHDS de-identification service endpoint
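Before running the examples, it can help to confirm that the endpoint variable is present and looks like an HTTPS URL. The following is a minimal sketch using only the standard library. The helper name `read_ahds_endpoint` is hypothetical and not part of Presidio, and the HTTPS check is an assumption about the endpoint format.

```python
import os
from urllib.parse import urlparse


def read_ahds_endpoint(env=None):
    """Return the AHDS endpoint from the environment or raise a clear error.

    Hypothetical helper for illustration; not part of Presidio.
    """
    env = os.environ if env is None else env
    endpoint = env.get("AHDS_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AHDS_ENDPOINT is not set")
    parsed = urlparse(endpoint)
    # Assumption: the service endpoint is an https:// URL with a host name.
    if parsed.scheme != "https" or not parsed.netloc:
        raise RuntimeError(
            f"AHDS_ENDPOINT does not look like an HTTPS URL: {endpoint!r}"
        )
    return endpoint
```

Calling `read_ahds_endpoint()` at startup surfaces a missing or malformed value before any request is sent to the service.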
Complete Workflow Example
import os
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# Set your AHDS endpoint
os.environ["AHDS_ENDPOINT"] = "https://your-ahds-endpoint.api.eus001.deid.azure.com"
# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
# Medical text with PHI
text = "Patient John Doe was seen by Dr. Smith on 2024-01-15 for diabetes treatment."
# Step 1: Detect entities using AHDS recognizer
analyzer_results = analyzer.analyze(
    text=text,
    entities=["PATIENT", "DOCTOR", "DATE"],
    language="en"
)
# Step 2: Anonymize using AHDS surrogate generation
result = anonymizer.anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={
        "DEFAULT": OperatorConfig("surrogate", {
            "entities": analyzer_results,
            "input_locale": "en-US",
            "surrogate_locale": "en-US"
        })
    },
)
print(f"Original: {text}")
print(f"Anonymized: {result.text}")
# Output: "Patient Michael Johnson was seen by Dr. Brown on 1987-06-23 for diabetes treatment."
Note: This example uses the surrogation feature of the Azure Health Data Services de-identification service, which replaces PHI with realistic, medically appropriate surrogates while preserving document structure and the relationships between entities.
Configuration Options
AHDS Surrogate Operator Parameters
- endpoint: AHDS de-identification service endpoint (optional; defaults to the AHDS_ENDPOINT environment variable)
- entities: List of entities detected by the analyzer
- input_locale: Input locale (default: "en-US")
- surrogate_locale: Surrogate locale (default: "en-US")
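The parameters above can be assembled in one place before they are handed to the operator. The sketch below builds the parameter dictionary with the documented defaults. The helper name `build_surrogate_params` is hypothetical, and omitting `endpoint` relies on the documented fallback to `AHDS_ENDPOINT`.

```python
def build_surrogate_params(entities, endpoint=None,
                           input_locale="en-US", surrogate_locale="en-US"):
    """Assemble the parameter dict for the "surrogate" operator.

    Hypothetical helper for illustration; the keys follow the parameter
    list in this section.
    """
    params = {
        "entities": entities,
        "input_locale": input_locale,
        "surrogate_locale": surrogate_locale,
    }
    # Only pass "endpoint" when it is given explicitly; otherwise the
    # operator is documented to fall back to the AHDS_ENDPOINT env var.
    if endpoint is not None:
        params["endpoint"] = endpoint
    return params


# Usage (requires presidio-anonymizer[ahds]):
#   from presidio_anonymizer.entities import OperatorConfig
#   params = build_surrogate_params(analyzer_results)
#   operators = {"DEFAULT": OperatorConfig("surrogate", params)}
```

Keeping the defaults in a single function avoids repeating the locale strings wherever the operator is configured.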
Authentication
The AHDS de-identification service integration uses Azure's DefaultAzureCredential, which supports multiple authentication methods:
- Environment variables (Service Principal)
- Managed Identity (when running on Azure)
- Azure CLI (az login)
- Visual Studio/VS Code credentials
- Interactive browser login
For production deployments, we recommend using Service Principal or Managed Identity.
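For the Service Principal route, azure-identity's environment-based credential reads `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_CLIENT_SECRET` when a client secret is used. The sketch below reports which of these are missing before the application starts. The helper name is hypothetical, and the check covers only the client-secret variant, not certificate-based authentication.

```python
import os

# Variables read by azure-identity's EnvironmentCredential for a service
# principal that authenticates with a client secret.
SERVICE_PRINCIPAL_VARS = (
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
)


def missing_service_principal_vars(env=None):
    """Return the names of service-principal variables that are unset or empty.

    Hypothetical helper for illustration; not part of Presidio or azure-identity.
    """
    env = os.environ if env is None else env
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Service principal variables not set:", ", ".join(missing))
    else:
        print("Service principal variables are set.")
```

An empty result does not prove the credentials are valid; it only shows that the variables are present.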
Troubleshooting
Common Issues
- ModuleNotFoundError: Install the AHDS optional dependencies
  pip install presidio-analyzer[ahds] presidio-anonymizer[ahds]
- Authentication errors: Ensure Azure credentials are properly configured
  az login # For local development
- Endpoint not found: Verify the AHDS_ENDPOINT environment variable is set correctly
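The first and third issues above can be checked together with a small diagnostic script. This is a sketch using only the standard library. The function names are hypothetical, and the Azure module names in the `__main__` section are assumptions about what the `[ahds]` extras install.

```python
import importlib.util
import os


def module_available(name):
    """Return True if `name` can be located without importing the module itself."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError when a parent package is missing.
        return False


def diagnose(modules, env=None):
    """Map each module name, plus the endpoint variable, to an availability flag."""
    env = os.environ if env is None else env
    report = {name: module_available(name) for name in modules}
    report["AHDS_ENDPOINT set"] = bool(env.get("AHDS_ENDPOINT"))
    return report


if __name__ == "__main__":
    # Assumption: these are the import names behind the [ahds] extras.
    checks = diagnose([
        "presidio_analyzer",
        "presidio_anonymizer",
        "azure.identity",
        "azure.health.deidentification",
    ])
    for item, ok in checks.items():
        print(f"{'OK     ' if ok else 'MISSING'}  {item}")
```

Any `MISSING` line points at the install command or the environment variable that needs attention.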
Testing without AHDS
The AHDS recognizer and operator handle missing dependencies gracefully: if the optional AHDS packages are not installed, they are skipped. Tests that need the service are skipped automatically when the AHDS_ENDPOINT environment variable is not set.
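The same skip-when-unconfigured pattern can be applied to your own integration tests. The source does not show how Presidio's test suite implements the skip, so the sketch below uses the standard library's `unittest.skipUnless` as a stand-in. The test class and its placeholder assertion are hypothetical.

```python
import os
import unittest

AHDS_ENDPOINT = os.environ.get("AHDS_ENDPOINT")


@unittest.skipUnless(AHDS_ENDPOINT, "AHDS_ENDPOINT is not set")
class TestAhdsIntegration(unittest.TestCase):
    """Runs only when an AHDS endpoint is configured in the environment."""

    def test_endpoint_uses_https(self):
        # Placeholder check; a real test would call the analyzer and anonymizer.
        self.assertTrue(AHDS_ENDPOINT.startswith("https://"))
```

Running this with `python -m unittest` reports the test as skipped on machines without an endpoint, so local and CI runs stay green without access to the service.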