Azure Health Data Services De-identification Service Integration
Presidio supports integration with Azure Health Data Services (AHDS) de-identification service for both entity recognition and anonymization with realistic surrogate generation.
Overview
The AHDS de-identification service integration provides two main capabilities:
- AHDS Recognizer (in presidio-analyzer): Detects PHI entities using the Azure Health Data Services de-identification service
- AHDS Surrogate Operator (in presidio-anonymizer): Replaces PHI entities with realistic surrogates using the de-identification service
Benefits of Surrogation
- Maintains Data Utility: Preserves structure and format for downstream analytics
- Realistic Healthcare Context: Generates medically plausible names, dates, and identifiers
- Consistent Cross-References: Same entity gets same surrogate throughout document
- Format Preservation: Maintains original formatting and linguistic patterns
Installation
For AHDS Recognizer
pip install presidio-analyzer[ahds]
For AHDS Surrogate Operator
pip install presidio-anonymizer[ahds]
Prerequisites
- Azure Health Data Services de-identification service endpoint
- Azure role-based access control configured
- Environment variables:
  - AHDS_ENDPOINT: Your AHDS de-identification service endpoint
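A quick pre-flight check can catch a missing or malformed endpoint before any engine is created. The following is a minimal sketch using only the standard library; `get_ahds_endpoint` is a hypothetical helper (not part of Presidio), and the requirement that the value be an `https` URL is an assumption based on the example endpoint shown in this document.

```python
import os
from urllib.parse import urlparse


def get_ahds_endpoint() -> str:
    """Return AHDS_ENDPOINT, raising a clear error if it is missing or malformed.

    Hypothetical helper for illustration; not part of Presidio.
    """
    endpoint = os.environ.get("AHDS_ENDPOINT", "").strip()
    if not endpoint:
        raise RuntimeError("AHDS_ENDPOINT is not set")

    # Assumption: the service endpoint is an https URL with a host name.
    parsed = urlparse(endpoint)
    if parsed.scheme != "https" or not parsed.netloc:
        raise RuntimeError(
            f"AHDS_ENDPOINT does not look like an https URL: {endpoint!r}"
        )
    return endpoint
```

Calling `get_ahds_endpoint()` at application start-up fails fast with a readable message instead of a later connection error.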
Complete Workflow Example
import os

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Set your AHDS endpoint
os.environ["AHDS_ENDPOINT"] = "https://your-ahds-endpoint.api.eus001.deid.azure.com"

# Initialize engines
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# Medical text with PHI
text = "Patient John Doe was seen by Dr. Smith on 2024-01-15 for diabetes treatment."

# Step 1: Detect entities using AHDS recognizer
analyzer_results = analyzer.analyze(
    text=text,
    entities=["PATIENT", "DOCTOR", "DATE"],
    language="en",
)

# Step 2: Anonymize using AHDS surrogate generation
result = anonymizer.anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={
        "DEFAULT": OperatorConfig("surrogate", {
            "entities": analyzer_results,
            "input_locale": "en-US",
            "surrogate_locale": "en-US",
        })
    },
)

print(f"Original: {text}")
print(f"Anonymized: {result.text}")
# Output: "Patient Michael Johnson was seen by Dr. Brown on 1987-06-23 for diabetes treatment."
Note: This example uses the surrogation feature of the Azure Health Data Services de-identification service, which preserves data utility by generating realistic, medically appropriate surrogates while maintaining document structure and relationships.
Configuration Options
AHDS Surrogate Operator Parameters
- endpoint: AHDS de-identification service endpoint (optional, uses the AHDS_ENDPOINT env var)
- entities: List of entities detected by analyzer
- input_locale: Input locale (default: "en-US")
- surrogate_locale: Surrogate locale (default: "en-US")
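The parameters above can be collected in a plain dictionary before it is handed to the operator. The sketch below is illustrative only: `build_surrogate_params` is a hypothetical helper, not a Presidio function. It applies the documented defaults and leaves `endpoint` out when none is given, on the assumption (stated in the list above) that the operator then falls back to the `AHDS_ENDPOINT` environment variable.

```python
from typing import Any, Dict, List, Optional


def build_surrogate_params(
    entities: List[Any],
    endpoint: Optional[str] = None,
    input_locale: str = "en-US",
    surrogate_locale: str = "en-US",
) -> Dict[str, Any]:
    """Assemble the surrogate operator parameters (hypothetical helper)."""
    params: Dict[str, Any] = {
        "entities": entities,
        "input_locale": input_locale,
        "surrogate_locale": surrogate_locale,
    }
    # Only pass an explicit endpoint; otherwise the operator is expected
    # to read AHDS_ENDPOINT from the environment, as documented above.
    if endpoint is not None:
        params["endpoint"] = endpoint
    return params
```

The resulting dictionary takes the place of the inline one in the workflow example, for instance `OperatorConfig("surrogate", build_surrogate_params(analyzer_results))`.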
Authentication
The AHDS de-identification service integration uses Azure's DefaultAzureCredential, which supports multiple authentication methods:
- Environment variables (Service Principal)
- Managed Identity (when running on Azure)
- Azure CLI (az login)
- Visual Studio/VS Code credentials
- Interactive browser login
For production deployments, we recommend using Service Principal or Managed Identity.
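For the Service Principal route, the environment-based credential in the azure-identity package reads `AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, and `AZURE_CLIENT_SECRET` when a client secret is used. A small check such as the one below can report which of these are absent before a deployment starts. This is a sketch using only the standard library; `missing_service_principal_vars` is a hypothetical helper and does not call Azure or validate the values.

```python
import os
from typing import List, Mapping

# Variables read by azure-identity's environment-based credential for a
# service principal that authenticates with a client secret.
SERVICE_PRINCIPAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of service principal variables that are unset or empty.

    Hypothetical helper for illustration; it only inspects the mapping it is given.
    """
    return [name for name in SERVICE_PRINCIPAL_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Missing service principal settings:", ", ".join(missing))
    else:
        print("Service principal environment variables are present.")
```

An empty result does not guarantee that authentication will succeed; it only rules out the simplest misconfiguration. When running with Managed Identity, these variables are not needed.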
Troubleshooting
Common Issues
- ModuleNotFoundError: Install the AHDS optional dependencies
  pip install presidio-analyzer[ahds] presidio-anonymizer[ahds]
- Authentication errors: Ensure Azure credentials are properly configured
  az login  # For local development
- Endpoint not found: Verify the AHDS_ENDPOINT environment variable is set correctly
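The first and third checks above can be automated with a short diagnostic. The sketch below uses only the standard library; `is_importable` and `ahds_diagnostics` are hypothetical helpers, and the default module list is an assumption: it covers the two Presidio packages and `azure.identity` (which provides DefaultAzureCredential), not every dependency the `ahds` extras may install.

```python
import importlib.util
import os
from typing import Dict, Iterable


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package of a dotted name is missing.
        return False


def ahds_diagnostics(
    modules: Iterable[str] = ("presidio_analyzer", "presidio_anonymizer", "azure.identity"),
) -> Dict[str, bool]:
    """Report module availability and whether AHDS_ENDPOINT is set (hypothetical helper)."""
    report = {f"module:{name}": is_importable(name) for name in modules}
    report["env:AHDS_ENDPOINT"] = bool(os.environ.get("AHDS_ENDPOINT"))
    return report


if __name__ == "__main__":
    for check, ok in ahds_diagnostics().items():
        print(f"{'OK     ' if ok else 'MISSING'} {check}")
```

Authentication problems are not covered here, since verifying credentials requires a call to Azure.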
Testing without AHDS
The AHDS operators handle missing dependencies gracefully: they are skipped when the optional AHDS packages are not installed. Tests are skipped automatically when the AHDS_ENDPOINT environment variable is not set.
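The same skip-when-unconfigured pattern can be applied to your own integration tests. The following is a sketch with the standard library's `unittest`, not a copy of Presidio's test suite; the test class, its test method, and `run_ahds_tests` are hypothetical names chosen for illustration.

```python
import os
import unittest


class AhdsIntegrationTest(unittest.TestCase):
    """Hypothetical integration test that needs a live AHDS endpoint."""

    def setUp(self) -> None:
        # Checked at run time so the decision reflects the current environment.
        if not os.environ.get("AHDS_ENDPOINT"):
            self.skipTest("AHDS_ENDPOINT is not set")

    def test_endpoint_is_configured(self) -> None:
        # Placeholder assertion; a real test would call the service here.
        self.assertTrue(os.environ["AHDS_ENDPOINT"])


def run_ahds_tests() -> unittest.TestResult:
    """Run the test case and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AhdsIntegrationTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_ahds_tests()
    print(f"skipped={len(outcome.skipped)} failures={len(outcome.failures)}")
```

Skipping inside `setUp` rather than with a class-level decorator means the environment is inspected each time the test runs, which keeps the behavior predictable when variables are set by a test runner or CI step.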