Path to notebook: https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/example_dicom_redactor_evaluation.ipynb

Evaluate DICOM de-identification performance

This notebook demonstrates how to use the DicomImagePiiVerifyEngine to evaluate how well the DicomImageRedactorEngine de-identifies text Personal Health Information (PHI) from DICOM images when ground truth labels are available.

Prerequisites

Before getting started, make sure presidio and the latest version of Tesseract OCR are installed. For detailed documentation, see the installation docs.

!pip install presidio_analyzer presidio_anonymizer presidio_image_redactor
!python -m spacy download en_core_web_lg

Dataset

Sample DICOM files are available for use in this notebook in ./sample_data. Copies of the original DICOM data were saved into the folder with permission from the dataset owners. Please see the original dataset information below: > Rutherford, M., Mun, S.K., Levine, B., Bennett, W.C., Smith, K., Farmer, P., Jarosz, J., Wagner, U., Farahani, K., Prior, F. (2021). A DICOM dataset for evaluation of medical image de-identification (Pseudo-PHI-DICOM-Data) [Data set]. The Cancer Imaging Archive. DOI: https://doi.org/10.7937/s17z-r072

import os
import json
import pandas as pd
import pydicom

from presidio_image_redactor import DicomImagePiiVerifyEngine

Load ground truth

Load the ground truth labels. For more information on the ground truth format, please see the evaluating DICOM de-identification page.

# Set paths
data_dir = "sample_data"
gt_path = "sample_data/ground_truth.json"

# Load ground truth JSON
with open(gt_path) as json_file:
    gt = json.load(json_file)

# Get list of files
gt_dicom_files = list(gt.keys())
gt_dicom_files

['sample_data/0_ORIGINAL.dcm',
 'sample_data/1_ORIGINAL.dcm',
 'sample_data/2_ORIGINAL.dcm',
 'sample_data/3_ORIGINAL.dcm']

Initialize the verification engine

This engine will be used for both verification and evaluation.

dicom_engine = DicomImagePiiVerifyEngine()

Verify detected PHI for one DICOM image

To visually identify what text is being detected as PHI on a DICOM image, use the .verify_dicom_instance() method.

# Select one file to work with
file_of_interest = gt_dicom_files[0]
gt_file_of_interest = gt[file_of_interest]

# Return image to visually inspect
instance = pydicom.dcmread(file_of_interest)
verify_image, ocr_results, analyzer_results = dicom_engine.verify_dicom_instance(instance)

$No description has been provided for this image$

Evaluate de-identification performance

To evaluate how well the actual sensitive text (specified in the ground truth) are identified and redacted, use the .evaluate_dicom_instance() method.

def get_PHI_list(PHI: list) -&gt; list:
    """Get list of PHI from ground truth for a single file.

    Args:
        PHI_dict (list): List of ground truth or detected text PHI.

    Return:
        PHI_list (list): List of PHI (just text).
    """
    PHI_list = []
    for item in PHI:
        PHI_list.append(item['label'])

    return PHI_list

For one image

Display the DICOM image with bounding boxes identifying the detected PHI.

_, eval_results = dicom_engine.eval_dicom_instance(instance, gt_file_of_interest)

Results

print(f"Precision: {eval_results['precision']}")
print(f"Recall: {eval_results['recall']}")
print(f"All Positives: {get_PHI_list(eval_results['all_positives'])}")
print(f"Ground Truth: {get_PHI_list(eval_results['ground_truth'])}")

Precision: 1.0
Recall: 1.0
All Positives: ['DAVIDSON', 'DOUGLAS', '[M]', '01.09.2012', '06.16.1976']
Ground Truth: ['DAVIDSON', 'DOUGLAS', '[M]', '01.09.2012', '06.16.1976']

For multiple images

# Initialize lists to turn into results table
list_of_files = gt_dicom_files
list_of_gt = []
list_of_pos = []
list_of_recall = []
list_of_precision = []

Loop through all the files

for file in gt_dicom_files:
    # Setup
    ground_truth = gt[file]
    instance = pydicom.dcmread(file)

    # Evaluate
    _, eval_results = dicom_engine.eval_dicom_instance(instance, ground_truth)

    # Save results
    list_of_gt.append(get_PHI_list(eval_results["ground_truth"]))
    list_of_pos.append(get_PHI_list(eval_results["all_positives"]))
    list_of_recall.append(eval_results["recall"])
    list_of_precision.append(eval_results["precision"])

Create a summary results table

# Organize results into a table
all_results_dict = {
    "file": list_of_files,
    "ground_truth": list_of_gt,
    "all_positives": list_of_pos,
    "recall": list_of_recall,
    "precision": list_of_precision
}

df_results = pd.DataFrame(all_results_dict)
df_results

	file	ground_truth	all_positives	recall	precision
0	sample_data/0_ORIGINAL.dcm	[DAVIDSON, DOUGLAS, [M], 01.09.2012, 06.16.1976]	[DAVIDSON, DOUGLAS, [M], 01.09.2012, 06.16.1976]	1.0	1.0
1	sample_data/1_ORIGINAL.dcm	[MARTIN, CHAD, [U], 01.01.2000]	[MARTIN, CHAD, [U], 01.01.2000]	1.0	1.0
2	sample_data/2_ORIGINAL.dcm	[KAUFMAN, SCOTT, [M], 03.09.2012, 07.22.1943]	[KAUFMAN, 07.22.1943, SCOTT, [M], 03.09.2012]	1.0	1.0
3	sample_data/3_ORIGINAL.dcm	[MEYER, STEPHANIE, [F], 02.25.2012, 07.16.1953]	[MEYER, STEPHANIE, [F], 02.25.2012, 07.16.1953]	1.0	1.0

Experiment with settings

You can experiment with different settings such as padding_width, tolerance, and any additional arguments to feed into the image analyzer in your DICOM verification engine and see the effect on performance.

> Changing tolerance does not affect the de-identification logic nor image redaction. Tolerance is only used for matching analyzer results to ground truth labels.

For example, if we set padding_width=1, this can negatively impact the OCR step which identifies all text regardless of PHI status in an image if text is bordering the edges of the image. When the OCR fails to return all text, we cannot reliably detect PHI.

# Select file
file_of_interest = gt_dicom_files[3]
gt_file_of_interest = gt[file_of_interest]
instance = pydicom.dcmread(file_of_interest)

# Run evaluation with minimal padding (0 padding not allowed)
_, eval_results = dicom_engine.eval_dicom_instance(instance, gt_file_of_interest, padding_width=1)

Notice how low the recall is and how different the detected PHI list is here than in the summary table above which ran de-identification and evaluation with the default padding_width=25.

print(f"Precision: {eval_results['precision']}")
print(f"Recall: {eval_results['recall']}")
print(f"All Positives: {get_PHI_list(eval_results['all_positives'])}")
print(f"Ground Truth: {get_PHI_list(eval_results['ground_truth'])}")

Precision: 1.0
Recall: 0.2
All Positives: ['07.16.1953']
Ground Truth: ['MEYER', 'STEPHANIE', '[F]', '02.25.2012', '07.16.1953']

Conclusion

The DicomImagePiiVerifyEngine allows us to easily do a visual inspection on the identified PHI and also evaluate how well the de-identification worked compared to a provided ground truth.

In the case of these sample images, the precision and recall of the Presidio DicomImageRedactorEngine redact function is 1.0 when we use the default values padding_width=25 and tolerance=50.