Evaluate DICOM de-identification performance
This notebook demonstrates how to use the DicomImagePiiVerifyEngine to evaluate how well the DicomImageRedactorEngine removes text-based Protected Health Information (PHI) from DICOM images when ground truth labels are available.
Prerequisites
Before getting started, make sure Presidio and the latest version of Tesseract OCR are installed. For detailed documentation, see the installation docs.
!pip install presidio_analyzer presidio_anonymizer presidio_image_redactor
!python -m spacy download en_core_web_lg
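Since Tesseract OCR is listed as a prerequisite, it can help to confirm that the binary is reachable before running the rest of the notebook. The following is a minimal sketch using only the standard library; it assumes the executable is named `tesseract` and is on the `PATH`, and the helper name `tesseract_available` is hypothetical, not part of Presidio.

```python
import shutil
import subprocess


def tesseract_available(executable: str = "tesseract") -> bool:
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if tesseract_available():
    # Print the first line of the version banner, whichever stream it lands on.
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True, check=False
    )
    banner = (result.stdout or result.stderr).strip().splitlines()
    print(banner[0] if banner else "tesseract found, version unknown")
else:
    print("Tesseract was not found on the PATH; see the installation docs.")
```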
Dataset
Sample DICOM files are available for use in this notebook in ./sample_data. Copies of the original DICOM data were saved into the folder with permission from the dataset owners. Please see the original dataset information below:
Rutherford, M., Mun, S.K., Levine, B., Bennett, W.C., Smith, K., Farmer, P., Jarosz, J., Wagner, U., Farahani, K., Prior, F. (2021). A DICOM dataset for evaluation of medical image de-identification (Pseudo-PHI-DICOM-Data) [Data set]. The Cancer Imaging Archive. DOI: https://doi.org/10.7937/s17z-r072
import os
import json
import pandas as pd
import pydicom
from presidio_image_redactor import DicomImagePiiVerifyEngine
Load ground truth
Load the ground truth labels. For more information on the ground truth format, please see the evaluating DICOM de-identification page.
# Set paths
data_dir = "sample_data"
gt_path = "sample_data/ground_truth.json"
# Load ground truth JSON
with open(gt_path) as json_file:
    gt = json.load(json_file)
# Get list of files
gt_dicom_files = list(gt.keys())
gt_dicom_files
['sample_data/0_ORIGINAL.dcm', 'sample_data/1_ORIGINAL.dcm', 'sample_data/2_ORIGINAL.dcm', 'sample_data/3_ORIGINAL.dcm']
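Before evaluating, a quick structural check of the labels can catch malformed entries early. The sketch below assumes each file path maps to a list of entries carrying a `label` string plus numeric `left`, `top`, `width`, and `height` pixel values; that layout is an assumption here, so confirm it against the evaluating DICOM de-identification page. The helper `find_malformed_entries` and the inline `example_gt` data are hypothetical and are not taken from the sample dataset.

```python
from typing import Any, Dict, List, Tuple

# Assumed keys for one ground-truth entry (check against the format docs).
REQUIRED_KEYS = ("label", "left", "top", "width", "height")


def find_malformed_entries(
    ground_truth: Dict[str, List[Dict[str, Any]]]
) -> List[Tuple[str, int, str]]:
    """Return (file, entry index, problem) for each entry that fails the checks."""
    problems = []
    for file_path, entries in ground_truth.items():
        for idx, entry in enumerate(entries):
            missing = [key for key in REQUIRED_KEYS if key not in entry]
            if missing:
                problems.append((file_path, idx, f"missing keys: {missing}"))
                continue
            if entry["width"] <= 0 or entry["height"] <= 0:
                problems.append((file_path, idx, "non-positive box size"))
    return problems


# Made-up example data for illustration only.
example_gt = {
    "example/a.dcm": [
        {"label": "DOE", "left": 10, "top": 20, "width": 100, "height": 30},
        {"label": "JOHN", "left": 10, "top": 60, "width": 0, "height": 30},
    ],
    "example/b.dcm": [{"label": "01/01/1970", "left": 5, "top": 5}],
}

for problem in find_malformed_entries(example_gt):
    print(problem)
```

If the loaded labels follow the assumed layout, calling `find_malformed_entries(gt)` would return an empty list.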
Initialize the verification engine
This engine will be used for both verification and evaluation.
dicom_engine = DicomImagePiiVerifyEngine()
Verify detected PHI for one DICOM image
To visually identify what text is being detected as PHI on a DICOM image, use the .verify_dicom_instance() method.
# Select one file to work with
file_of_interest = gt_dicom_files[0]
gt_file_of_interest = gt[file_of_interest]
# Return image to visually inspect
instance = pydicom.dcmread(file_of_interest)
verify_image, ocr_results, analyzer_results = dicom_engine.verify_dicom_instance(instance)