Anonymize PII using Presidio on Spark
You can leverage Presidio to perform data anonymization as part of Spark notebooks.
The following sample uses Azure Databricks and simple text files hosted on Azure Blob Storage. However, it can easily be adapted to any other scenario that requires PII analysis or anonymization as part of Spark jobs.
Note that this code works with Databricks Runtime 8.1 (Spark 3.1.1) and the libraries described here.
The basics of working with Presidio in Spark
A typical use case of Presidio in Spark is transforming a text column in a data frame by anonymizing its content. The following code sample, part of the transform presidio notebook, is the basis of the end-to-end sample, which uses Azure Databricks as the Spark environment.
import pandas as pd
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

anonymized_column = "value"  # name of column to anonymize

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

# broadcast the engines to the cluster nodes
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)

# define an anonymization function and a pandas Series function over it
def anonymize_text(text: str) -> str:
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})
        },
    )
    return anonymized_results.text

def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)

# register the series function as a pandas UDF
anonymize = pandas_udf(anonymize_series, returnType=StringType())

# apply the UDF to the column
anonymized_df = input_df.withColumn(
    anonymized_column, anonymize(col(anonymized_column))
)
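In the sample above, input_df is assumed to be an existing DataFrame with a string column named value. As a minimal sketch (the path here is an illustrative assumption, not part of the sample), it could come from reading plain text files, since spark.read.text produces exactly such a value column:

# Sketch only: read text files into a DataFrame with a single "value" column.
# The mount point and folder name are illustrative assumptions.
input_df = spark.read.text("/mnt/files/input/*.txt")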
Pre-requisites
If you do not have an instance of Azure Databricks, follow these steps to provision and set up the required infrastructure.
If you already have a Databricks workspace and a cluster you wish to configure to run Presidio, skip ahead to the Configure an existing cluster section.
Deploy Infrastructure
Provision the Azure resources by running the following script.
export RESOURCE_GROUP=[resource group name]
export STORAGE_ACCOUNT_NAME=[storage account name]
export STORAGE_CONTAINER_NAME=[blob container name]
export DATABRICKS_WORKSPACE_NAME=[databricks workspace name]
export DATABRICKS_SKU=[basic/standard/premium]
export LOCATION=[location]
# Create the resource group
az group create --name $RESOURCE_GROUP --location $LOCATION
# Use ARM template to build the resources and get back the workspace URL
deployment_response=$(az deployment group create -g $RESOURCE_GROUP --template-file ./docs/samples/deployments/spark/arm-template/databricks.json --parameters location=$LOCATION workspaceName=$DATABRICKS_WORKSPACE_NAME storageAccountName=$STORAGE_ACCOUNT_NAME containerName=$STORAGE_CONTAINER_NAME)
export DATABRICKS_WORKSPACE_URL=$(echo $deployment_response | jq -r ".properties.outputs.workspaceUrl.value")
export DATABRICKS_WORKSPACE_ID=$(echo $deployment_response | jq -r ".properties.outputs.workspaceId.value")
Setup Databricks
The following script sets up a new cluster in the Databricks workspace and prepares it to run Presidio anonymization jobs. Once finished, the script outputs an access key which you can use when working with the Databricks CLI.
sh ./scripts/configure_databricks.sh
Configure an existing cluster
Only follow the steps in this section if you have an existing Databricks workspace and cluster you wish to configure to run Presidio. If you've already followed the "Deploy Infrastructure" and "Setup Databricks" sections, you do not have to run the steps in this section.
Set up secret scope and secrets for storage account
Add an Azure Storage account key to secret scope.
STORAGE_PRIMARY_KEY=[Primary key of storage account]
databricks secrets create-scope --scope storage_scope --initial-manage-principal users
databricks secrets put --scope storage_scope --key storage_account_access_key --string-value "$STORAGE_PRIMARY_KEY"
Upload or update cluster init scripts
Presidio libraries are loaded onto the cluster at init time. Upload the cluster setup script, or add its content to the existing cluster's init script.
databricks fs cp "./setup/startup.sh" "dbfs:/FileStore/dependencies/startup.sh"
Set up the cluster to run the init script on startup (for example, by adding dbfs:/FileStore/dependencies/startup.sh under the cluster's Advanced Options > Init Scripts).
Upload presidio notebooks
databricks workspace import_dir "./notebooks" "/notebooks" --overwrite
Update cluster environment
Add the following environment variables to your Databricks cluster:
"STORAGE_MOUNT_NAME": "/mnt/files"
"STORAGE_CONTAINER_NAME": [Blob container name]
"STORAGE_ACCOUNT_NAME": [Storage account name]
Mount the storage container
Run the 00_setup notebook to mount the storage container to Databricks.
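Conceptually, the mount performed by 00_setup is similar to the following sketch; the exact notebook code may differ. It reads the cluster environment variables defined above and the secret created earlier (the local variable names in the sketch are illustrative):

import os

# Sketch only: mount the blob container with dbutils. The 00_setup notebook may differ.
storage_account_name = os.environ["STORAGE_ACCOUNT_NAME"]
storage_container_name = os.environ["STORAGE_CONTAINER_NAME"]
storage_mount_name = os.environ["STORAGE_MOUNT_NAME"]

dbutils.fs.mount(
    source=f"wasbs://{storage_container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point=storage_mount_name,
    extra_configs={
        f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": dbutils.secrets.get(
            scope="storage_scope", key="storage_account_access_key"
        )
    },
)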
Running the sample
Configure Presidio transformation notebook
From the Databricks workspace, under the notebooks folder, open the provided 01_transform_presidio notebook and attach it to the cluster presidio_cluster. Run the first code cell and note the following parameters at the top of the notebook (notebook widgets), then set them accordingly (a sketch of how such widgets are read follows the list):
- Input File Format - text (selected).
- Input path - a folder on the container where input files are found.
- Output Folder - a folder on the container where output files will be written to.
- Column to Anonymize - value (selected).
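Inside the notebook, these parameters are read with dbutils.widgets. The widget names below are illustrative assumptions and may not match the notebook exactly:

# Sketch only: widget names are illustrative and may differ from the notebook.
input_file_format = dbutils.widgets.get("input_file_format")
input_path = dbutils.widgets.get("input_path")
output_path = dbutils.widgets.get("output_path")
anonymized_column = dbutils.widgets.get("anonymized_column")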
Run the notebook
Upload a text file to the blob storage input folder, using any preferred method (Azure Portal, Azure Storage Explorer, Azure CLI). For example, with the Azure CLI:
az storage blob upload --account-name $STORAGE_ACCOUNT_NAME --container $STORAGE_CONTAINER_NAME --file ./[file name] --name input/[file name]
Run the notebook cells. The output should be CSV files which contain two columns: the original file name, and the anonymized content of that file.
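The output step is conceptually similar to the following sketch, which reuses anonymize, col, and anonymized_column from the sample at the top of this page; the paths and the filename column name are assumptions, and the actual notebook code may differ:

from pyspark.sql.functions import input_file_name

# Sketch only: tag each row with its source file, anonymize the text column,
# and write the two-column result as CSV. Paths are illustrative.
result_df = (
    spark.read.text("/mnt/files/input")
    .withColumn("filename", input_file_name())
    .withColumn(anonymized_column, anonymize(col(anonymized_column)))
    .select("filename", anonymized_column)
)
result_df.write.mode("overwrite").option("header", True).csv("/mnt/files/output")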