Presidio + Spark: PII Detection & Anonymization¶
This notebook demonstrates how to use Presidio for PII detection and anonymization with Spark in Microsoft Fabric, leveraging SpaCy models for accurate entity recognition.
Table of Contents¶
1. Small Model Setup & Demo: Initialize the Spark session and set up a lightweight SpaCy-based PII detection model. (Small models are not recommended, as they may reduce accuracy.)
2. Large Model Setup & Demo: Configure a larger SpaCy model for more accurate PII detection.
3. Load the Data and Broadcast: Load the sample data and broadcast the analyzer and anonymizer objects to the worker nodes.
4. Detect PII Summary: Run Presidio's analyzer engine on a Spark DataFrame to detect sensitive information.
5. Anonymize user_query: Apply Presidio's anonymizer engine to mask detected PII while maintaining data utility.
6. Scale-Up Testing: Evaluate performance on larger datasets by duplicating rows, appending unique IDs, and re-anonymizing.
7. Write to a Delta Table (Optional): Persist anonymized data to a Delta table for further analysis.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PresidioInFabric").getOrCreate()
Loading SpaCy Models¶
There are two methods to load SpaCy models:
- Via the Environment: for smaller models (e.g., en_core_web_md, < 300 MB) that are included directly in your Fabric environment. Note: small models are not recommended, as they may reduce accuracy.
- From the Lakehouse: for larger models (e.g., en_core_web_lg, > 300 MB), install them within the notebook:
  - First, upload the .whl file to your Lakehouse.
  - Then install it using the pip command below.
%pip install /lakehouse/default/Files/presidio/models/en_core_web_lg-3.8.0-py3-none-any.whl
# Installing the large model from the lakehouse as it exceeds the size limit for custom libraries in the Fabric environment.
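Whichever route you use, you can confirm which SpaCy models are visible before wiring one into Presidio. A minimal check (assuming spaCy 3.x, where spacy.util.get_installed_models lists the installed pipeline packages):
import spacy

# List the pipeline packages spaCy can see in this environment
print(spacy.util.get_installed_models())

# Confirm the model actually loads before configuring Presidio with it
nlp = spacy.load("en_core_web_lg")
print(nlp.pipe_names)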
from pyspark.sql.functions import (
array, lit, explode, col, monotonically_increasing_id, concat
)
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import pandas_udf, PandasUDFType
from presidio_anonymizer.entities import OperatorConfig
import pandas as pd
num_duplicates = 5000  # number of copies made of each row in the scale-up test
csv_path = "Files/presidio/fabric_sample_data.csv"
is_write_to_delta = True
table_name = "presidio_demo_table"
partitions_number = 100
Small Model Setup & Demo¶
configuration = {
"nlp_engine_name": "spacy",
"models": [
{"lang_code": "en", "model_name": "en_core_web_md"},
]
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
small_analyzer = AnalyzerEngine(
nlp_engine=nlp_engine, supported_languages=["en"]
)
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer_results = small_analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER"], language='en')
print(analyzer_results)
model_to_presidio_entity_mapping is missing from configuration, using default
low_score_entity_names is missing from configuration, using default
labels_to_ignore is missing from configuration, using default
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: pl, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - EsNifRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - EsNieRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItDriverLicenseRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItFiscalCodeRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItVatCodeRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItIdentityCardRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItPassportRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - PlPeselRecognizer supported languages: pl, registry supported languages: en
[type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]
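Each result carries only the entity type, character offsets, and a score; to see the matched text itself, slice the original string (a small sketch using the variables above):
# Show the substring each recognizer matched, with its type and score
for res in analyzer_results:
    print(res.entity_type, text_to_anonymize[res.start:res.end], res.score)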
Large Model Setup & Demo¶
configuration = {
"nlp_engine_name": "spacy",
"models": [
{"lang_code": "en", "model_name": "en_core_web_lg"},
]
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()
analyzer = AnalyzerEngine(
nlp_engine=nlp_engine, supported_languages=["en"]
)
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["PHONE_NUMBER"], language='en')
print(analyzer_results)
model_to_presidio_entity_mapping is missing from configuration, using default
low_score_entity_names is missing from configuration, using default
labels_to_ignore is missing from configuration, using default
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - CreditCardRecognizer supported languages: pl, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - EsNifRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - EsNieRecognizer supported languages: es, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItDriverLicenseRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItFiscalCodeRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItVatCodeRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItIdentityCardRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - ItPassportRecognizer supported languages: it, registry supported languages: en
Recognizer not added to registry because language is not supported by registry - PlPeselRecognizer supported languages: pl, registry supported languages: en
[type: PHONE_NUMBER, start: 46, end: 58, score: 0.75]
Load the data and broadcast¶
Load your data from the CSV file stored in your Lakehouse (e.g., fabric_sample_data.csv). This sample data contains various types of PII that we'll detect and anonymize.
We broadcast the analyzer and anonymizer objects to make them available to all worker nodes in the Spark cluster, which improves performance for distributed processing.
anonymizer = AnonymizerEngine()
broadcasted_analyzer = spark.sparkContext.broadcast(analyzer)
broadcasted_anonymizer = spark.sparkContext.broadcast(anonymizer)
df = spark.read.format("csv").option("header", "true").load(csv_path)
display(df)
# df.show()
+------------+--------------------+------------+-------------+-----+--------+--------------------+
| name| email| street| city|state| non_pii| user_query|
+------------+--------------------+------------+-------------+-----+--------+--------------------+
| John Doe|john.doe@example.com| 123 Elm St| Dallas| TX| abc123|My phone number i...|
| Jane Smith|jane.smith@exampl...| 456 Oak Rd| Miami| FL| xyz789|Please call me at...|
| Alice Brown|alice.brown@examp...| 99 Pine Dr| Seattle| WA|cust-123|SSN is 987-65-432...|
| Bob Davis|bob.davis@example...|10 Maple Ave| New York| NY| info999|Passport number i...|
| Carol Jones|carol.jones@examp...|777 Cedar Ln| Los Angeles| CA| test111|My phone 333-777-...|
| David Green|david.green@examp...|321 Birch St| Chicago| IL| npii-01|He said credit ca...|
| Emily Clark|emily.clark@examp...|555 Aspen Rd| Boston| MA| npii-02|Passport A9876543...|
|Frank Wilson|frank.wilson@exam...| 1010 Walnut| Phoenix| AZ| npii-03|SSN 666-77-8888 a...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL| npii-04|Use my card 4000-...|
| Henry King|henry.king@exampl...| 3030 Peach| Atlanta| GA| npii-05|NHSNO 0123456789 ...|
| Irene Adams|irene.adams@examp...| 4040 Plum| Denver| CO| npii-06|SSN 000-11-2222 a...|
| Jack Miller|jack.miller@examp...| 5050 Lily| Houston| TX| npii-07|My phone 713-555-...|
| Kate Allen|kate.allen@exampl...| 6060 Poplar| Philadelphia| PA| npii-08|SSN: 123-00-4567,...|
| Leo Clark|leo.clark@example...| 7070 Pine| Las Vegas| NV| npii-09|He used card # 44...|
| Mona Reed|mona.reed@example...| 8080 Bay| San Diego| CA| npii-10|Call me at 619-55...|
| Nancy Ward|nancy.ward@exampl...| 9090 Moss|San Francisco| CA| npii-11|Passport is L1234...|
| Oscar Hill|oscar.hill@exampl...| 1111 Hill| Portland| OR| npii-12|My card is 4002-1...|
| Paula Ray|paula.ray@example...| 2222 Cliff| Austin| TX| npii-13|CC 4012-8888-9999...|
| Ray Foster|ray.foster@exampl...| 3333 Rock| Nashville| TN| npii-14|Card # 5555-6666-...|
| Sara White|sara.white@exampl...| 4444 Shell| Raleigh| NC| npii-15|SSN 111-22-9999 a...|
+------------+--------------------+------------+-------------+-----+--------+--------------------+
Detect PII Summary¶
This section demonstrates how to:
- Instantiate an AnalyzerEngine with the SpaCy NLP model
- Run PII detection across each row in your DataFrame
- Create a summary of detected PII entities by column
The function below analyzes each column separately so we know which substring (entity text) belongs to which column.
def detect_pii_in_row(*cols):
"""
Analyze each column separately so we know which substring (entity text)
belongs to which column. Return a dict {col_name: [ 'ENTITY_TYPE: substring', ... ] }.
"""
analyzer = broadcasted_analyzer.value
col_names = detect_pii_in_row.col_names
entities_found = {}
for idx, val in enumerate(cols):
if val is None:
continue
column_text = str(val)
results = analyzer.analyze(text=column_text, language="en")
if results:
# Example: ["PERSON: John Doe", "PHONE_NUMBER: 212-555-1111", ...]
found_entities = []
for res in results:
substring = column_text[res.start:res.end] # The actual text recognized
entity_str = f"{res.entity_type}: {substring}"
found_entities.append(entity_str)
entities_found[col_names[idx]] = found_entities
# If no PII was detected at all
if not entities_found:
return "No PII"
return str(entities_found)
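# Attach the column names as a function attribute so they are serialized to the
# workers along with the function itself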
detect_pii_in_row.col_names = df.columns
detect_pii_udf = udf(detect_pii_in_row, StringType())
df_with_pii_summary = df.withColumn(
"pii_summary",
detect_pii_udf(*[col(c) for c in df.columns])
)
display(df_with_pii_summary)
# df_with_pii_summary.show()
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+
| name| email| street| city|state| non_pii| user_query| pii_summary|
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+
| John Doe|john.doe@example.com| 123 Elm St| Dallas| TX| abc123|My phone number i...|{'name': ['PERSON...|
| Jane Smith|jane.smith@exampl...| 456 Oak Rd| Miami| FL| xyz789|Please call me at...|{'name': ['PERSON...|
| Alice Brown|alice.brown@examp...| 99 Pine Dr| Seattle| WA|cust-123|SSN is 987-65-432...|{'name': ['PERSON...|
| Bob Davis|bob.davis@example...|10 Maple Ave| New York| NY| info999|Passport number i...|{'name': ['PERSON...|
| Carol Jones|carol.jones@examp...|777 Cedar Ln| Los Angeles| CA| test111|My phone 333-777-...|{'name': ['PERSON...|
| David Green|david.green@examp...|321 Birch St| Chicago| IL| npii-01|He said credit ca...|{'name': ['PERSON...|
| Emily Clark|emily.clark@examp...|555 Aspen Rd| Boston| MA| npii-02|Passport A9876543...|{'name': ['PERSON...|
|Frank Wilson|frank.wilson@exam...| 1010 Walnut| Phoenix| AZ| npii-03|SSN 666-77-8888 a...|{'name': ['PERSON...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL| npii-04|Use my card 4000-...|{'name': ['PERSON...|
| Henry King|henry.king@exampl...| 3030 Peach| Atlanta| GA| npii-05|NHSNO 0123456789 ...|{'name': ['PERSON...|
| Irene Adams|irene.adams@examp...| 4040 Plum| Denver| CO| npii-06|SSN 000-11-2222 a...|{'name': ['PERSON...|
| Jack Miller|jack.miller@examp...| 5050 Lily| Houston| TX| npii-07|My phone 713-555-...|{'name': ['PERSON...|
| Kate Allen|kate.allen@exampl...| 6060 Poplar| Philadelphia| PA| npii-08|SSN: 123-00-4567,...|{'name': ['PERSON...|
| Leo Clark|leo.clark@example...| 7070 Pine| Las Vegas| NV| npii-09|He used card # 44...|{'name': ['PERSON...|
| Mona Reed|mona.reed@example...| 8080 Bay| San Diego| CA| npii-10|Call me at 619-55...|{'name': ['PERSON...|
| Nancy Ward|nancy.ward@exampl...| 9090 Moss|San Francisco| CA| npii-11|Passport is L1234...|{'name': ['PERSON...|
| Oscar Hill|oscar.hill@exampl...| 1111 Hill| Portland| OR| npii-12|My card is 4002-1...|{'name': ['PERSON...|
| Paula Ray|paula.ray@example...| 2222 Cliff| Austin| TX| npii-13|CC 4012-8888-9999...|{'name': ['PERSON...|
| Ray Foster|ray.foster@exampl...| 3333 Rock| Nashville| TN| npii-14|Card # 5555-6666-...|{'name': ['PERSON...|
| Sara White|sara.white@exampl...| 4444 Shell| Raleigh| NC| npii-15|SSN 111-22-9999 a...|{'name': ['PERSON...|
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+
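From here it is easy to isolate the rows that actually contain PII, for example (a small follow-up sketch):
# Keep only the rows where at least one column contained detected PII
flagged = df_with_pii_summary.filter(col("pii_summary") != "No PII")
print(f"rows with PII: {flagged.count()}")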
Anonymize user_query¶
This section shows how to:
- Use the AnonymizerEngine to mask or replace sensitive information
- Apply anonymization to specific columns
- Preserve the DataFrame structure while removing PII
We implement a UDF that detects and anonymizes text in the given column.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def anonymize_text(text: str) -> str:
    """
    Detect PII in the given text using the large model and replace each match
    with a placeholder. With an empty new_value, Presidio's replace operator
    falls back to the <ENTITY_TYPE> placeholder (e.g., <PHONE_NUMBER>).
    """
if text is None:
return None
analyzer = broadcasted_analyzer.value
anonymizer = broadcasted_anonymizer.value
analyze_results = analyzer.analyze(text=text, language="en")
anonymized_result = anonymizer.anonymize(
text=text,
analyzer_results=analyze_results,
operators={"DEFAULT": OperatorConfig("replace", {"new_value": ""})}
)
return anonymized_result.text
# Registering as a regular PySpark UDF
anon_udf = udf(anonymize_text, StringType())
df_final = df_with_pii_summary.withColumn("anon_user_query", anon_udf(col("user_query")))
display(df_final)
# df_final.show()
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+--------------------+
| name| email| street| city|state| non_pii| user_query| pii_summary| anon_user_query|
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+--------------------+
| John Doe|john.doe@example.com| 123 Elm St| Dallas| TX| abc123|My phone number i...|{'name': ['PERSON...|My phone number i...|
| Jane Smith|jane.smith@exampl...| 456 Oak Rd| Miami| FL| xyz789|Please call me at...|{'name': ['PERSON...|Please call me at...|
| Alice Brown|alice.brown@examp...| 99 Pine Dr| Seattle| WA|cust-123|SSN is 987-65-432...|{'name': ['PERSON...|SSN is <US_ITIN>;...|
| Bob Davis|bob.davis@example...|10 Maple Ave| New York| NY| info999|Passport number i...|{'name': ['PERSON...|Passport number i...|
| Carol Jones|carol.jones@examp...|777 Cedar Ln| Los Angeles| CA| test111|My phone 333-777-...|{'name': ['PERSON...|My phone <PHONE_N...|
| David Green|david.green@examp...|321 Birch St| Chicago| IL| npii-01|He said credit ca...|{'name': ['PERSON...|He said credit ca...|
| Emily Clark|emily.clark@examp...|555 Aspen Rd| Boston| MA| npii-02|Passport A9876543...|{'name': ['PERSON...|Passport <US_DRIV...|
|Frank Wilson|frank.wilson@exam...| 1010 Walnut| Phoenix| AZ| npii-03|SSN 666-77-8888 a...|{'name': ['PERSON...|SSN 666-77-8888 a...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL| npii-04|Use my card 4000-...|{'name': ['PERSON...|Use my card <CRED...|
| Henry King|henry.king@exampl...| 3030 Peach| Atlanta| GA| npii-05|NHSNO 0123456789 ...|{'name': ['PERSON...|NHSNO <UK_NHS> an...|
| Irene Adams|irene.adams@examp...| 4040 Plum| Denver| CO| npii-06|SSN 000-11-2222 a...|{'name': ['PERSON...|SSN 000-11-2222 a...|
| Jack Miller|jack.miller@examp...| 5050 Lily| Houston| TX| npii-07|My phone 713-555-...|{'name': ['PERSON...|My phone <PHONE_N...|
| Kate Allen|kate.allen@exampl...| 6060 Poplar| Philadelphia| PA| npii-08|SSN: 123-00-4567,...|{'name': ['PERSON...|SSN: 123-00-4567,...|
| Leo Clark|leo.clark@example...| 7070 Pine| Las Vegas| NV| npii-09|He used card # 44...|{'name': ['PERSON...|He used card # <I...|
| Mona Reed|mona.reed@example...| 8080 Bay| San Diego| CA| npii-10|Call me at 619-55...|{'name': ['PERSON...|Call me at <PHONE...|
| Nancy Ward|nancy.ward@exampl...| 9090 Moss|San Francisco| CA| npii-11|Passport is L1234...|{'name': ['PERSON...|Passport is L1234...|
| Oscar Hill|oscar.hill@exampl...| 1111 Hill| Portland| OR| npii-12|My card is 4002-1...|{'name': ['PERSON...|My card is <CREDI...|
| Paula Ray|paula.ray@example...| 2222 Cliff| Austin| TX| npii-13|CC 4012-8888-9999...|{'name': ['PERSON...|CC <IN_PAN>9999-1...|
| Ray Foster|ray.foster@exampl...| 3333 Rock| Nashville| TN| npii-14|Card # 5555-6666-...|{'name': ['PERSON...|Card # <IN_PAN>77...|
| Sara White|sara.white@exampl...| 4444 Shell| Raleigh| NC| npii-15|SSN 111-22-9999 a...|{'name': ['PERSON...|SSN <US_SSN> and ...|
+------------+--------------------+------------+-------------+-----+--------+--------------------+--------------------+--------------------+
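Replace-with-placeholder is only one option. Presidio's anonymizer also ships operators such as mask and hash, and operators can be set per entity type. A hedged sketch of what such a configuration could look like (the specific parameters here are illustrative, not part of this notebook's pipeline):
# Illustrative per-entity operators: mask the first eight characters of phone
# numbers, hash credit card numbers, and use a generic placeholder elsewhere.
sample_operators = {
    "PHONE_NUMBER": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 8, "from_end": False}),
    "CREDIT_CARD": OperatorConfig("hash", {"hash_type": "sha256"}),
    "DEFAULT": OperatorConfig("replace", {"new_value": "<PII>"}),
}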
Scale-Up Testing¶
This section lets you:
- Evaluate performance on larger datasets by duplicating rows
- Test how well the solution scales with increasing data volume
- Measure execution time for PII detection and anonymization at scale
We expand the dataset by creating multiple copies of each row, appending unique IDs, and then re-anonymizing the text to simulate processing a larger dataset.
df_expanded = df.withColumn(
"duplication_array",
array([lit(i) for i in range(num_duplicates)])
)
df_test = df_expanded.withColumn("duplicate_id", explode(col("duplication_array")))
df_test = df_test.withColumn("id", monotonically_increasing_id())
df_test = df_test.withColumn(
"user_query",
concat(col("user_query"), lit(" - ID: "), col("id"))
)
df_test = df_test.drop("duplication_array", "duplicate_id")
df_test = df_test.repartition(partitions_number)  # repartition to demonstrate parallel processing; remove or tune this for larger scales
df_test = df_test.withColumn("anon_user_query", anon_udf(col("user_query")))
print(f"total row number {df_test.count()}")  # num_duplicates x the number of rows in the original DF
display(df_test.limit(partitions_number))
# (df_test.limit(partitions_number)).show()
total row number 100000
+------------+--------------------+------------+-----------+-----+-------+--------------------+-----+--------------------+
| name| email| street| city|state|non_pii| user_query| id| anon_user_query|
+------------+--------------------+------------+-----------+-----+-------+--------------------+-----+--------------------+
| Leo Clark|leo.clark@example...| 7070 Pine| Las Vegas| NV|npii-09|He used card # 44...|65376|He used card # <I...|
| Oscar Hill|oscar.hill@exampl...| 1111 Hill| Portland| OR|npii-12|My card is 4002-1...|84635|My card is <CREDI...|
| Emily Clark|emily.clark@examp...|555 Aspen Rd| Boston| MA|npii-02|Passport A9876543...|34945|Passport <US_DRIV...|
| Oscar Hill|oscar.hill@exampl...| 1111 Hill| Portland| OR|npii-12|My card is 4002-1...|81800|My card is <CREDI...|
| Irene Adams|irene.adams@examp...| 4040 Plum| Denver| CO|npii-06|SSN 000-11-2222 a...|51834|SSN 000-11-2222 a...|
| Jack Miller|jack.miller@examp...| 5050 Lily| Houston| TX|npii-07|My phone 713-555-...|55111|My phone <PHONE_N...|
|Frank Wilson|frank.wilson@exam...| 1010 Walnut| Phoenix| AZ|npii-03|SSN 666-77-8888 a...|38326|SSN 666-77-8888 a...|
| David Green|david.green@examp...|321 Birch St| Chicago| IL|npii-01|He said credit ca...|29259|He said credit ca...|
| Sara White|sara.white@exampl...| 4444 Shell| Raleigh| NC|npii-15|SSN 111-22-9999 a...|96171|SSN <US_SSN> and ...|
| David Green|david.green@examp...|321 Birch St| Chicago| IL|npii-01|He said credit ca...|29983|He said credit ca...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL|npii-04|Use my card 4000-...|44886|Use my card <CRED...|
| Bob Davis|bob.davis@example...|10 Maple Ave| New York| NY|info999|Passport number i...|19619|Passport number i...|
| Bob Davis|bob.davis@example...|10 Maple Ave| New York| NY|info999|Passport number i...|17969|Passport number i...|
| Sara White|sara.white@exampl...| 4444 Shell| Raleigh| NC|npii-15|SSN 111-22-9999 a...|99185|SSN <US_SSN> and ...|
| Ray Foster|ray.foster@exampl...| 3333 Rock| Nashville| TN|npii-14|Card # 5555-6666-...|93467|Card # <IN_PAN>77...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL|npii-04|Use my card 4000-...|40770|Use my card <CRED...|
| Jack Miller|jack.miller@examp...| 5050 Lily| Houston| TX|npii-07|My phone 713-555-...|56657|My phone <PHONE_N...|
| Carol Jones|carol.jones@examp...|777 Cedar Ln|Los Angeles| CA|test111|My phone 333-777-...|24322|My phone <PHONE_N...|
| Grace Lee|grace.lee@example...|2020 Palm St| Orlando| FL|npii-04|Use my card 4000-...|41338|Use my card <CRED...|
| John Doe|john.doe@example.com| 123 Elm St| Dallas| TX| abc123|My phone number i...| 3698|My phone number i...|
+------------+--------------------+------------+-----------+-----+-------+--------------------+-----+--------------------+
only showing top 20 rows
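The cell above reports the row count but not how long the run took. To measure execution time, force a full evaluation of the anonymized DataFrame and time it; count() alone may let Spark prune the UDF column, so a sketch using the noop sink (available in Spark 3.0+) is safer:
import time

start = time.time()
# The noop format computes every partition, including the anon_user_query UDF,
# without writing the results anywhere.
df_test.write.format("noop").mode("overwrite").save()
print(f"anonymization pass took {time.time() - start:.1f}s")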
Write to a Delta Table (Optional)¶
Optionally save your anonymized data to a Delta table:
- Persist results for downstream processing
- Make anonymized data available to other Fabric applications
- Enable further analysis while maintaining privacy compliance
The following code will execute only if is_write_to_delta is set to True.
if is_write_to_delta:
    df_test.write.format("delta").mode("overwrite").saveAsTable(table_name)
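To confirm the write, you can read the table back (a quick sanity check using the same table_name):
# Read a few rows back from the Delta table to verify the write
spark.read.table(table_name).select("user_query", "anon_user_query").show(5, truncate=False)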
Conclusion¶
You have successfully executed this notebook in Fabric, leveraging Presidio for PII detection/anonymization and Spark for scalable data processing.
You can incorporate it into a Fabric pipeline for scheduled runs and integrated workflows. For further customization—like adding detection rules, additional anonymization methods, or advanced Spark configurations—refer to the official Presidio documentation and your Fabric environment's settings.
Enjoy building robust PII compliance workflows!