Presidio Structured Basic Usage

Path to notebook: https://www.github.com/microsoft/presidio/blob/main/docs/samples/python/example_structured.ipynb

In [1]:

from presidio_structured import StructuredEngine, JsonAnalysisBuilder, PandasAnalysisBuilder, StructuredAnalysis, CsvReader, JsonReader, JsonDataProcessor, PandasDataProcessor

This sample showcases presidio-structured on structured and semi-structured data containing sensitive values such as names, emails, and addresses. It differs from the batch analyzer/anonymizer engines sample, whose data includes narrative phrases that might contain sensitive data. Because personal data is embedded in those phrases, the text inside each cell has to be analyzed and anonymized. That is not the case for this structured sample, where the sensitive data is already separated into columns.
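The difference between the two cases can be sketched in plain Python. This is only an illustration of the idea, not how presidio works internally: the `<ENTITY>` placeholder style, the hard-coded findings, and the helper name `replace_spans` are assumptions made for this example.

```python
# Structured case: the whole cell is the sensitive value, so a mapping from
# column name to entity type is enough to decide what to replace.
row = {"id": 1, "name": "John Doe", "city": "Anytown"}
entity_mapping = {"name": "PERSON", "city": "LOCATION"}
anonymized_row = {
    key: f"<{entity_mapping[key]}>" if key in entity_mapping else value
    for key, value in row.items()
}

# Narrative case: the sensitive values are embedded in free text, so every
# finding needs character offsets inside the cell.
text = "Please call John Doe when you reach Anytown."


def find_span(haystack, needle, entity):
    """Return (start, end, entity) for the first occurrence of needle."""
    start = haystack.index(needle)
    return (start, start + len(needle), entity)


# Hypothetical findings; a real pipeline would get these from an analyzer.
findings = [
    find_span(text, "John Doe", "PERSON"),
    find_span(text, "Anytown", "LOCATION"),
]


def replace_spans(source, spans):
    """Replace each span with a placeholder, working from the end of the
    string so that earlier offsets stay valid."""
    for start, end, entity in sorted(spans, reverse=True):
        source = source[:start] + f"<{entity}>" + source[end:]
    return source


anonymized_text = replace_spans(text, findings)
print(anonymized_row)
print(anonymized_text)
```

In the structured case no text analysis of the cell content is needed at anonymization time, which is what the rest of this sample relies on.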

Loading in data

In [2]:

sample_df = CsvReader().read("./csv_sample_data/test_structured.csv")
sample_df

Out[2]:

	id	name	email	street	city	state	postal_code
0	1	John Doe	john.doe@example.com	123 Main St	Anytown	CA	12345
1	2	Jane Smith	jane.smith@example.com	456 Elm St	Somewhere	TX	67890
2	3	Alice Johnson	alice.johnson@example.com	789 Pine St	Elsewhere	NY	11223

In [3]:

sample_json = JsonReader().read("./sample_data/test_structured.json")
sample_json

Out[3]:

{'id': 1,
 'name': 'John Doe',
 'email': 'john.doe@example.com',
 'address': {'street': '123 Main St',
  'city': 'Anytown',
  'state': 'CA',
  'postal_code': '12345'}}

In [4]:

# contains nested objects in lists
sample_complex_json = JsonReader().read("./sample_data/test_structured_complex.json")
sample_complex_json

Out[4]:

{'users': [{'id': 1,
   'name': 'John Doe',
   'email': 'john.doe@example.com',
   'address': {'street': '123 Main St',
    'city': 'Anytown',
    'state': 'CA',
    'postal_code': '12345'}},
  {'id': 2,
   'name': 'Jane Smith',
   'email': 'jane.smith@example.com',
   'address': {'street': '456 Elm St',
    'city': 'Somewhere',
    'state': 'TX',
    'postal_code': '67890'}},
  {'id': 3,
   'name': 'Alice Johnson',
   'email': 'alice.johnson@example.com',
   'address': {'street': '789 Pine St',
    'city': 'Elsewhere',
    'state': 'NY',
    'postal_code': '11223'}}]}

Tabular (CSV) data: defining and generating tabular analysis, anonymization

In [5]:

# Automatically detect the entity for the columns
tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df)
tabular_analysis

Out[5]:

StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'EMAIL_ADDRESS', 'city': 'LOCATION', 'state': 'LOCATION'})

In [6]:

# by default, anonymized values are replaced with "<None>" unless operators are specified

pandas_engine = StructuredEngine(data_processor=PandasDataProcessor())
df_to_be_anonymized = sample_df.copy()  # work on a copy, since anonymization modifies the DataFrame in place
anonymized_df = pandas_engine.anonymize(df_to_be_anonymized, tabular_analysis, operators=None) # explicit None for clarity
anonymized_df

Out[6]:

	id	name	email	street	city	state	postal_code
0	1	<None>	<None>	123 Main St	<None>	<None>	12345
1	2	<None>	<None>	456 Elm St	<None>	<None>	67890
2	3	<None>	<None>	789 Pine St	<None>	<None>	11223

We can also define operators using OperatorConfig, as with the AnonymizerEngine:

In [7]:

from presidio_anonymizer.entities.engine import OperatorConfig
from faker import Faker
fake = Faker()

operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "person..."}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()})
    # etc...
    }
anonymized_df = pandas_engine.anonymize(sample_df.copy(), tabular_analysis, operators=operators)  # copy, so sample_df itself is not modified
anonymized_df

Out[7]:

	id	name	email	street	city	state	postal_code
0	1	person...	jamestaylor@example.net	123 Main St	<None>	<None>	12345
1	2	person...	brian49@example.com	456 Elm St	<None>	<None>	67890
2	3	person...	clarkcody@example.org	789 Pine St	<None>	<None>	11223
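In the output above, city and state still show `<None>`: the operators dictionary has no entry for LOCATION, so the default replacement is used for those columns. The pure-Python sketch below illustrates this lookup-with-fallback idea. It is not presidio's implementation; `anonymize_rows`, `default_operator`, and the sample rows are hypothetical names and data chosen for the illustration.

```python
from typing import Any, Callable, Dict, List

Operator = Callable[[str], str]


def anonymize_rows(
    rows: List[Dict[str, Any]],
    entity_mapping: Dict[str, str],
    operators: Dict[str, Operator],
    default_operator: Operator,
) -> List[Dict[str, Any]]:
    """Apply one operator per mapped column, falling back to the default
    operator when the column's entity type has no dedicated operator."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        for column, entity in entity_mapping.items():
            if column in new_row:
                operator = operators.get(entity, default_operator)
                new_row[column] = operator(str(new_row[column]))
        result.append(new_row)
    return result


rows = [
    {"id": 1, "name": "John Doe", "city": "Anytown"},
    {"id": 2, "name": "Jane Smith", "city": "Somewhere"},
]
entity_mapping = {"name": "PERSON", "city": "LOCATION"}

# Only PERSON has a dedicated operator; LOCATION falls back to the default.
operators = {"PERSON": lambda value: "person..."}
default_operator = lambda value: "<None>"

anonymized_rows = anonymize_rows(rows, entity_mapping, operators, default_operator)
print(anonymized_rows)
```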

Semi-structured (JSON) data: simple and complex analysis, anonymization

In [8]:

json_analysis = JsonAnalysisBuilder().generate_analysis(sample_json)
json_analysis

Out[8]:

StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'EMAIL_ADDRESS', 'address.city': 'LOCATION', 'address.state': 'LOCATION'})
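The mapping keys above, such as address.city, are dotted paths into the nested JSON object. As a rough illustration of how such keys relate to the nested structure, the sketch below flattens a nested dictionary into dotted keys. The helper `flatten` is hypothetical and written for this example; it is not part of presidio-structured.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested object, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = {
    "id": 1,
    "name": "John Doe",
    "address": {"city": "Anytown", "state": "CA"},
}

flat_record = flatten(record)
print(sorted(flat_record))
```

Each leaf value ends up under the same kind of key that appears in entity_mapping, so a path like address.city identifies exactly one field of the object.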

In [9]:

# Automatic analysis currently does not support nested objects in lists
try:
    json_complex_analysis = JsonAnalysisBuilder().generate_analysis(sample_complex_json)
except ValueError as e:
    print(e)

# however, we can define it manually:
json_complex_analysis = StructuredAnalysis(entity_mapping={
    "users.name":"PERSON",
    "users.address.street":"LOCATION",
    "users.address.city":"LOCATION",
    "users.address.state":"LOCATION",
    "users.email": "EMAIL_ADDRESS",
})

Analyzer.analyze_iterator only works on primitive types (int, float, bool, str). Lists of objects are not yet supported.
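The manual mapping uses paths such as users.name even though users is a list, so the same path is meant to address that field in every element of the list. The sketch below shows one way such a path could be applied across list elements in plain Python. It is an illustration under that assumption, not presidio's implementation, and `apply_at_path` is a hypothetical helper.

```python
from typing import Any, Callable, List


def apply_at_path(node: Any, path: List[str], operator: Callable[[Any], Any]) -> None:
    """Apply operator in place to the value at a dotted path, visiting every
    element whenever a list is encountered along the way."""
    if isinstance(node, list):
        for item in node:
            apply_at_path(item, path, operator)
        return
    if not isinstance(node, dict) or not path or path[0] not in node:
        return  # the path does not exist in this branch; nothing to do
    key, rest = path[0], path[1:]
    if rest:
        apply_at_path(node[key], rest, operator)
    else:
        node[key] = operator(node[key])


data = {
    "users": [
        {"id": 1, "name": "John Doe", "address": {"city": "Anytown"}},
        {"id": 2, "name": "Jane Smith", "address": {"city": "Somewhere"}},
    ]
}

# The same dotted path reaches the field in each list element.
apply_at_path(data, "users.name".split("."), lambda value: "person...")
apply_at_path(data, "users.address.city".split("."), lambda value: "<None>")
print(data)
```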

In [10]:

# anonymizing simple data
json_engine = StructuredEngine(data_processor=JsonDataProcessor())
anonymized_json = json_engine.anonymize(sample_json, json_analysis, operators=operators)
anonymized_json

Out[10]:

{'id': 1,
 'name': 'person...',
 'email': 'virginia29@example.org',
 'address': {'street': '123 Main St',
  'city': '<None>',
  'state': '<None>',
  'postal_code': '12345'}}

In [11]:

anonymized_complex_json = json_engine.anonymize(sample_complex_json, json_complex_analysis, operators=operators)
anonymized_complex_json

Out[11]:

{'users': [{'id': 1,
   'name': 'person...',
   'email': 'david90@example.org',
   'address': {'street': '<None>',
    'city': '<None>',
    'state': '<None>',
    'postal_code': '12345'}},
  {'id': 2,
   'name': 'person...',
   'email': 'david90@example.org',
   'address': {'street': '<None>',
    'city': '<None>',
    'state': '<None>',
    'postal_code': '67890'}},
  {'id': 3,
   'name': 'person...',
   'email': 'david90@example.org',
   'address': {'street': '<None>',
    'city': '<None>',
    'state': '<None>',
    'postal_code': '11223'}}]}