Presidio Structured Basic Usage
In [1]:
from presidio_structured import (
    StructuredEngine,
    JsonAnalysisBuilder,
    PandasAnalysisBuilder,
    StructuredAnalysis,
    CsvReader,
    JsonReader,
    JsonDataProcessor,
    PandasDataProcessor,
)
This sample showcases presidio-structured on structured and semi-structured data that contains sensitive values such as names, emails, and addresses. It differs from the sample for the batch analyzer/anonymizer engines, whose data includes narrative phrases that might contain sensitive data. Because personal data is embedded in those phrases, the text inside each cell has to be analyzed and anonymized. That is not necessary for this structured sample, where the sensitive data is already separated into columns.
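To make the distinction concrete, here is a minimal sketch in plain Python that does not call any Presidio API. The helper `redact_columns`, the `<REDACTED>` placeholder, and the `comment` field are hypothetical names chosen for illustration. It shows why replacing whole values works when each column holds a single sensitive value, and why the same approach is too coarse for a narrative cell.

```python
# Illustrative only: plain Python, no Presidio calls.
structured_row = {"id": 1, "name": "John Doe", "email": "john.doe@example.com"}
narrative_row = {"id": 1, "comment": "Please call John Doe tomorrow morning."}


def redact_columns(row, sensitive_columns, placeholder="<REDACTED>"):
    """Replace the whole value of every column listed as sensitive."""
    return {
        key: placeholder if key in sensitive_columns else value
        for key, value in row.items()
    }


# Column-level replacement is enough when each column holds one value.
redacted_structured = redact_columns(structured_row, {"name", "email"})
# For a narrative cell, the same approach also discards the harmless words.
redacted_narrative = redact_columns(narrative_row, {"comment"})

print(redacted_structured)
print(redacted_narrative)
```

In the narrative case the entire sentence is lost, which is why that kind of data needs analysis inside each cell instead.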
Loading in data
In [2]:
sample_df = CsvReader().read("./csv_sample_data/test_structured.csv")
sample_df
Out[2]:
 | id | name | email | street | city | state | postal_code |
---|---|---|---|---|---|---|---|
0 | 1 | John Doe | john.doe@example.com | 123 Main St | Anytown | CA | 12345 |
1 | 2 | Jane Smith | jane.smith@example.com | 456 Elm St | Somewhere | TX | 67890 |
2 | 3 | Alice Johnson | alice.johnson@example.com | 789 Pine St | Elsewhere | NY | 11223 |
In [3]:
sample_json = JsonReader().read("./sample_data/test_structured.json")
sample_json
Out[3]:
{'id': 1, 'name': 'John Doe', 'email': 'john.doe@example.com', 'address': {'street': '123 Main St', 'city': 'Anytown', 'state': 'CA', 'postal_code': '12345'}}
In [4]:
# contains nested objects in lists
sample_complex_json = JsonReader().read("./sample_data/test_structured_complex.json")
sample_complex_json
Out[4]:
{'users': [{'id': 1, 'name': 'John Doe', 'email': 'john.doe@example.com', 'address': {'street': '123 Main St', 'city': 'Anytown', 'state': 'CA', 'postal_code': '12345'}}, {'id': 2, 'name': 'Jane Smith', 'email': 'jane.smith@example.com', 'address': {'street': '456 Elm St', 'city': 'Somewhere', 'state': 'TX', 'postal_code': '67890'}}, {'id': 3, 'name': 'Alice Johnson', 'email': 'alice.johnson@example.com', 'address': {'street': '789 Pine St', 'city': 'Elsewhere', 'state': 'NY', 'postal_code': '11223'}}]}
Tabular (CSV) data: defining and generating tabular analysis, anonymization
In [5]:
# Automatically detect the entity for the columns
tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df)
tabular_analysis
Out[5]:
StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'EMAIL_ADDRESS', 'city': 'LOCATION', 'state': 'LOCATION'})
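The detected mapping above leaves out the `street` column. A minimal sketch of how a manual entry could be merged with the detected entries, using plain dictionaries only. Treating `street` as `LOCATION` is an assumption made for illustration, and `manual_mapping` is a hypothetical name.

```python
# Mapping reported by the automatic analysis (copied from Out[5]).
detected_mapping = {
    "name": "PERSON",
    "email": "EMAIL_ADDRESS",
    "city": "LOCATION",
    "state": "LOCATION",
}

# Hypothetical manual additions for columns the detection left out.
manual_mapping = {"street": "LOCATION"}

# Later keys win, so manual entries override detected ones on conflict.
entity_mapping = {**detected_mapping, **manual_mapping}
print(entity_mapping)

# With presidio-structured installed, the merged dictionary could then be
# wrapped as StructuredAnalysis(entity_mapping=entity_mapping).
```

Keeping the merge in a plain dictionary makes it easy to review which columns will be anonymized before any data is touched.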
In [6]:
# Without operators, detected values are replaced with the "<None>" placeholder
pandas_engine = StructuredEngine(data_processor=PandasDataProcessor())
df_to_be_anonymized = sample_df.copy()  # work on a copy because anonymization happens in place
anonymized_df = pandas_engine.anonymize(df_to_be_anonymized, tabular_analysis, operators=None)  # explicit None for clarity
anonymized_df
Out[6]:
 | id | name | email | street | city | state | postal_code |
---|---|---|---|---|---|---|---|
0 | 1 | <None> | <None> | 123 Main St | <None> | <None> | 12345 |
1 | 2 | <None> | <None> | 456 Elm St | <None> | <None> | 67890 |
2 | 3 | <None> | <None> | 789 Pine St | <None> | <None> | 11223 |
We can also define operators using OperatorConfig, similar to the AnonymizerEngine:
In [7]:
from presidio_anonymizer.entities.engine import OperatorConfig
from faker import Faker

fake = Faker()

# Map each entity type to the operator that should anonymize it
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "person..."}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()}),
    # etc...
}
anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators)
anonymized_df
Out[7]:
 | id | name | email | street | city | state | postal_code |
---|---|---|---|---|---|---|---|
0 | 1 | person... | jamestaylor@example.net | 123 Main St | <None> | <None> | 12345 |
1 | 2 | person... | brian49@example.com | 456 Elm St | <None> | <None> | 67890 |
2 | 3 | person... | clarkcody@example.org | 789 Pine St | <None> | <None> | 11223 |
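The output above suggests a simple dispatch rule: entity types that have an operator use it, and the remaining mapped entities (here `LOCATION`) fall back to the `<None>` placeholder. The following sketch mimics that behavior in plain Python. It is not how `StructuredEngine` is implemented. The names `anonymize_row` and `default_operator`, and the counter-based email generator that stands in for Faker, are hypothetical.

```python
from itertools import count

_email_ids = count(1)  # deterministic stand-in for a fake-data generator

# Entity type -> callable that produces the replacement value.
operators = {
    "PERSON": lambda value: "person...",
    "EMAIL_ADDRESS": lambda value: f"user{next(_email_ids)}@example.com",
}


def default_operator(value):
    """Fallback used for mapped entities without a dedicated operator."""
    return "<None>"


entity_mapping = {
    "name": "PERSON",
    "email": "EMAIL_ADDRESS",
    "city": "LOCATION",
    "state": "LOCATION",
}


def anonymize_row(row, entity_mapping, operators, default=default_operator):
    """Return a copy of row with every mapped column passed through its operator."""
    result = dict(row)
    for column, entity in entity_mapping.items():
        if column in result:
            operator = operators.get(entity, default)
            result[column] = operator(result[column])
    return result


row = {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com",
    "street": "123 Main St",
    "city": "Anytown",
    "state": "CA",
    "postal_code": "12345",
}
anonymized = anonymize_row(row, entity_mapping, operators)
print(anonymized)
```

Columns that are absent from the mapping, such as `street` and `postal_code`, pass through unchanged, which matches the table above.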
Semi-structured (JSON) data: simple and complex analysis, anonymization
In [8]:
json_analysis = JsonAnalysisBuilder().generate_analysis(sample_json)
json_analysis
Out[8]:
StructuredAnalysis(entity_mapping={'name': 'PERSON', 'email': 'EMAIL_ADDRESS', 'address.city': 'LOCATION', 'address.state': 'LOCATION'})
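The keys in this mapping use dotted paths such as `address.city` to address nested fields. As a rough illustration of that naming scheme, the sketch below flattens a nested dictionary into dotted paths. `flatten_keys` is a hypothetical helper written for this example and is not part of presidio-structured.

```python
def flatten_keys(data, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested dict."""
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten_keys(value, path)
        else:
            yield path, value


sample = {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com",
    "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA",
        "postal_code": "12345",
    },
}

flat = dict(flatten_keys(sample))
print(list(flat))
```

Listing the flattened paths this way is a convenient check when writing an entity mapping by hand.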
In [9]:
# Currently does not support nested objects in lists
try:
    json_complex_analysis = JsonAnalysisBuilder().generate_analysis(sample_complex_json)
except ValueError as e:
    print(e)

# However, we can define it manually:
json_complex_analysis = StructuredAnalysis(entity_mapping={
    "users.name": "PERSON",
    "users.address.street": "LOCATION",
    "users.address.city": "LOCATION",
    "users.address.state": "LOCATION",
    "users.email": "EMAIL_ADDRESS",
})
Analyzer.analyze_iterator only works on primitive types (int, float, bool, str). Lists of objects are not yet supported.
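In the manual mapping, a path such as `users.address.city` has to reach a field inside every element of the `users` list. One way to picture that is a small recursive walk that maps over lists it meets along the path. This is an illustrative sketch only: `apply_at_path` is a hypothetical helper, and the actual traversal inside `JsonDataProcessor` may work differently.

```python
def apply_at_path(node, path, operator):
    """Return a copy of node with operator applied to every value at path.

    path is a list of keys; lists met along the way are mapped element-wise.
    """
    if isinstance(node, list):
        return [apply_at_path(item, path, operator) for item in node]
    if not path:
        return operator(node)
    if not isinstance(node, dict) or path[0] not in node:
        return node  # the path does not exist here, so leave the value unchanged
    updated = dict(node)
    updated[path[0]] = apply_at_path(node[path[0]], path[1:], operator)
    return updated


data = {
    "users": [
        {"name": "John Doe", "address": {"city": "Anytown"}},
        {"name": "Jane Smith", "address": {"city": "Somewhere"}},
    ]
}

result = apply_at_path(data, "users.address.city".split("."), lambda value: "<None>")
print(result)
```

Because the helper builds copies on the way down, the original `data` stays untouched.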
In [10]:
# anonymizing simple data
json_engine = StructuredEngine(data_processor=JsonDataProcessor())
anonymized_json = json_engine.anonymize(sample_json, json_analysis, operators=operators)
anonymized_json
Out[10]:
{'id': 1, 'name': 'person...', 'email': 'virginia29@example.org', 'address': {'street': '123 Main St', 'city': '<None>', 'state': '<None>', 'postal_code': '12345'}}
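A quick sanity check after anonymizing is to confirm that only the mapped fields changed. The sketch below compares the original record with the anonymized output shown above, both copied in as literals. `changed_paths` is a hypothetical helper and not part of presidio-structured.

```python
def changed_paths(before, after, prefix=""):
    """Return dotted paths whose leaf values differ between two nested dicts."""
    paths = []
    for key, old in before.items():
        path = f"{prefix}.{key}" if prefix else key
        new = after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            paths.extend(changed_paths(old, new, path))
        elif old != new:
            paths.append(path)
    return paths


original = {
    "id": 1,
    "name": "John Doe",
    "email": "john.doe@example.com",
    "address": {"street": "123 Main St", "city": "Anytown", "state": "CA", "postal_code": "12345"},
}
anonymized = {
    "id": 1,
    "name": "person...",
    "email": "virginia29@example.org",
    "address": {"street": "123 Main St", "city": "<None>", "state": "<None>", "postal_code": "12345"},
}

print(changed_paths(original, anonymized))
# → ['name', 'email', 'address.city', 'address.state']
```

The changed paths match the keys of the entity mapping from the simple JSON analysis, so no unmapped field was altered.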
In [11]:
anonymized_complex_json = json_engine.anonymize(sample_complex_json, json_complex_analysis, operators=operators)
anonymized_complex_json
Out[11]:
{'users': [{'id': 1, 'name': 'person...', 'email': 'david90@example.org', 'address': {'street': '<None>', 'city': '<None>', 'state': '<None>', 'postal_code': '12345'}}, {'id': 2, 'name': 'person...', 'email': 'david90@example.org', 'address': {'street': '<None>', 'city': '<None>', 'state': '<None>', 'postal_code': '67890'}}, {'id': 3, 'name': 'person...', 'email': 'david90@example.org', 'address': {'street': '<None>', 'city': '<None>', 'state': '<None>', 'postal_code': '11223'}}]}