# download presidio
!pip install presidio_analyzer presidio_anonymizer
!pip install openai pandas python-dotenv
!python -m spacy download en_core_web_lg
Use Presidio + OpenAI to turn real text into fake text
This notebook uses Presidio to turn text with PII into text where PII entities are replaced with placeholders, e.g. "My name is David" turns into "My name is {{PERSON}}". It then calls the OpenAI API to create a fake record based on the original one.
Flow:
My friend David lives in Paris. He likes it.
My friend {{PERSON}} lives in {{CITY}}. He likes it.
My friend Lucy lives in Beirut. She likes it.
Note that OpenAI completion models could possibly detect PII values and replace them in one call, but it is suggested to validate that all PII entities are indeed detected.
Imports and set up OpenAI Key
import pprint
from dotenv import load_dotenv
import os
import pandas as pd
from openai import OpenAI
load_dotenv()
client = OpenAI(
api_key=os.environ.get("OPENAI_API_KEY"),
)
# Or set the key explicitly in the notebook. Find out more here: https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key
Define request for the OpenAI service
def call_completion_model(prompt: str, model: str = "gpt-3.5-turbo", max_tokens: int = 512) -> str:
    """Creates a request for the OpenAI chat completions service and returns the response.

    :param prompt: The prompt for the completion model
    :param model: OpenAI model name
    :param max_tokens: Model's max tokens parameter
    """
    completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        model=model,
        max_tokens=max_tokens,  # cap the length of the generated text
    )
    return completion.choices[0].message.content
De-identify data using Presidio Analyzer and Anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
sample = """
Hello, my name is David Johnson and I live in Maine.
My credit card number is 4095-2609-9393-4932 and my crypto wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ.
On September 18 I visited microsoft.com and sent an email to test@presidio.site, from the IP 192.168.0.1.
My passport: 191280342 and my phone number: (212) 555-1234.
This is a valid International Bank Account Number: IL150120690000003111111 . Can you please check the status on bank account 954567876544?
Kate's social security number is 078-05-1126. Her driver license? it is 1234567A.
"""
results = analyzer.analyze(sample, language="en")
anonymized = anonymizer.anonymize(text=sample, analyzer_results=results)
anonymized_text = anonymized.text
print(anonymized_text)
Hello, my name is <PERSON> and I live in <LOCATION>. My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>. On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>, from the IP <IP_ADDRESS>. My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>. This is a valid International Bank Account Number: <IBAN_CODE> . Can you please check the status on bank account <US_BANK_NUMBER>? <PERSON>'s social security number is <US_SSN>. Her driver license? it is <US_DRIVER_LICENSE>.
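Conceptually, the output above is the original text with each detected span swapped for its entity type. The following is a minimal standard-library sketch of that idea, not Presidio's actual implementation: `Span` and `replace_spans` are hypothetical names introduced here for illustration, and the sketch assumes the spans do not overlap (Presidio resolves overlaps itself).

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one analyzer finding (not a Presidio class)."""
    start: int
    end: int
    entity_type: str


def replace_spans(text: str, spans: List[Span]) -> str:
    """Replace each detected span with an <ENTITY_TYPE> placeholder."""
    # Work from the last span to the first so that earlier offsets stay
    # valid while the string changes length.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


text = "My name is David and I live in Maine."
spans = [
    Span(text.index("David"), text.index("David") + len("David"), "PERSON"),
    Span(text.index("Maine"), text.index("Maine") + len("Maine"), "LOCATION"),
]
print(replace_spans(text, spans))
```

Replacing from the end of the text backwards is the design choice worth noting: substituting left to right would shift every later offset as soon as a placeholder is longer or shorter than the value it replaces.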
Create prompt (instructions + text to manipulate)
def create_prompt(anonymized_text: str) -> str:
    """
    Create the prompt with instructions for the LLM.

    :param anonymized_text: Text with placeholders instead of PII values, e.g. My name is <PERSON>.
    """
    prompt = f"""
    Your role is to create synthetic text based on de-identified text with placeholders instead of Personally Identifiable Information (PII).
    Replace the placeholders (e.g. ,<PERSON>, {{DATE}}, {{ip_address}}) with fake values.
    Instructions:
    a. Use completely random numbers, so every digit is drawn between 0 and 9.
    b. Use realistic names that come from diverse genders, ethnicities and countries.
    c. If there are no placeholders, return the text as is.
    d. Keep the formatting as close to the original as possible.
    e. If PII exists in the input, replace it with fake values in the output.
    f. Remove whitespace before and after the generated text
    input: [[TEXT STARTS]] How do I change the limit on my credit card {{credit_card_number}}?[[TEXT ENDS]]
    output: How do I change the limit on my credit card 2539 3519 2345 1555?
    input: [[TEXT STARTS]]<PERSON> was the chief science officer at <ORGANIZATION>.[[TEXT ENDS]]
    output: Katherine Buckjov was the chief science officer at NASA.
    input: [[TEXT STARTS]]Cameroon lives in <LOCATION>.[[TEXT ENDS]]
    output: Vladimir lives in Moscow.
    input: [[TEXT STARTS]]{anonymized_text}[[TEXT ENDS]]
    output:"""
    return prompt


print("This is the prompt with de-identified values:")
print(create_prompt(anonymized_text))
This is the prompt with de-identified values: Your role is to create synthetic text based on de-identified text with placeholders instead of Personally Identifiable Information (PII). Replace the placeholders (e.g. ,<PERSON>, {DATE}, {ip_address}) with fake values. Instructions: a. Use completely random numbers, so every digit is drawn between 0 and 9. b. Use realistic names that come from diverse genders, ethnicities and countries. c. If there are no placeholders, return the text as is. d. Keep the formatting as close to the original as possible. e. If PII exists in the input, replace it with fake values in the output. f. Remove whitespace before and after the generated text input: [[TEXT STARTS]] How do I change the limit on my credit card {credit_card_number}?[[TEXT ENDS]] output: How do I change the limit on my credit card 2539 3519 2345 1555? input: [[TEXT STARTS]]<PERSON> was the chief science officer at <ORGANIZATION>.[[TEXT ENDS]] output: Katherine Buckjov was the chief science officer at NASA. input: [[TEXT STARTS]]Cameroon lives in <LOCATION>.[[TEXT ENDS]] output: Vladimir lives in Moscow. input: [[TEXT STARTS]] Hello, my name is <PERSON> and I live in <LOCATION>. My credit card number is <CREDIT_CARD> and my crypto wallet id is <CRYPTO>. On <DATE_TIME> I visited <URL> and sent an email to <EMAIL_ADDRESS>, from the IP <IP_ADDRESS>. My passport: <US_PASSPORT> and my phone number: <PHONE_NUMBER>. This is a valid International Bank Account Number: <IBAN_CODE> . Can you please check the status on bank account <US_BANK_NUMBER>? <PERSON>'s social security number is <US_SSN>. Her driver license? it is <US_DRIVER_LICENSE>. [[TEXT ENDS]] output:
Call the LLM
gpt_res = call_completion_model(create_prompt(anonymized_text))
print(gpt_res)
Hello, my name is Aaliyah and I live in Tokyo. My credit card number is 4928 7562 1034 8907 and my crypto wallet id is 0x3B 7a 5f 1C. On 02/07/2023 15:45 I visited www.example.com and sent an email to example@email.com, from the IP 127.0.0.1. My passport: L921483B and my phone number: +1 (555) 123-4567. This is a valid International Bank Account Number: FR76 1234 5789 1256 3321 7564 901. Can you please check the status on bank account 987654321? Eliana's social security number is 123-45-6789. Her driver license? it is DL12345678.
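Since the model is not guaranteed to fill every placeholder, it can help to scan the response for any that survived. The sketch below uses only the standard library; the function name `find_unfilled_placeholders` and the regular expression are assumptions made for illustration, covering the two placeholder styles that appear in this notebook (`<ENTITY_TYPE>` and `{{entity_type}}`).

```python
import re
from typing import List

# Matches Presidio-style <ENTITY_TYPE> and template-style {{entity_type}} placeholders.
PLACEHOLDER_RE = re.compile(r"<[A-Z][A-Z_]*>|\{\{\s*\w+\s*\}\}")


def find_unfilled_placeholders(text: str) -> List[str]:
    """Return every placeholder that is still present in the text."""
    return PLACEHOLDER_RE.findall(text)


leftover = find_unfilled_placeholders("Hello, my name is Aaliyah and I live in <LOCATION>.")
if leftover:
    print("Unfilled placeholders:", leftover)
```

An empty list only shows that no placeholder syntax remains; it does not show that the generated values are realistic or that no real PII slipped through.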
Alternatively, run on a list of template sentences:
import urllib.request

templates = []
url = "https://raw.githubusercontent.com/microsoft/presidio-research/master/presidio_evaluator/data_generator/raw_data/templates.txt"
for line in urllib.request.urlopen(url):
    templates.append(line.decode('utf-8'))
print("Example templates:")
templates[:5]
Example templates:
['I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is it possible?\n', 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n', 'Need to change billing date of my card {{credit_card_number}}\n', 'I want to update my primary and secondary address to the same: {{address}}\n', "In case of my child's account, we need to add {{person}} as guardian\n"]
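Before sending templates to the model, it can be useful to see which placeholder types they contain. This is a small standard-library sketch; `count_placeholders` is a hypothetical helper, and the list below simply repeats the five example templates printed above.

```python
import re
from collections import Counter
from typing import Iterable

# Template placeholders look like {{entity_type}}.
TEMPLATE_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def count_placeholders(templates: Iterable[str]) -> Counter:
    """Count how often each placeholder type occurs across the templates."""
    counts = Counter()
    for template in templates:
        counts.update(TEMPLATE_PLACEHOLDER.findall(template))
    return counts


example_templates = [
    "I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is it possible?\n",
    "My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n",
    "Need to change billing date of my card {{credit_card_number}}\n",
    "I want to update my primary and secondary address to the same: {{address}}\n",
    "In case of my child's account, we need to add {{person}} as guardian\n",
]
print(count_placeholders(example_templates))
```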
templates_to_use = templates[:5]
import time
pp = pprint.PrettyPrinter(indent=2, width=110)
sentences = []
for template in templates_to_use:
    synth_sentence = call_completion_model(create_prompt(template))
    sentence_dict = {"original": template, "synthetic": synth_sentence.strip()}
    sentences.append(sentence_dict)
    pp.pprint(sentence_dict)
    time.sleep(3)  # wait to not get blocked by service (only applicable for the free tier)
    print("--------------")
{ 'original': 'I want to increase limit on my card # {{credit_card_number}} for certain duration of time. is '
              'it possible?\n',
  'synthetic': 'I want to increase limit on my card # 4701 2895 7462 8306 for certain duration of time. is '
               'it possible?'}
--------------
{ 'original': 'My credit card {{credit_card_number}} has been lost, Can I request you to block it.\n',
  'synthetic': 'My credit card 4892 7634 1023 8756 has been lost, Can I request you to block it.'}
--------------
{ 'original': 'Need to change billing date of my card {{credit_card_number}}\n',
  'synthetic': 'Need to change billing date of my card 4876 2035 6981 7423'}
--------------
{ 'original': 'I want to update my primary and secondary address to the same: {{address}}\n',
  'synthetic': 'I want to update my primary and secondary address to the same: 123 Main Street, Apt 4.'}
--------------
{ 'original': "In case of my child's account, we need to add {{person}} as guardian\n",
  'synthetic': "In case of my child's account, we need to add Abdul as guardian"}
--------------
-------------- { 'original': '{{name}} lives at {{building_number}} {{street_name}}, {{city}}\n', 'synthetic': 'John Smith lives at 635 Poplar Street, Houston'} -------------- { 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} ' '{{street_name}}\n', 'synthetic': 'Adam had given Sarah his address: 44 Apple Street'} -------------- { 'original': '{{first_name_male}} had given {{first_name}} his address: {{building_number}} ' '{{street_name}}, {{city}}\n', 'synthetic': 'David had given Emma his address: 515 Elm Street, Camden.'} -------------- { 'original': 'What is your address? it is {{address}}\n', 'synthetic': 'What is your address? it is 3498 Allensby Street, Los Angeles, CA 90011.'} -------------- {'original': 'We moved here from {{city}}\n', 'synthetic': 'We moved here from Paris.'} -------------- {'original': 'We moved here from {{country}}\n', 'synthetic': 'We moved here from Venezuela.'} -------------- { 'original': '{{person}}\\n\\n{{building_number}} {{street_name}}\\n {{secondary_address}}\\n {{city}}\\n ' '{{country}} {{postcode}}\\n{{phone_number}}-Office\\,{{phone_number}}-Fax\n', 'synthetic': 'Jessica Thompson\n' ' 8745 West Drive\n' ' Suite 1402\n' ' Brooklyn, NY USA 12009\n' ' 789-534-9921-Office, 567-945-0023-Fax'} -------------- { 'original': '{{person}}\\n{{job}}\\n{{organization}}\\n{{address}}\n', 'synthetic': 'John Smith\\nAccountant\\nGlobalTech Solutions\\n25 Speedwell Street, Richmond, VA 23223'} -------------- { 'original': 'Our offices are located at {{address}}\n', 'synthetic': 'Our offices are located at 1234 Main St, Los Angeles, CA 91234.'} -------------- { 'original': 'Please return to {{address}} in case of an issue.\n', 'synthetic': 'Please return to 123 Cherry Street, El Monte, CA 91731 in case of an issue.'} -------------- { 'original': '{{organization}}\\n\\n{{address}}\n', 'synthetic': 'ABC Inc.\n 1234 Main Street, Anytown, ST 12345'} -------------- { 'original': 'The {{organization}} 
office is at {{address}}\n', 'synthetic': 'The ABC Corporation office is at 123 Redwood Street, Anytown, USA.'} -------------- { 'original': '{{name}}\\n{{organization}}\\n{{address}}\\n{{phone_number}} office\\n{{phone_number}} ' 'fax\\n{{phone_number}} mobile\\n\n', 'synthetic': 'Larry Fernandez\\nStar Enterprises\\n421 5th Avenue, Los Angeles, CA 90012\\n123-456-7890 ' 'office\\n456-789-0123 fax\\n256-454-2397 mobile\\n'} -------------- { 'original': '{{name}}\\n{{organization}}\\n{{address}}\\nMobile: {{phone_number}}\\nDesk: ' '{{phone_number}}\\nFax: {{phone_number}}\\n\n', 'synthetic': 'John Brown\n' ' ABC Consulting\n' ' 5th St., Suite 116, LA CA 90004\n' ' Mobile: 213-294-4497\n' ' Desk: 424-348-1275\n' ' Fax: 323-456-2545'} -------------- { 'original': 'Billing address: {{name}}\\n {{building_number}} {{street_name}} ' '{{secondary_address}}\\n {{city}}\\n {{state_abbr}}\\n {{zipcode}}\\n\n', 'synthetic': 'Billing address: Mariam Rajput\n' ' 576 Broadway Street Apartment A8\n' ' Sacramento\n' ' CA\n' ' 95349'} -------------- { 'original': "As promised, here's {{first_name}}'s address:\\n\\n{{address}}\n", 'synthetic': "As promised, here's Aamir's address:\\n\\n2166 Sesame Street, Fortaleza, Ceará, Brazil."} -------------- { 'original': '>{{name}}\\n>{{organization}}\\n>{{person}}\\n>{{building_number}} ' '{{street_name}}\\n>{{secondary_address}}\\n>{{city}}\\n>{{country}} {{postcode}}\n', 'synthetic': '>Freda Chen\n' '>Example Inc.\n' '>John Doe\n' '>50 Main Street\n' '>Apt. 2\n' '>New York\n' '>United States 10005'} -------------- { 'original': '??? {{name}}\\n??? {{organization}}\\n??? {{building_number}} {{street_name}}\\n??? ' '{{secondary_address}}\\n??? {{city}}\\n??? 
{{country}} {{postcode}}\n', 'synthetic': 'John Smith\n' ' ABC Corporation\n' ' 192 Main Street\n' ' Suite 100\n' ' Austin\n' ' United States 78701'} -------------- { 'original': '> \\n> {{name}}\\n> {{organization}}\\n> {{person}}\\n> {{building_number}} ' '{{street_name}}\\n> {{secondary_address}}\\n> {{city}}\\n> {{country}} {{postcode}}\n', 'synthetic': '> John Doe\n' ' > ABC Corp\n' ' > Jane Smith\n' ' > 123 Main Street\n' ' > APT 8\n' ' > San Francisco\n' ' > United States 12345'} -------------- { 'original': 'Pedestrians must enter on {{street_name}} St. the first three months\n', 'synthetic': 'Pedestrians must enter on Jericho Avenue St. the first three months'} -------------- { 'original': 'When: {{date_time}}\\nWhere: {{city}} Country Club.\n', 'synthetic': 'When: 05/01/2020 10:00am\\nWhere: Richmond Country Club.'} -------------- { 'original': "We'll meet {{day_of_week}} at {{organization}}, {{building_number}} {{street_name}}, " '{{city}}\n', 'synthetic': "We'll meet Monday at Smartdel Solutions, 145 King Street, San Diego."} -------------- { 'original': 'They had 6: {{first_name}}, {{first_name}}, {{first_name}}, {{first_name}}, {{first_name}} ' 'and {{first_name}}.\n', 'synthetic': 'They had 6: Sarah, Micheal, Kanak, Hana, Mei and Dan.'} -------------- {'original': 'She moved here from {{country}}\n', 'synthetic': 'She moved here from Mexico.'} -------------- {'original': 'My zip code is {{zipcode}}\n', 'synthetic': 'My zip code is 47713.'} -------------- {'original': 'ZIP: {{zipcode}}\n', 'synthetic': 'ZIP: 08547'} -------------- {'original': 'The bus station is on {{street_name}}\n', 'synthetic': 'The bus station is on Wilson Avenue.'} -------------- { 'original': "They're not answering at {{phone_number}}\n", 'synthetic': "They're not answering at 654-339-1013."} -------------- { 'original': 'God gave rock and roll to you, gave rock and roll to you, put it in the soul of everyone.\n', 'synthetic': 'God gave rock and roll to you, gave rock and roll 
to you, put it in the soul of everyone.'} -------------- {'original': '3... 2... 1... liftoff!\n', 'synthetic': '3... 2... 1... liftoff!'} -------------- { 'original': 'My great great grandfather was called {{name_male}}, and my great great grandmother was ' 'called {{name_female}}\n', 'synthetic': 'My great great grandfather was called Michael, and my great great grandmother was called ' 'Emma.'} -------------- {'original': 'She named him {{first_name_male}}\n', 'synthetic': 'She named him Juan.'} -------------- { 'original': 'Name: {{name}}\\nAddress: {{address}}\n', 'synthetic': 'Name: Amari Walters\\nAddress: 32 Webster Street, Salem, MA 01819'} -------------- { 'original': 'Follow up with {{name}} in a couple of months.\n', 'synthetic': 'Follow up with Beatriz Lawrence in a couple of months.'} -------------- { 'original': '{{prefix_male}} {{last_name_male}} is a {{age}} year old man who grew up in {{city}}.\n', 'synthetic': 'Mr.Williams is a 28 year old man who grew up in Dallas.'} -------------- { 'original': 'Date: {{date_time}}\\nName: {{name}}\\nPhone: {{phone_number}}\n', 'synthetic': 'Date: 01/03/2021 13:45 \\nName: Pratima Joshi \\nPhone: 467-562-8954'} -------------- { 'original': '{{first_name}}: "Who are you?"\\n{{first_name_female}}:"I\'m {{first_name}}\'s daughter".\n', 'synthetic': 'Bob: "Who are you?"\\nMaria:"I\'m Bob\'s daughter".'} -------------- { 'original': 'At my suggestion, one morning over breakfast, she agreed, and on the last Sunday before Labor ' 'Day we returned to {{city}} by helicopter.\n', 'synthetic': 'At my suggestion, one morning over breakfast, she agreed, and on the last Sunday before ' 'Labor Day we returned to Paris by helicopter.'}
-------------- { 'original': "It was a done thing between him and {{first_name}}'s kid; and everybody thought so.\n", 'synthetic': "It was a done thing between him and Jeffery's kid; and everybody thought so."} -------------- { 'original': 'Capitalized words like Wisdom and Discipline are often mistaken with names.\n', 'synthetic': 'Capitalized words like Wisdom and Discipline are often mistaken with names.'} -------------- { 'original': 'The letter arrived at {{address}} last night.\n', 'synthetic': 'The letter arrived at 1143 Orange Street last night.'} -------------- { 'original': 'The Princess Royal arrived at {{city}} this morning from {{country}}.\n', 'synthetic': 'The Princess Royal arrived at London this morning from France.'} -------------- {'original': "I'm in {{city}}, at the conference\n", 'synthetic': "I'm in Toronto, at the conference."} -------------- { 'original': '{{name}}, the {{job}}, said: "I\'m glad to hear that this has been withdrawn – quite why they ' 'thought this would go down well is beyond me."\n', 'synthetic': 'Gloria Green, the Nurse Practitioner, said: "I\'m glad to hear that this has been withdrawn ' '– quite why they thought this would go down well is beyond me."'} -------------- { 'original': '"I\'m glad to hear that {{country}} is moving in that direction," says {{last_name}}.\n', 'synthetic': '"I\'m glad to hear that Canada is moving in that direction," says Smith.'} -------------- { 'original': 'I am {{nation_woman}} but I live in {{country}}.\n', 'synthetic': 'I am Marianna Montenegro but I live in Ukraine.'} -------------- {'original': 'We are proud {{nation_plural}}\n', 'synthetic': 'We are proud Americans.'} -------------- { 'original': "{{person}}'s killers sentenced to life in prison\n", 'synthetic': "John Smith's killers sentenced to life in prison"} -------------- { 'original': "{{country}} leader gives 'kill without warning' order\n", 'synthetic': "Brazilian leader gives 'kill without warning' order"} -------------- { 
'original': 'The {{nationality}} Border Force have detained top-flight tennis player {{name_female}} over ' 'visa disputes.\n', 'synthetic': 'The British Border Force have detained top-flight tennis player Maria Rodriguez over visa ' 'disputes.'} -------------- { 'original': 'You will be responsible for the husbandry and care of a large variety of species including ' 'lemurs, antelope, camels, and more\n', 'synthetic': 'You will be responsible for the husbandry and care of a large variety of species including ' 'lemurs, antelope, camels, and more.'} -------------- { 'original': '{{name}}\\n\\n{{job}}\\n\\nPersonal ' 'Info:\\nPhone:\\n{{phone_number}}\\n\\nE-mail:\\n{{email}}\\n\\nWebsite:\\n{{url}}\\n\\nAddress:\\n{{address}}.\n', 'synthetic': 'Robert James\\n\\nSoftware Engineer\\n\\nPersonal ' 'Info:\\nPhone:\\n555-847-8915\\n\\nE-mail:\\nrobertjames@example.com\\n\\nWebsite:\\nwww.example.com\\n\\nAddress:\\n277 ' 'Park Ave North, Denver, CO 80100.'} -------------- { 'original': '{{name}}\\n\\n{{city}}\\n{{country}}\n', 'synthetic': 'John Smith\n Los Angeles\n United States'} -------------- { 'original': 'Title VII of the Civil Rights Act of {{year}} protects individuals against employment ' 'discrimination on the basis of race and color as well as national origin, sex, or religion.\n', 'synthetic': 'Title VII of the Civil Rights Act of 1964 protects individuals against employment ' 'discrimination on the basis of race and color as well as national origin, sex, or religion.'} -------------- { 'original': 'Energetic and driven salesperson with 8+ years of professional experience in inbound and ' 'outbound sales. Awarded Salesperson of the Month three times. Helped increase inbound sales ' 'by 16% within the first year of employment. 
Looking to support {{organization}} in {{city}} ' '{{zipcode}} in its mission to become a market-leading solution.\n', 'synthetic': 'Energetic and driven salesperson with 8+ years of professional experience in inbound and ' 'outbound sales. Awarded Salesperson of the Month three times. Helped increase inbound sales ' 'by 16% within the first year of employment. Looking to support Acme Corporation in Los ' 'Angeles 90018 in its mission to become a market-leading solution.'} -------------- { 'original': 'The bus drops you off at {{building_number}} {{street_name}} St.\n', 'synthetic': 'The bus drops you off at 2774 Chestnut St.'} -------------- { 'original': 'Ask the driver to stop at the corner of {{street_name}} St. and {{street_name}} St.\n', 'synthetic': 'Ask the driver to stop at the corner of Maple St. and Sycamore St.'} -------------- { 'original': 'He lives on the north side of {{street_name}}.\n', 'synthetic': 'He lives on the north side of Abbey Road.'} -------------- { 'original': "I used to work for {{organization}} as {{job}}, but quit a few months ago. Now I'm " 'unemployed.\n', 'synthetic': "I used to work for ABC Corporation as Software Engineer, but quit a few months ago. Now I'm " 'unemployed.'} -------------- {'original': '{{city}} bridge is falling down.\n', 'synthetic': 'Berlin bridge is falling down.'} -------------- { 'original': '{{name}} of {{organization}} is the CEO of the year. ABC Business considered several other ' "influential CEOs for this year's honor, including {{name}} of {{organization}}, " "{{organization}}'s {{name}}, {{name}} of {{organization}}'s, {{name}} of {{organization}}, " "and {{organization}}'s {{name}}.\n", 'synthetic': 'Franklin Smith of Technology Solutions International is the CEO of the year. 
ABC Business ' "considered several other influential CEOs for this year's honor, including Madison Chang of " "Radiance Digital, Radiance Digital's Ashleigh Jones, Cash Huang of Clark & Partner's, Sierra " "Urbina of Intelicity, and Clark & Partners's Asher Kenney."} -------------- { 'original': '{{organization}} is a design agency based in {{city}}.\n', 'synthetic': 'Maestro Design Inc. is a design agency based in Amsterdam.'} -------------- { 'original': 'Action & Adventure, Animation, Comedy, Kids & Family, Mystery & Suspense\\nDirected By: ' '{{name}}\n', 'synthetic': 'Action & Adventure, Animation, Comedy, Kids & Family, Mystery & Suspense\n' 'Directed By: Esther Jones'} -------------- { 'original': '{{first_name}}: What a wife.\\n{{first_name}}: Remember me, {{first_name}}? When I killed ' 'your brother, I talked just like this!\\n{{first_name}}: You saved my life! How can I ever ' 'repay you?\n', 'synthetic': 'Emma: What a wife.\n' ' Emma: Remember me, Emma? When I killed your brother, I talked just like this!\n' ' Emma: You saved my life! How can I ever repay you?'} -------------- {'original': 'He just turned {{age}} years old\n', 'synthetic': 'He just turned 7 years old'} -------------- { 'original': "I'm {{name}}, originally from {{city}}, and i'm {{age}} y/o.\n", 'synthetic': "I'm Emily Evanston, originally from London, and I'm 24 y/o."} -------------- { 'original': 'Patient is a {{age}}-year-old male with a history of headaches\n', 'synthetic': 'Patient is a 35-year-old male with a history of headaches'} -------------- {'original': 'I just turned {{age}}\n', 'synthetic': 'I just turned 24.'} -------------- {'original': 'My father retired at the age of {{age}}\n', 'synthetic': 'My father retired at the age of 60.'} -------------- { 'original': 'This {{age}} year old female complaining of stomach pain.\n', 'synthetic': 'This 28 year old female complaining of stomach pain.'} -------------- { 'original': "My birthday is on the weekend. 
I'll turn {{age}}.\n", 'synthetic': "My birthday is on the weekend. I'll turn 20."} -------------- {'original': 'My brother just turned {{age}}\n', 'synthetic': 'My brother just turned 18.'}
-------------- { 'original': '{{prefix}} {{last_name}} flew to {{city}} on {{day_of_week}} morning.', 'synthetic': 'Dr. Nguyen flew to Los Angeles on Tuesday morning.'} --------------
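The loop above pauses a fixed three seconds between calls. A common alternative is to retry a failed call with exponentially growing delays. The sketch below is generic and standard-library only; `call_with_backoff` and its parameters are hypothetical, and it catches every exception for brevity, whereas real code would narrow this to the client's rate-limit error.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to rate-limit errors in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

With the functions defined in this notebook, a call could then look like `call_with_backoff(lambda: call_completion_model(create_prompt(template)))`.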
This notebook demonstrates how to leverage OpenAI models for fake/surrogate data generation. It first uses Presidio to de-identify the data (de-identification might be required before passing the data to OpenAI), and then uses OpenAI models to create synthetic/fake/surrogate data based on the real data. OpenAI models could also potentially remove additional PII entities that Presidio did not detect.
Some impressions:
- LLMs sometimes give additional output, especially if the text is a question or concerns a human/bot interaction. Prompt engineering can mitigate some of these issues, but post-processing might still be required.
- LLMs sometimes create fake values even in the absence of placeholders.
- LLMs sometimes reuse context from other sentences, which can cause mistakes such as phone numbers generated using a credit card pattern.
- Co-references are sometimes missed (e.g. two name placeholders that should be filled with the same name, or a he/she pronoun that should match a male/female name).
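One possible post-processing check for the first two points is to verify that a synthetic sentence keeps the fixed text of its template and only fills in the placeholders. The sketch below is a rough heuristic using only the standard library; `follows_template` and its regular expressions are assumptions for illustration. It tolerates a trailing period or whitespace, which the model often adds, but it does not address co-reference or value-format mistakes.

```python
import re

# Placeholders in either {{entity_type}} or <ENTITY_TYPE> style.
PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}|<[A-Z_]+>")


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn a template into a regex where every placeholder matches any text."""
    fixed_parts = PLACEHOLDER.split(template.strip())
    pattern = "(.+?)".join(re.escape(part) for part in fixed_parts)
    # Allow trailing periods or whitespace added by the model.
    return re.compile(pattern + r"[.\s]*", re.DOTALL)


def follows_template(template: str, synthetic: str) -> bool:
    """True if the synthetic text preserves the template's fixed text."""
    return template_to_regex(template).fullmatch(synthetic.strip()) is not None


print(follows_template("We moved here from {{city}}\n", "We moved here from Paris."))
print(follows_template("3... 2... 1... liftoff!\n", "Sure! Here is the text: 3... 2... 1... liftoff!"))
```

Sentences that fail the check could be regenerated or reviewed manually. The match is case-sensitive, so a response that silently fixes capitalization in the fixed text would also be flagged.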