pe.api.text.llm_augpe_api module
- class pe.api.text.llm_augpe_api.LLMAugPE(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]
Bases: API
The text API that uses open-source or API-based LLMs. This algorithm was first proposed in the ICML 2024 Spotlight paper, "Differentially Private Synthetic Data via Foundation Model APIs 2: Text" (https://arxiv.org/abs/2403.01749)
- __init__(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]
Constructor.
- Parameters:
  - llm (pe.llm.LLM) – The LLM utilized for the random and variation generation
  - random_api_prompt_file (str) – The prompt file for the random API. See the explanations to variation_api_prompt_file for the format of the prompt file
  - variation_api_prompt_file (str) – The prompt file for the variation API. The file is in JSON format and contains the following fields:
    - message_template: A list of messages that will be sent to the LLM. Each message contains the following fields:
      - content: The content of the message. The content can contain variable placeholders (e.g., {variable_name}). The variable_name can be a label name in the original data, which will be replaced by the actual label value; "sample", which will be replaced by the input text to the variation API; "masked_sample", which will be replaced by the masked/blanked input text when the blanking feature is enabled; "word_count", which will be replaced by the target word count of the text when the word count variation feature is enabled; or another variable specified in the replacement rules (see below).
      - role: The role of the message, one of "system", "user", or "assistant".
    - replacement_rules: A list of replacement rules that will be applied one by one to update the variable list. Each rule contains the following fields:
      - constraints: A dictionary of constraints that must be satisfied for the rule to be applied. The key is the variable name and the value is the required variable value.
      - replacements: A dictionary of replacements used to update the variable list if the constraints are satisfied. The key is the variable name and the value is either the new variable value or a list of values from which one is chosen uniformly at random.
  - min_word_count (int, optional) – The minimum word count for the variation API, defaults to 0
  - word_count_std (float, optional) – The standard deviation of the word count for the variation API. If None, the word count variation feature is disabled and the "{word_count}" variable will not be provided to the prompt. Defaults to None
  - token_to_word_ratio (float, optional) – The token-to-word ratio for the variation API. If not None, the maximum completion tokens will be set to token_to_word_ratio times the target word count when the word count variation feature is enabled. Defaults to None
  - max_completion_tokens_limit (int, optional) – The maximum completion tokens limit for the variation API, defaults to None
  - blank_probabilities (float or list[float], optional) – The token blank probabilities for the variation API used at each PE iteration. If a single float is provided, the same blank probability is used for all iterations. If None, the blanking feature is disabled and the "{masked_sample}" variable will not be provided to the prompt. Defaults to None
  - tokenizer_model (str, optional) – The tokenizer model used for blanking, defaults to "gpt-3.5-turbo"
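To make the field descriptions above concrete, here is an illustrative variation_api_prompt_file. The field names (message_template, content, role, replacement_rules, constraints, replacements) come from the documentation above; the specific prompt wording and the "label"/"style" variables are hypothetical examples, not part of the library:

```json
{
  "message_template": [
    {
      "role": "system",
      "content": "You are a helpful rewriting assistant."
    },
    {
      "role": "user",
      "content": "Rewrite the following {label} text in a {style} tone, using about {word_count} words: {sample}"
    }
  ],
  "replacement_rules": [
    {
      "constraints": {"label": "positive"},
      "replacements": {"style": ["cheerful", "enthusiastic"]}
    },
    {
      "constraints": {"label": "negative"},
      "replacements": {"style": "neutral"}
    }
  ]
}
```

Here "{label}" and "{sample}" are filled from the data, "{word_count}" is only available when word_count_std is set, and "{style}" is introduced by the replacement rules.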
- _blank_sample(sample, blank_probability)[source]
Blanking the input text.
- Parameters:
sample (str) – The input text
blank_probability (float) – The token blank probability
- Returns:
The blanked input text
- Return type:
str
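A minimal sketch of the blanking operation, under two stated assumptions: it splits on whitespace rather than using the configured tokenizer_model (the real method tokenizes, e.g., with a "gpt-3.5-turbo" tokenizer), and it uses "_" as the blank token, which is illustrative:

```python
import random


def blank_sample(sample, blank_probability, seed=None):
    """Sketch of _blank_sample: replace each unit of the input text with a
    blank independently with probability ``blank_probability``.

    Assumption: word-level splitting and "_" as the blank token; the actual
    implementation operates on tokens from ``tokenizer_model``.
    """
    rng = random.Random(seed)
    words = sample.split()
    blanked = ["_" if rng.random() < blank_probability else w for w in words]
    return " ".join(blanked)
```

With blank_probability=0.0 the text is returned unchanged; with 1.0 every unit is blanked.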
- _construct_prompt(prompt_config, variables)[source]
Applying the replacement rules to construct the final prompt messages.
- Parameters:
prompt_config (dict) – The prompt configuration
variables (dict) – The initial variables to be used in the prompt messages
- Returns:
The constructed prompt messages
- Return type:
list[dict]
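The replacement-rule semantics described under __init__ can be sketched as follows. This is an illustrative reimplementation, not the library's code: rules fire in order when all their constraints match the current variables, list-valued replacements are sampled uniformly, and the resolved variables are substituted into each message's content:

```python
import random


def construct_prompt(prompt_config, variables, rng=random):
    """Sketch of _construct_prompt: apply replacement_rules to the variable
    dict, then format the message_template contents with the result."""
    variables = dict(variables)  # avoid mutating the caller's dict
    for rule in prompt_config.get("replacement_rules", []):
        constraints = rule.get("constraints", {})
        # A rule applies only if every constrained variable has the required value.
        if all(variables.get(k) == v for k, v in constraints.items()):
            for k, v in rule.get("replacements", {}).items():
                # List values are chosen uniformly at random.
                variables[k] = rng.choice(v) if isinstance(v, list) else v
    return [
        {"role": m["role"], "content": m["content"].format(**variables)}
        for m in prompt_config["message_template"]
    ]
```

For example, a rule constrained on {"label": "positive"} can inject a "tone" variable that the message content then references as "{tone}".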
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
pe.data.Data
- variation_api(syn_data)[source]
Generating variations of the synthetic data.
- Parameters:
syn_data (pe.data.Data) – The data object of the synthetic data
- Returns:
The data object of the variation of the input synthetic data
- Return type:
pe.data.Data