pe.api package
- class pe.api.API[source]
Bases:
ABC
The abstract class that defines the APIs for the synthetic data generation.
- abstract random_api(label_info, num_samples)[source]
The abstract method that generates random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- abstract variation_api(syn_data)[source]
The abstract method that generates variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data
- class pe.api.Avatar(res, variation_degrees, crop=(40, 40, 224, 224), num_processes=50, chunksize=100)[source]
Bases:
API
The API that uses the python_avatars library to generate synthetic avatar images.
- __init__(res, variation_degrees, crop=(40, 40, 224, 224), num_processes=50, chunksize=100)[source]
Constructor.
- Parameters:
res (int) – The resolution of the generated images
variation_degrees (float or list[float]) – The variation degrees utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the probability of changing a parameter to a random value.
crop (tuple, optional) – The crop of the generated images from the python_avatars library, defaults to (40, 40, 264 - 40, 280 - 56)
num_processes (int, optional) – The number of processes to use for parallel generation, defaults to 50
chunksize (int, optional) – The chunksize for parallel generation, defaults to 100
- _get_params_from_avatar(avatar)[source]
Get the parameters of an avatar.
- Parameters:
avatar (python_avatars.Avatar) – The avatar
- Returns:
The parameters of the avatar
- Return type:
dict
- _get_random_image(_)[source]
Get a random image and its parameters.
- Parameters:
_ (int) – The index of the sample
- Returns:
The image and its parameters
- Return type:
tuple[np.ndarray, dict]
- _get_variation_image(params, variation_degree)[source]
Get a variation image and its parameters.
- Parameters:
params (dict) – The parameters of the avatar
variation_degree (float) – The degree of variation
- Returns:
The image of the avatar and its parameters
- Return type:
tuple[np.ndarray, dict]
- _svg_to_numpy(svg)[source]
Converts an SVG string to an image in numpy array format.
- Parameters:
svg (str) – The SVG string
- Returns:
The image in numpy array format
- Return type:
np.ndarray
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Creating variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type:
- class pe.api.DrawText(font_root_path, font_variation_degrees, font_size_variation_degrees, rotation_degree_variation_degrees, stroke_width_variation_degrees, text_list=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], width=28, height=28, background_color=(0, 0, 0), text_color=(255, 255, 255), font_size_list=range(10, 30), stroke_width_list=[0, 1, 2], rotation_degree_list=range(-30, 31))[source]
Bases:
API
The API that uses the PIL library to generate synthetic images with text on them.
- __init__(font_root_path, font_variation_degrees, font_size_variation_degrees, rotation_degree_variation_degrees, stroke_width_variation_degrees, text_list=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], width=28, height=28, background_color=(0, 0, 0), text_color=(255, 255, 255), font_size_list=range(10, 30), stroke_width_list=[0, 1, 2], rotation_degree_list=range(-30, 31))[source]
Constructor.
- Parameters:
font_root_path (str) – The root path that contains the font files in .ttf format
font_variation_degrees (float or list[float]) – The variation degrees for font utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the probability of changing the font to a random font.
font_size_variation_degrees (int or list[int]) – The variation degrees for font size utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in font size.
rotation_degree_variation_degrees (int or list[int]) – The variation degrees for rotation degree utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in rotation degree.
stroke_width_variation_degrees (int or list[int]) – The variation degrees for stroke width utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in stroke width.
text_list (list or dict, optional) – The texts to be used in the synthetic images. It can be a dictionary that maps label_names to a list of strings, meaning the texts to be used for each label_name. If a list is provided, the same texts will be used for all label_names. Defaults to [“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”]
width (int, optional) – The width of the synthetic images, defaults to 28
height (int, optional) – The height of the synthetic images, defaults to 28
background_color (tuple, optional) – The background color of the synthetic images, defaults to (0, 0, 0)
text_color (tuple, optional) – The text color of the synthetic images, defaults to (255, 255, 255)
font_size_list (list, optional) – The feasible set of font sizes to be used in the synthetic images, defaults to range(10, 30)
stroke_width_list (list, optional) – The feasible set of stroke widths to be used in the synthetic images, defaults to [0, 1, 2]
rotation_degree_list (list, optional) – The feasible set of rotation degrees to be used in the synthetic images, defaults to range(-30, 31, 1)
- _create_image(font_size, font_file, text, stroke_width, rotation_degree)[source]
Create an image with text on it.
- Parameters:
font_size (int) – The font size
font_file (str) – The font file
text (str) – The text
stroke_width (int) – The stroke width
rotation_degree (int) – The rotation degree
- Returns:
The image with text on it
- Return type:
np.ndarray
- _get_random_image(label_name)[source]
Get a random image and its parameters.
- Parameters:
label_name (str) – The label name
- Returns:
The image and its parameters
- Return type:
tuple[np.ndarray, dict]
- _get_variation_image(font_size, font_file, text, stroke_width, rotation_degree, font_size_variation_degree, font_variation_degree, stroke_width_variation_degree, rotation_degree_variation_degree)[source]
Get a variation image and its parameters.
- Parameters:
font_size (int) – The font size
font_file (str) – The font file
text (str) – The text
stroke_width (int) – The stroke width
rotation_degree (int) – The rotation degree
font_size_variation_degree (int) – The degree of variation in font size
font_variation_degree (float) – The degree of variation in font
stroke_width_variation_degree (int) – The degree of variation in stroke width
rotation_degree_variation_degree (int) – The degree of variation in rotation degree
- Returns:
The image of the avatar and its parameters
- Return type:
tuple[np.ndarray, dict]
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Creating variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type:
- class pe.api.ImprovedDiffusion(variation_degrees, model_path, model_image_size=64, num_channels=192, num_res_blocks=3, learn_sigma=True, class_cond=True, use_checkpoint=False, attention_resolutions='16,8', num_heads=4, num_heads_upsample=-1, use_scale_shift_norm=True, dropout=0.0, diffusion_steps=4000, sigma_small=False, noise_schedule='cosine', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False, timestep_respacing='100', batch_size=2000, use_ddim=True, clip_denoised=True, use_data_parallel=True)[source]
Bases:
API
The image API that utilizes improved diffusion models from https://arxiv.org/abs/2102.09672.
- __init__(variation_degrees, model_path, model_image_size=64, num_channels=192, num_res_blocks=3, learn_sigma=True, class_cond=True, use_checkpoint=False, attention_resolutions='16,8', num_heads=4, num_heads_upsample=-1, use_scale_shift_norm=True, dropout=0.0, diffusion_steps=4000, sigma_small=False, noise_schedule='cosine', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False, timestep_respacing='100', batch_size=2000, use_ddim=True, clip_denoised=True, use_data_parallel=True)[source]
Constructor. See https://github.com/openai/improved-diffusion for the explanation of the parameters not listed here.
- Parameters:
variation_degrees (int or list[int]) – The variation degrees utilized at each PE iteration. If a single int is provided, the same variation degree will be used for all iterations.
model_path (str) – The path of the model checkpoint
diffusion_steps (int, optional) – The total number of diffusion steps, defaults to 4000
timestep_respacing (str or list[str], optional) – The step configurations for image generation utilized at each PE iteration. If a single str is provided, the same step configuration will be used for all iterations. Defaults to “100”
batch_size (int, optional) – The batch size for image generation, defaults to 2000
use_data_parallel (bool, optional) – Whether to use data parallel during image generation, defaults to True
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label, not utilized in this API
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Generating variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type:
- class pe.api.ImprovedDiffusion270M(variation_degrees, model_path=None, batch_size=2000, timestep_respacing='100', use_data_parallel=True)[source]
Bases:
ImprovedDiffusion
- CHECKPOINT_URL = 'https://openaipublic.blob.core.windows.net/diffusion/march-2021/imagenet64_cond_270M_250K.pt'
The URL of the checkpoint path
- __init__(variation_degrees, model_path=None, batch_size=2000, timestep_respacing='100', use_data_parallel=True)[source]
The “Class-conditional ImageNet-64 model (270M parameters, trained for 250K iterations)” model from the Improved Diffusion paper.
- Parameters:
variation_degrees (list[int]) – The variation degrees utilized at each PE iteration
model_path (str) – The path of the model checkpoint. If not provided, the checkpoint will be downloaded from the CHECKPOINT_URL
batch_size (int, optional) – The batch size for image generation, defaults to 2000
timestep_respacing (str, optional) – The step configuration for image generation, defaults to “100”
use_data_parallel (bool, optional) – Whether to use data parallel during image generation, defaults to True
- class pe.api.LLMAugPE(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]
Bases:
API
The text API that uses open-source or API-based LLMs. This algorithm is initially proposed in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749)
- __init__(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]
Constructor.
- Parameters:
llm (
pe.llm.LLM
) – The LLM utilized for the random and variation generationrandom_api_prompt_file (str) – The prompt file for the random API. See the explanations to
variation_api_prompt_file
for the format of the prompt filevariation_api_prompt_file (str) –
The prompt file for the variation API. The file is in JSON format and contains the following fields:
message_template
: A list of messages that will be sent to the LLM. Each message contains the following fields:content
: The content of the message. The content can contain variable placeholders (e.g., {variable_name}). The variable_name can be label name in the original data that will be replaced by the actual label value; or “sample” that will be replaced by the input text to the variation API; or “masked_sample” that will be replaced by the masked/blanked input text to the variation API when the blanking feature is enabled; or “word_count” that will be replaced by the target word count of the text when the word count variation feature is enabled; or other variables specified in the replacement rules (see below).role
: The role of the message. The role can be “system”, “user”, or “assistant”.
replacement_rules
: A list of replacement rules that will be applied one by one to update the variable list. Each replacement rule contains the following fields:constraints
: A dictionary of constraints that must be satisfied for the replacement rule to be applied. The key is the variable name and the value is the variable value.replacements
: A dictionary of replacements that will be used to update the variable list if the constraints are satisfied. The key is the variable name and the value is the variable value or a list of variable values to choose from in a uniform random manner.
min_word_count (int, optional) – The minimum word count for the variation API, defaults to 0
word_count_std (float, optional) – The standard deviation for the word count for the variation API. If None, the word count variation feature is disabled and “{word_count}” variable will not be provided to the prompt. Defaults to None
token_to_word_ratio (float, optional) – The token to word ratio for the variation API. If not None, the maximum completion tokens will be set to
token_to_word_ratio
times the target word count when the word count variation feature is enabled. Defaults to Nonemax_completion_tokens_limit (int, optional) – The maximum completion tokens limit for the variation API, defaults to None
blank_probabilities (float or list[float], optional) – The token blank probabilities for the variation API utilized at each PE iteration. If a single float is provided, the same blank probability will be used for all iterations. If None, the blanking feature is disabled and “{masked_sample}” variable will not be provided to the prompt. Defaults to None
tokenizer_model (str, optional) – The tokenizer model used for blanking, defaults to “gpt-3.5-turbo”
- _blank_sample(sample, blank_probability)[source]
Blanking the input text.
- Parameters:
sample (str) – The input text
blank_probability (float) – The token blank probability
- Returns:
The blanked input text
- Return type:
str
- _construct_prompt(prompt_config, variables)[source]
Applying the replacement rules to construct the final prompt messages.
- Parameters:
prompt_config (dict) – The prompt configuration
variables (dict) – The inital variables to be used in the prompt messages
- Returns:
The constructed prompt messages
- Return type:
list[dict]
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Generating variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type:
- class pe.api.NearestImage(data, embedding, nearest_neighbor_mode, variation_degrees, nearest_neighbor_backend='auto')[source]
Bases:
API
The API that generates synthetic images by randomly drawing an image from the given dataset as the RANDOM_API and finding the nearest images in the given dataset as the VARIATION_API.
- __init__(data, embedding, nearest_neighbor_mode, variation_degrees, nearest_neighbor_backend='auto')[source]
Constructor.
- Parameters:
data (
pe.data.Data
) – The data object that contains the imagesembedding (
pe.embedding.Embedding
) – The embedding object that computes the embeddings of the imagesnearest_neighbor_mode (str) – The distance metric to use for finding the nearest neighbors. It should be one of the following: “l2” (l2 distance), “cos_sim” (cosine similarity), “ip” (inner product). Not all backends support all modes
variation_degrees (int or list[int]) – The variation degrees utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the number of nearest neighbors to consider for the VARIAITON_API
nearest_neighbor_backend (str, optional) – The backend to use for finding the nearest neighbors. It should be one of the following: “faiss” (FAISS), “sklearn” (scikit-learn), “auto” (using FAISS if available, otherwise scikit-learn). Defaults to “auto”. FAISS supports GPU and is much faster when the number of samples is large. It requires the installation of faiss-gpu or faiss-cpu package. See https://faiss.ai/
- Raises:
ValueError – If the nearest_neighbor_backend is unknown
- _build_nearest_neighbor_graph()[source]
Finding the nearest neighbor for each sample in the given dataset.
- random_api(label_info, num_samples)[source]
Generating random synthetic data by randomly drawing images from the given dataset.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label, not utilized in this API
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Generating variations of the synthetic data by finding the nearest images in the given dataset.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type:
- class pe.api.StableDiffusion(prompt, variation_degrees, width=512, height=512, random_api_checkpoint='CompVis/stable-diffusion-v1-4', random_api_guidance_scale=7.5, random_api_num_inference_steps=50, random_api_batch_size=10, variation_api_checkpoint='CompVis/stable-diffusion-v1-4', variation_api_guidance_scale=7.5, variation_api_num_inference_steps=50, variation_api_batch_size=10)[source]
Bases:
API
The API that uses the Stable Diffusion model to generate synthetic data.
- __init__(prompt, variation_degrees, width=512, height=512, random_api_checkpoint='CompVis/stable-diffusion-v1-4', random_api_guidance_scale=7.5, random_api_num_inference_steps=50, random_api_batch_size=10, variation_api_checkpoint='CompVis/stable-diffusion-v1-4', variation_api_guidance_scale=7.5, variation_api_num_inference_steps=50, variation_api_batch_size=10)[source]
Constructor.
- Parameters:
prompt (str or dict) – The prompt used for each label name. It can be either a string or a dictionary. If it is a string, it should be the path to a JSON file that contains the prompt for each label name. If it is a dictionary, it should be a dictionary that maps each label name to its prompt
variation_degrees (float or list[float]) – The variation degrees utilized at each PE iteration. If a single float is provided, the same variation degree will be used for all iterations.
width (int, optional) – The width of the generated images, defaults to 512
height (int, optional) – The height of the generated images, defaults to 512
random_api_checkpoint (str, optional) – The checkpoint of the random API, defaults to “CompVis/stable-diffusion-v1-4”
random_api_guidance_scale (float, optional) – The guidance scale of the random API, defaults to 7.5
random_api_num_inference_steps (int, optional) – The number of inference steps of the random API, defaults to 50
random_api_batch_size (int, optional) – The batch size of the random API, defaults to 10
variation_api_checkpoint (str, optional) – The checkpoint of the variation API, defaults to “CompVis/stable-diffusion-v1-4”
variation_api_guidance_scale (float or list[float], optional) – The guidance scale of the variation API utilized at each PE iteration. If a single float is provided, the same guidance scale will be used for all iterations. Defaults to 7.5
variation_api_num_inference_steps (int or list[int], optional) – The number of inference steps of the variation API utilized at each PE iteration. If a single int is provided, the same number of inference steps will be used for all iterations. Defaults to 50
variation_api_batch_size (int, optional) – The batch size of the variation API, defaults to 10
- Raises:
ValueError – If the prompt is neither a string nor a dictionary
- random_api(label_info, num_samples)[source]
Generating random synthetic data.
- Parameters:
label_info (omegaconf.dictconfig.DictConfig) – The info of the label
num_samples (int) – The number of random samples to generate
- Returns:
The data object of the generated synthetic data
- Return type:
- variation_api(syn_data)[source]
Generating variations of the synthetic data.
- Parameters:
syn_data (
pe.data.Data
) – The data object of the synthetic data- Returns:
The data object of the variation of the input synthetic data
- Return type: