pe.api package

class pe.api.API[source]

Bases: ABC

The abstract class that defines the APIs for the synthetic data generation.

abstract random_api(label_info, num_samples)[source]

The abstract method that generates random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label

  • num_samples (int) – The number of random samples to generate

abstract variation_api(syn_data)[source]

The abstract method that generates variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

class pe.api.Avatar(res, variation_degrees, crop=(40, 40, 224, 224), num_processes=50, chunksize=100)[source]

Bases: API

The API that uses the python_avatars library to generate synthetic avatar images.

__init__(res, variation_degrees, crop=(40, 40, 224, 224), num_processes=50, chunksize=100)[source]

Constructor.

Parameters:
  • res (int) – The resolution of the generated images

  • variation_degrees (float or list[float]) – The variation degrees utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the probability of changing a parameter to a random value.

  • crop (tuple, optional) – The crop of the generated images from the python_avatars library, defaults to (40, 40, 264 - 40, 280 - 56)

  • num_processes (int, optional) – The number of processes to use for parallel generation, defaults to 50

  • chunksize (int, optional) – The chunksize for parallel generation, defaults to 100

_get_params_from_avatar(avatar)[source]

Get the parameters of an avatar.

Parameters:

avatar (python_avatars.Avatar) – The avatar

Returns:

The parameters of the avatar

Return type:

dict

_get_random_image(_)[source]

Get a random image and its parameters.

Parameters:

_ (int) – The index of the sample

Returns:

The image and its parameters

Return type:

tuple[np.ndarray, dict]

_get_variation_image(params, variation_degree)[source]

Get a variation image and its parameters.

Parameters:
  • params (dict) – The parameters of the avatar

  • variation_degree (float) – The degree of variation

Returns:

The image of the avatar and its parameters

Return type:

tuple[np.ndarray, dict]

_svg_to_numpy(svg)[source]

Converts an SVG string to an image in numpy array format.

Parameters:

svg (str) – The SVG string

Returns:

The image in numpy array format

Return type:

np.ndarray

random_api(label_info, num_samples)[source]

Generating random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Creating variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

class pe.api.DrawText(font_root_path, font_variation_degrees, font_size_variation_degrees, rotation_degree_variation_degrees, stroke_width_variation_degrees, text_list=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], width=28, height=28, background_color=(0, 0, 0), text_color=(255, 255, 255), font_size_list=range(10, 30), stroke_width_list=[0, 1, 2], rotation_degree_list=range(-30, 31))[source]

Bases: API

The API that uses the PIL library to generate synthetic images with text on them.

__init__(font_root_path, font_variation_degrees, font_size_variation_degrees, rotation_degree_variation_degrees, stroke_width_variation_degrees, text_list=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], width=28, height=28, background_color=(0, 0, 0), text_color=(255, 255, 255), font_size_list=range(10, 30), stroke_width_list=[0, 1, 2], rotation_degree_list=range(-30, 31))[source]

Constructor.

Parameters:
  • font_root_path (str) – The root path that contains the font files in .ttf format

  • font_variation_degrees (float or list[float]) – The variation degrees for font utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the probability of changing the font to a random font.

  • font_size_variation_degrees (int or list[int]) – The variation degrees for font size utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in font size.

  • rotation_degree_variation_degrees (int or list[int]) – The variation degrees for rotation degree utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in rotation degree.

  • stroke_width_variation_degrees (int or list[int]) – The variation degrees for stroke width utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the maximum possible variation in stroke width.

  • text_list (list or dict, optional) – The texts to be used in the synthetic images. It can be a dictionary that maps label_names to a list of strings, meaning the texts to be used for each label_name. If a list is provided, the same texts will be used for all label_names. Defaults to [“0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”]

  • width (int, optional) – The width of the synthetic images, defaults to 28

  • height (int, optional) – The height of the synthetic images, defaults to 28

  • background_color (tuple, optional) – The background color of the synthetic images, defaults to (0, 0, 0)

  • text_color (tuple, optional) – The text color of the synthetic images, defaults to (255, 255, 255)

  • font_size_list (list, optional) – The feasible set of font sizes to be used in the synthetic images, defaults to range(10, 30)

  • stroke_width_list (list, optional) – The feasible set of stroke widths to be used in the synthetic images, defaults to [0, 1, 2]

  • rotation_degree_list (list, optional) – The feasible set of rotation degrees to be used in the synthetic images, defaults to range(-30, 31, 1)

_create_image(font_size, font_file, text, stroke_width, rotation_degree)[source]

Create an image with text on it.

Parameters:
  • font_size (int) – The font size

  • font_file (str) – The font file

  • text (str) – The text

  • stroke_width (int) – The stroke width

  • rotation_degree (int) – The rotation degree

Returns:

The image with text on it

Return type:

np.ndarray

_get_random_image(label_name)[source]

Get a random image and its parameters.

Parameters:

label_name (str) – The label name

Returns:

The image and its parameters

Return type:

tuple[np.ndarray, dict]

_get_variation_image(font_size, font_file, text, stroke_width, rotation_degree, font_size_variation_degree, font_variation_degree, stroke_width_variation_degree, rotation_degree_variation_degree)[source]

Get a variation image and its parameters.

Parameters:
  • font_size (int) – The font size

  • font_file (str) – The font file

  • text (str) – The text

  • stroke_width (int) – The stroke width

  • rotation_degree (int) – The rotation degree

  • font_size_variation_degree (int) – The degree of variation in font size

  • font_variation_degree (float) – The degree of variation in font

  • stroke_width_variation_degree (int) – The degree of variation in stroke width

  • rotation_degree_variation_degree (int) – The degree of variation in rotation degree

Returns:

The image of the avatar and its parameters

Return type:

tuple[np.ndarray, dict]

random_api(label_info, num_samples)[source]

Generating random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Creating variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

class pe.api.ImprovedDiffusion(variation_degrees, model_path, model_image_size=64, num_channels=192, num_res_blocks=3, learn_sigma=True, class_cond=True, use_checkpoint=False, attention_resolutions='16,8', num_heads=4, num_heads_upsample=-1, use_scale_shift_norm=True, dropout=0.0, diffusion_steps=4000, sigma_small=False, noise_schedule='cosine', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False, timestep_respacing='100', batch_size=2000, use_ddim=True, clip_denoised=True, use_data_parallel=True)[source]

Bases: API

The image API that utilizes improved diffusion models from https://arxiv.org/abs/2102.09672.

__init__(variation_degrees, model_path, model_image_size=64, num_channels=192, num_res_blocks=3, learn_sigma=True, class_cond=True, use_checkpoint=False, attention_resolutions='16,8', num_heads=4, num_heads_upsample=-1, use_scale_shift_norm=True, dropout=0.0, diffusion_steps=4000, sigma_small=False, noise_schedule='cosine', use_kl=False, predict_xstart=False, rescale_timesteps=False, rescale_learned_sigmas=False, timestep_respacing='100', batch_size=2000, use_ddim=True, clip_denoised=True, use_data_parallel=True)[source]

Constructor. See https://github.com/openai/improved-diffusion for the explanation of the parameters not listed here.

Parameters:
  • variation_degrees (int or list[int]) – The variation degrees utilized at each PE iteration. If a single int is provided, the same variation degree will be used for all iterations.

  • model_path (str) – The path of the model checkpoint

  • diffusion_steps (int, optional) – The total number of diffusion steps, defaults to 4000

  • timestep_respacing (str or list[str], optional) – The step configurations for image generation utilized at each PE iteration. If a single str is provided, the same step configuration will be used for all iterations. Defaults to “100”

  • batch_size (int, optional) – The batch size for image generation, defaults to 2000

  • use_data_parallel (bool, optional) – Whether to use data parallel during image generation, defaults to True

random_api(label_info, num_samples)[source]

Generating random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label, not utilized in this API

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Generating variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

class pe.api.ImprovedDiffusion270M(variation_degrees, model_path=None, batch_size=2000, timestep_respacing='100', use_data_parallel=True)[source]

Bases: ImprovedDiffusion

CHECKPOINT_URL = 'https://openaipublic.blob.core.windows.net/diffusion/march-2021/imagenet64_cond_270M_250K.pt'

The URL of the checkpoint path

__init__(variation_degrees, model_path=None, batch_size=2000, timestep_respacing='100', use_data_parallel=True)[source]

The “Class-conditional ImageNet-64 model (270M parameters, trained for 250K iterations)” model from the Improved Diffusion paper.

Parameters:
  • variation_degrees (list[int]) – The variation degrees utilized at each PE iteration

  • model_path (str) – The path of the model checkpoint. If not provided, the checkpoint will be downloaded from the CHECKPOINT_URL

  • batch_size (int, optional) – The batch size for image generation, defaults to 2000

  • timestep_respacing (str, optional) – The step configuration for image generation, defaults to “100”

  • use_data_parallel (bool, optional) – Whether to use data parallel during image generation, defaults to True

class pe.api.LLMAugPE(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]

Bases: API

The text API that uses open-source or API-based LLMs. This algorithm is initially proposed in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749)

__init__(llm, random_api_prompt_file, variation_api_prompt_file, min_word_count=0, word_count_std=None, token_to_word_ratio=None, max_completion_tokens_limit=None, blank_probabilities=None, tokenizer_model='gpt-3.5-turbo')[source]

Constructor.

Parameters:
  • llm (pe.llm.LLM) – The LLM utilized for the random and variation generation

  • random_api_prompt_file (str) – The prompt file for the random API. See the explanations to variation_api_prompt_file for the format of the prompt file

  • variation_api_prompt_file (str) –

    The prompt file for the variation API. The file is in JSON format and contains the following fields:

    • message_template: A list of messages that will be sent to the LLM. Each message contains the following fields:

      • content: The content of the message. The content can contain variable placeholders (e.g., {variable_name}). The variable_name can be label name in the original data that will be replaced by the actual label value; or “sample” that will be replaced by the input text to the variation API; or “masked_sample” that will be replaced by the masked/blanked input text to the variation API when the blanking feature is enabled; or “word_count” that will be replaced by the target word count of the text when the word count variation feature is enabled; or other variables specified in the replacement rules (see below).

      • role: The role of the message. The role can be “system”, “user”, or “assistant”.

    • replacement_rules: A list of replacement rules that will be applied one by one to update the variable list. Each replacement rule contains the following fields:

      • constraints: A dictionary of constraints that must be satisfied for the replacement rule to be applied. The key is the variable name and the value is the variable value.

      • replacements: A dictionary of replacements that will be used to update the variable list if the constraints are satisfied. The key is the variable name and the value is the variable value or a list of variable values to choose from in a uniform random manner.

  • min_word_count (int, optional) – The minimum word count for the variation API, defaults to 0

  • word_count_std (float, optional) – The standard deviation for the word count for the variation API. If None, the word count variation feature is disabled and “{word_count}” variable will not be provided to the prompt. Defaults to None

  • token_to_word_ratio (float, optional) – The token to word ratio for the variation API. If not None, the maximum completion tokens will be set to token_to_word_ratio times the target word count when the word count variation feature is enabled. Defaults to None

  • max_completion_tokens_limit (int, optional) – The maximum completion tokens limit for the variation API, defaults to None

  • blank_probabilities (float or list[float], optional) – The token blank probabilities for the variation API utilized at each PE iteration. If a single float is provided, the same blank probability will be used for all iterations. If None, the blanking feature is disabled and “{masked_sample}” variable will not be provided to the prompt. Defaults to None

  • tokenizer_model (str, optional) – The tokenizer model used for blanking, defaults to “gpt-3.5-turbo”

_blank_sample(sample, blank_probability)[source]

Blanking the input text.

Parameters:
  • sample (str) – The input text

  • blank_probability (float) – The token blank probability

Returns:

The blanked input text

Return type:

str

_construct_prompt(prompt_config, variables)[source]

Applying the replacement rules to construct the final prompt messages.

Parameters:
  • prompt_config (dict) – The prompt configuration

  • variables (dict) – The inital variables to be used in the prompt messages

Returns:

The constructed prompt messages

Return type:

list[dict]

random_api(label_info, num_samples)[source]

Generating random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Generating variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

class pe.api.NearestImage(data, embedding, nearest_neighbor_mode, variation_degrees, nearest_neighbor_backend='auto')[source]

Bases: API

The API that generates synthetic images by randomly drawing an image from the given dataset as the RANDOM_API and finding the nearest images in the given dataset as the VARIATION_API.

__init__(data, embedding, nearest_neighbor_mode, variation_degrees, nearest_neighbor_backend='auto')[source]

Constructor.

Parameters:
  • data (pe.data.Data) – The data object that contains the images

  • embedding (pe.embedding.Embedding) – The embedding object that computes the embeddings of the images

  • nearest_neighbor_mode (str) – The distance metric to use for finding the nearest neighbors. It should be one of the following: “l2” (l2 distance), “cos_sim” (cosine similarity), “ip” (inner product). Not all backends support all modes

  • variation_degrees (int or list[int]) – The variation degrees utilized at each PE iteration. If a single value is provided, the same variation degree will be used for all iterations. The value means the number of nearest neighbors to consider for the VARIAITON_API

  • nearest_neighbor_backend (str, optional) – The backend to use for finding the nearest neighbors. It should be one of the following: “faiss” (FAISS), “sklearn” (scikit-learn), “auto” (using FAISS if available, otherwise scikit-learn). Defaults to “auto”. FAISS supports GPU and is much faster when the number of samples is large. It requires the installation of faiss-gpu or faiss-cpu package. See https://faiss.ai/

Raises:

ValueError – If the nearest_neighbor_backend is unknown

_build_nearest_neighbor_graph()[source]

Finding the nearest neighbor for each sample in the given dataset.

random_api(label_info, num_samples)[source]

Generating random synthetic data by randomly drawing images from the given dataset.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label, not utilized in this API

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Generating variations of the synthetic data by finding the nearest images in the given dataset.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

class pe.api.StableDiffusion(prompt, variation_degrees, width=512, height=512, random_api_checkpoint='CompVis/stable-diffusion-v1-4', random_api_guidance_scale=7.5, random_api_num_inference_steps=50, random_api_batch_size=10, variation_api_checkpoint='CompVis/stable-diffusion-v1-4', variation_api_guidance_scale=7.5, variation_api_num_inference_steps=50, variation_api_batch_size=10)[source]

Bases: API

The API that uses the Stable Diffusion model to generate synthetic data.

__init__(prompt, variation_degrees, width=512, height=512, random_api_checkpoint='CompVis/stable-diffusion-v1-4', random_api_guidance_scale=7.5, random_api_num_inference_steps=50, random_api_batch_size=10, variation_api_checkpoint='CompVis/stable-diffusion-v1-4', variation_api_guidance_scale=7.5, variation_api_num_inference_steps=50, variation_api_batch_size=10)[source]

Constructor.

Parameters:
  • prompt (str or dict) – The prompt used for each label name. It can be either a string or a dictionary. If it is a string, it should be the path to a JSON file that contains the prompt for each label name. If it is a dictionary, it should be a dictionary that maps each label name to its prompt

  • variation_degrees (float or list[float]) – The variation degrees utilized at each PE iteration. If a single float is provided, the same variation degree will be used for all iterations.

  • width (int, optional) – The width of the generated images, defaults to 512

  • height (int, optional) – The height of the generated images, defaults to 512

  • random_api_checkpoint (str, optional) – The checkpoint of the random API, defaults to “CompVis/stable-diffusion-v1-4”

  • random_api_guidance_scale (float, optional) – The guidance scale of the random API, defaults to 7.5

  • random_api_num_inference_steps (int, optional) – The number of inference steps of the random API, defaults to 50

  • random_api_batch_size (int, optional) – The batch size of the random API, defaults to 10

  • variation_api_checkpoint (str, optional) – The checkpoint of the variation API, defaults to “CompVis/stable-diffusion-v1-4”

  • variation_api_guidance_scale (float or list[float], optional) – The guidance scale of the variation API utilized at each PE iteration. If a single float is provided, the same guidance scale will be used for all iterations. Defaults to 7.5

  • variation_api_num_inference_steps (int or list[int], optional) – The number of inference steps of the variation API utilized at each PE iteration. If a single int is provided, the same number of inference steps will be used for all iterations. Defaults to 50

  • variation_api_batch_size (int, optional) – The batch size of the variation API, defaults to 10

Raises:

ValueError – If the prompt is neither a string nor a dictionary

random_api(label_info, num_samples)[source]

Generating random synthetic data.

Parameters:
  • label_info (omegaconf.dictconfig.DictConfig) – The info of the label

  • num_samples (int) – The number of random samples to generate

Returns:

The data object of the generated synthetic data

Return type:

pe.data.Data

variation_api(syn_data)[source]

Generating variations of the synthetic data.

Parameters:

syn_data (pe.data.Data) – The data object of the synthetic data

Returns:

The data object of the variation of the input synthetic data

Return type:

pe.data.Data

Subpackages

Submodules