pe.data package

class pe.data.Camelyon17(split='train', root_dir='data', res=64)[source]

Bases: Data

The Camelyon17 dataset.

__init__(split='train', root_dir='data', res=64)[source]

Constructor.

Parameters:

split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”
root_dir (str, optional) – The root directory to save the dataset, defaults to “data”
res (int, optional) – The resolution of the images, defaults to 64

Raises:

ValueError – If the split is invalid

class pe.data.Cat(root_dir='data', res=512)[source]

Bases: Data

The Cat dataset.

URL = 'https://www.kaggle.com/api/v1/datasets/download/fjxmlzn/cat-cookie-doudou': The URL of the dataset

__init__(root_dir='data', res=512)[source]

Constructor.

Parameters:

root_dir (str, optional) – The root directory to save the dataset, defaults to “data”
res (int, optional) – The resolution of the images, defaults to 512

_download()[source]: Download the dataset if it does not exist.

_read_data()[source]: Read the data from the zip file.

class pe.data.CelebA(root_dir, res=32, attr_name='Male', split='train')[source]

Bases: Data

The CelebA dataset.

__init__(root_dir, res=32, attr_name='Male', split='train')[source]

Constructor.

Parameters:

root_dir (str) – The root directory of the CelebA dataset
res (int, optional) – The resolution of the image, defaults to 32
attr_name (str, optional) – The attribute name to use as the label, defaults to “Male”
split (str, optional) – The split of the dataset, defaults to “train”

class pe.data.Cifar10(split='train')[source]

Bases: Data

The CIFAR10 dataset.

__init__(split='train')[source]

Constructor.

Parameters: split (str, optional) – The split of the dataset. It should be either “train” or “test”, defaults to “train”
Raises: ValueError – If the split is invalid

class pe.data.Data(data_frame=None, metadata={})[source]

Bases: object

The class that holds the private data or synthetic data from PE.

__init__(data_frame=None, metadata={})[source]

Constructor.

Parameters:

data_frame (pandas.DataFrame, optional) – A pandas dataframe that holds the data, defaults to None
metadata (dict, optional) – The metadata of the data, defaults to {}

classmethod concat(data_list, metadata=None)[source]

Concatenate the data frames of a list of data objects

Parameters:

data_list (list[pe.data.Data]) – The list of data objects to concatenate
metadata (dict, optional) – The metadata of the concatenated data. When None, the metadata of the list of data objects must be the same and will be used. Defaults to None

Raises:

ValueError – If the metadata of the data objects is not the same

Returns:

The concatenated data object

Return type:

pe.data.Data
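
The metadata rule described for concat can be illustrated with a small standalone sketch. It does not import pe: resolve_metadata is a hypothetical helper, and plain dicts stand in for the metadata of Data objects, so it mirrors only the documented behavior and is not the library's implementation.

```python
def resolve_metadata(metadata_list, metadata=None):
    """Pick the metadata for a concatenated object (hypothetical helper).

    Assumes metadata_list is non-empty.
    """
    if metadata is not None:
        # An explicitly supplied metadata dict is used for the result.
        return metadata
    first = metadata_list[0]
    for other in metadata_list[1:]:
        if other != first:
            raise ValueError("The metadata of the data objects is not the same")
    return first


same = resolve_metadata([{"iteration": 0}, {"iteration": 0}])
override = resolve_metadata(
    [{"iteration": 0}, {"iteration": 1}], metadata={"iteration": 1}
)
```

Passing metadata explicitly is the way to combine objects whose metadata differ; leaving it as None makes a mismatch an error.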

filter(filter_criteria)[source]

Filter the data object according to the given filter criteria

Parameters: filter_criteria (dict) – The filter criteria. None means no filter
Returns: The filtered data object
Return type: pe.data.Data

filter_label_id(label_id)[source]

Filter the data frame according to a label id

Parameters: label_id (int) – The label id that is used to filter the data frame
Returns: pe.data.Data object with the filtered data frame
Return type: pe.data.Data

load_checkpoint(path)[source]

Load data from a checkpoint

Parameters: path (str) – The folder that contains the checkpoint
Returns: Whether the checkpoint is loaded successfully
Return type: bool

merge(data)[source]

Merge the data object with another data object

Parameters: data (pe.data.Data) – The data object to merge
Raises: ValueError – If the metadata of data is not the same as the metadata of the current object
Returns: The merged data object
Return type: pe.data.Data

random_split(num_samples_list, seed=0)[source]

Randomly split the data frame into multiple data frames

Parameters:

num_samples_list (list[int]) – The list of numbers of samples for each data frame
seed (int, optional) – The seed for the random number generator, defaults to 0

Raises:

ValueError – If the sum of num_samples_list is not equal to the number of samples

Returns:

The list of pe.data.Data objects with the split data

Return type:

list[pe.data.Data]
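
The contract of random_split (sizes must add up to the number of samples, and a fixed seed gives a reproducible split) can be sketched without the library. The helper name random_split_indices is hypothetical, and Python's random module is used here as a stand-in; the random number generator that pe actually uses is not stated in this reference.

```python
import random


def random_split_indices(num_rows, num_samples_list, seed=0):
    """Return one list of row indices per requested split (hypothetical helper)."""
    if sum(num_samples_list) != num_rows:
        raise ValueError(
            "The sum of num_samples_list must equal the number of samples"
        )
    indices = list(range(num_rows))
    # A seeded generator makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)
    splits, start = [], 0
    for n in num_samples_list:
        splits.append(indices[start:start + n])
        start += n
    return splits


parts = random_split_indices(10, [7, 2, 1], seed=0)
```

Each index lands in exactly one part, so the parts are disjoint and together cover all rows.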

random_truncate(num_samples)[source]

Randomly truncate the data frame to a certain number of samples

Parameters: num_samples (int) – The number of samples to randomly truncate
Returns: A new pe.data.Data object with the randomly truncated data frame
Return type: pe.data.Data

reset_index(**kwargs)[source]

Reset the index of the data frame

Parameters: kwargs (dict) – The keyword arguments to pass to the pandas reset_index function
Returns: A new pe.data.Data object with the reset index data frame
Return type: pe.data.Data

save_checkpoint(path)[source]

Save the data to a checkpoint.

Parameters:

path (str) – The folder to save the checkpoint

Raises:

ValueError – If the path is None
ValueError – If the data frame is empty

set_label_id(label_id)[source]

Set the label id for the data frame

Parameters: label_id (int) – The label id to set

split_by_client()[source]

Split the data frame by client ID

Raises: ValueError – If the client ID column is not in the data frame
Returns: The list of data objects with the split data
Return type: list[pe.data.Data]
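
As a rough illustration of splitting by client ID, the sketch below groups plain row dicts and raises when the client ID column is missing. The column name "client_id" and the helper split_rows_by_client are placeholders chosen for this example; the actual column name used by pe is not given in this reference.

```python
from collections import defaultdict

# Assumption: placeholder column name, not the library's own constant.
CLIENT_ID_COLUMN = "client_id"


def split_rows_by_client(rows):
    """Group row dicts by client ID; raise if the column is missing."""
    groups = defaultdict(list)
    for row in rows:
        if CLIENT_ID_COLUMN not in row:
            raise ValueError(f"Column {CLIENT_ID_COLUMN!r} is not in the data")
        groups[row[CLIENT_ID_COLUMN]].append(row)
    # One list of rows per client, in order of first appearance.
    return list(groups.values())


rows = [
    {"client_id": "a", "text": "x"},
    {"client_id": "b", "text": "y"},
    {"client_id": "a", "text": "z"},
]
groups = split_rows_by_client(rows)
```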

split_by_index()[source]

Split the data frame by index

Returns: The list of data objects with the split data
Return type: list[pe.data.Data]

truncate(num_samples)[source]

Truncate the data frame to a certain number of samples

Parameters: num_samples (int) – The number of samples to truncate
Returns: A new pe.data.Data object with the truncated data frame
Return type: pe.data.Data

class pe.data.DigiFace1M(root_dir, res=32, num_processes=50, chunksize=1000)[source]

Bases: Data

The DigiFace1M dataset (https://github.com/microsoft/DigiFace1M/).

__init__(root_dir, res=32, num_processes=50, chunksize=1000)[source]

Constructor.

Parameters:

root_dir (str) – The root directory of the dataset
res (int, optional) – The resolution of the images, defaults to 32
num_processes (int, optional) – The number of processes to use for parallel processing, defaults to 50
chunksize (int, optional) – The chunk size to use for parallel processing, defaults to 1000

Raises:

ValueError – If the number of images in root_dir is not 1,219,995

_read_and_process_image(path)[source]

Read and process an image.

Parameters: path (str) – The path to the image
Returns: The processed image
Return type: np.ndarray

class pe.data.MNIST(split='train')[source]

Bases: Data

The MNIST dataset.

__init__(split='train')[source]

Constructor.

Parameters: split (str, optional) – The split of the dataset. It should be either “train” or “test”, defaults to “train”
Raises: ValueError – If the split is invalid

class pe.data.OpenReview(root_dir='data', split='train', **kwargs)[source]

Bases: TextCSV

The OpenReview dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).

DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_test.csv', 'direct'), 'train': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_train.csv', 'direct'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_val.csv', 'direct')}: The download information for the OpenReview dataset.

__init__(root_dir='data', split='train', **kwargs)[source]

Constructor.

Parameters:

root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”

_download(download_info, data_path, processed_data_path)[source]

Download the dataset.

Parameters:

download_info (pe.data.text.openreview.DownloadInfo) – The download information
data_path (str) – The path to the raw data
processed_data_path (str) – The path to the processed data

Raises:

ValueError – If the download type is unknown

class pe.data.PubMed(root_dir='data', split='train', **kwargs)[source]

Bases: TextCSV

The PubMed dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).

DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/pubmed/test.csv', 'direct'), 'train': ('https://drive.google.com/uc?id=12-zV93MQNPvM_ORUoahZ2n4odkkOXD-r', 'gdown'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/pubmed/dev.csv', 'direct')}: The download information for the PubMed dataset.

__init__(root_dir='data', split='train', **kwargs)[source]

Constructor.

Parameters:

root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”

_download(download_info, data_path)[source]

Download the dataset.

Parameters:

download_info (pe.data.text.pubmed.DownloadInfo) – The download information
data_path (str) – The path to the raw data

Raises:

ValueError – If the download type is unknown
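
The entries of DOWNLOAD_INFO_DICT pair a URL with a download type ('direct' or 'gdown'), and an unknown type raises ValueError. A minimal sketch of that dispatch follows. It performs no network access; the namedtuple field names, the returned descriptions, and the example.com URL are assumptions made for illustration and are not taken from the library.

```python
from collections import namedtuple

# Assumption: field names are illustrative; the reference only shows (url, type) pairs.
DownloadInfo = namedtuple("DownloadInfo", ["url", "type"])


def choose_downloader(download_info):
    """Describe the download strategy for an entry without downloading anything."""
    if download_info.type == "direct":
        return "plain HTTP download"
    if download_info.type == "gdown":
        return "Google Drive download via gdown"
    raise ValueError(f"Unknown download type: {download_info.type}")


# Placeholder URL, not a real dataset location.
info = DownloadInfo("https://example.com/train.csv", "direct")
strategy = choose_downloader(info)
```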

class pe.data.TextCSV(csv_path, label_columns=[], text_column='text', num_samples=None)[source]

Bases: Data

The text dataset in CSV format.

__init__(csv_path, label_columns=[], text_column='text', num_samples=None)[source]

Constructor.

Parameters:

csv_path (str) – The path to the CSV file
label_columns (list, optional) – The names of the columns that contain the labels, defaults to []
text_column (str, optional) – The name of the column that contains the text data, defaults to “text”
num_samples (int, optional) – The number of samples to load from the CSV file. If None, load all samples. Defaults to None

Raises:

ValueError – If the label columns or text column does not exist in the CSV file
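
The column check behind the ValueError above can be sketched with the standard csv module. The helper check_columns is hypothetical and only inspects the header row; how TextCSV itself reads the file is not described in this reference.

```python
import csv
import io


def check_columns(csv_file, label_columns=(), text_column="text"):
    """Validate that the required columns appear in the CSV header."""
    header = next(csv.reader(csv_file), [])
    missing = [c for c in [*label_columns, text_column] if c not in header]
    if missing:
        raise ValueError(f"Columns not found in the CSV file: {missing}")
    return header


# In-memory stand-in for a CSV file on disk.
sample = io.StringIO("text,label1,label2\ngreat food,5,restaurant\n")
header = check_columns(sample, label_columns=["label1", "label2"])
```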

class pe.data.Yelp(root_dir='data', split='train', **kwargs)[source]

Bases: TextCSV

The Yelp dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).

DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/test.csv', 'direct'), 'train': ('https://drive.google.com/uc?id=1epLuBxCk5MGnm1GiIfLcTcr-tKgjCrc2', 'gdown'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/dev.csv', 'direct')}: The download information for the Yelp dataset.

__init__(root_dir='data', split='train', **kwargs)[source]

Constructor.

Parameters:

root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”

_download(download_info, data_path, processed_data_path)[source]

Download the dataset.

Parameters:

download_info (pe.data.text.yelp.DownloadInfo) – The download information
data_path (str) – The path to the raw data
processed_data_path (str) – The path to the processed data

Raises:

ValueError – If the download type is unknown

pe.data.load_image_folder(path, image_size, class_cond=True, num_images=-1, num_workers=10, batch_size=1000)[source]

Load an image dataset from a folder that contains image files. The folder can be nested arbitrarily. The image file names must be in the format of “{class_name without ‘_’}_{suffix in any string}.ext”. The “ext” can be “jpg”, “jpeg”, “png”, or “gif”. The class names will be extracted from the file names before the first “_”. If class_cond is False, the class names will be ignored and all images will be treated as the same class with class name “None”.

Parameters:

path (str) – The path to the root folder that contains the image files
image_size (int) – The size of the images. Images will be resized to this size
class_cond (bool, optional) – Whether to treat the loaded dataset as class conditional, defaults to True
num_images (int, optional) – The number of images to load. If -1, load all images. Defaults to -1
num_workers (int, optional) – The number of workers to use for loading the images, defaults to 10
batch_size (int, optional) – The batch size to use for loading the images, defaults to 1000

Returns:

The loaded data

Return type:

pe.data.Data
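
The file naming convention expected by load_image_folder can be illustrated with a short standalone sketch. Both helpers are hypothetical, and the extension match here is exact; how the library treats upper-case extensions is not stated in this reference.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "gif"}


def is_image_file(path):
    """Check whether the file extension is one of the documented image types."""
    return Path(path).suffix.lstrip(".") in ALLOWED_EXTENSIONS


def class_name_from_path(path, class_cond=True):
    """Return the class name encoded in the file name before the first '_'."""
    if not class_cond:
        # Without class conditioning, every image shares the class name "None".
        return "None"
    return Path(path).name.split("_")[0]


label = class_name_from_path("root/nested/cat_001.jpg")
unconditional = class_name_from_path("root/nested/cat_001.jpg", class_cond=False)
```

Because only the text before the first "_" is used, a class name itself cannot contain an underscore, while the suffix may.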

Subpackages

Submodules

pe.data.data module
- Data