pe.data package
- class pe.data.Camelyon17(split='train', root_dir='data', res=64)[source]
Bases:
Data
The Camelyon17 dataset.
- __init__(split='train', root_dir='data', res=64)[source]
Constructor.
- Parameters:
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”
root_dir (str, optional) – The root directory to save the dataset, defaults to “data”
res (int, optional) – The resolution of the images, defaults to 64
- Raises:
ValueError – If the split is invalid
- class pe.data.Cat(root_dir='data', res=512)[source]
Bases:
Data
The Cat dataset.
- URL = 'https://www.kaggle.com/api/v1/datasets/download/fjxmlzn/cat-cookie-doudou'
The URL of the dataset
- class pe.data.CelebA(root_dir, res=32, attr_name='Male', split='train')[source]
Bases:
Data
The CelebA dataset.
- __init__(root_dir, res=32, attr_name='Male', split='train')[source]
Constructor.
- Parameters:
root_dir (str) – The root directory of the CelebA dataset
res (int, optional) – The resolution of the image, defaults to 32
attr_name (str, optional) – The attribute name to use as the label, defaults to “Male”
split (str, optional) – The split of the dataset, default is “train”
- class pe.data.Data(data_frame=None, metadata={})[source]
Bases:
object
The class that holds the private data or synthetic data from PE.
- __init__(data_frame=None, metadata={})[source]
Constructor.
- Parameters:
data_frame (pandas.DataFrame, optional) – A pandas dataframe that holds the data, defaults to None
metadata (dict, optional) – The metadata of the data, defaults to {}
- classmethod concat(data_list, metadata=None)[source]
Concatenate the data frames of a list of data objects
- Parameters:
data_list (list[pe.data.Data]) – The list of data objects to concatenate
metadata (dict, optional) – The metadata of the concatenated data. When None, the metadata of the list of data objects must be the same and will be used. Defaults to None
- Raises:
ValueError – If the metadata of the data objects are not the same
- Returns:
The concatenated data object
- Return type:
pe.data.Data
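The metadata-matching rule described above can be illustrated with a minimal sketch. Plain dictionaries stand in for pe.data.Data objects and pandas data frames, and concat_sketch is a hypothetical helper, not the library implementation:

```python
def concat_sketch(data_list, metadata=None):
    # When no metadata is given, all inputs must share identical metadata.
    if metadata is None:
        first = data_list[0]["metadata"]
        if any(d["metadata"] != first for d in data_list[1:]):
            raise ValueError("Metadata of the data objects are not the same")
        metadata = first
    # Concatenate the rows of every input into one new object.
    rows = [row for d in data_list for row in d["rows"]]
    return {"rows": rows, "metadata": metadata}
```

Passing an explicit metadata dict skips the consistency check, mirroring the documented behavior when metadata is not None.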
- filter(filter_criteria)[source]
Filter the data object according to a filter criteria
- Parameters:
filter_criteria (dict) – The filter criteria. None means no filter
- Returns:
The filtered data object
- Return type:
pe.data.Data
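The filter criteria are not specified in detail above; a natural reading is a dict mapping column names to required values, with None meaning no filter. The sketch below illustrates that interpretation over a list of row dicts (an assumption, not the library's implementation):

```python
def filter_sketch(rows, filter_criteria):
    # None means no filter: return all rows unchanged.
    if filter_criteria is None:
        return list(rows)
    # Keep only rows whose values match every (column, value) pair.
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in filter_criteria.items())
    ]
```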
- filter_label_id(label_id)[source]
Filter the data frame according to a label id
- Parameters:
label_id (int) – The label id that is used to filter the data frame
- Returns:
The pe.data.Data object with the filtered data frame
- Return type:
pe.data.Data
- load_checkpoint(path)[source]
Load data from a checkpoint
- Parameters:
path (str) – The folder that contains the checkpoint
- Returns:
Whether the checkpoint is loaded successfully
- Return type:
bool
- merge(data)[source]
Merge the data object with another data object
- Parameters:
data (pe.data.Data) – The data object to merge
- Raises:
ValueError – If the metadata of data is not the same as the metadata of the current object
- Returns:
The merged data object
- Return type:
pe.data.Data
- random_split(num_samples_list, seed=0)[source]
Randomly split the data frame into multiple data frames
- Parameters:
num_samples_list (list[int]) – The list of numbers of samples for each data frame
seed (int, optional) – The seed for the random number generator, defaults to 0
- Raises:
ValueError – If the sum of num_samples_list is not equal to the number of samples
- Returns:
The list of pe.data.Data objects with the split data
- Return type:
list[pe.data.Data]
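The split semantics above (sizes must sum to the total, splitting is seeded) can be sketched with stdlib tools; random_split_sketch is a hypothetical helper operating on a plain list of rows, not the library code:

```python
import random

def random_split_sketch(rows, num_samples_list, seed=0):
    # The requested sizes must sum to the total number of samples.
    if sum(num_samples_list) != len(rows):
        raise ValueError("Sum of num_samples_list does not equal the number of samples")
    # Shuffle once with a fixed seed, then carve out consecutive chunks.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    splits, start = [], 0
    for n in num_samples_list:
        splits.append(shuffled[start:start + n])
        start += n
    return splits
```

Using the same seed reproduces the same split, which is why the seed parameter defaults to a fixed value rather than to system entropy.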
- random_truncate(num_samples)[source]
Randomly truncate the data frame to a certain number of samples
- Parameters:
num_samples (int) – The number of samples to randomly truncate the data frame to
- Returns:
A new pe.data.Data object with the randomly truncated data frame
- Return type:
pe.data.Data
- reset_index(**kwargs)[source]
Reset the index of the data frame
- Parameters:
kwargs (dict) – The keyword arguments to pass to the pandas reset_index function
- Returns:
A new pe.data.Data object with the index of the data frame reset
- Return type:
pe.data.Data
- save_checkpoint(path)[source]
Save the data to a checkpoint.
- Parameters:
path (str) – The folder to save the checkpoint
- Raises:
ValueError – If the path is None
ValueError – If the data frame is empty
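The checkpoint pair above (save raises on a None path or empty data; load reports success as a bool) can be mimicked with a small sketch. The on-disk layout here (a single data.json file inside the checkpoint folder) is an assumption for illustration only, not the library's actual format:

```python
import json
import os

def save_checkpoint_sketch(rows, path):
    # Mirror the documented error cases: a None path and empty data are invalid.
    if path is None:
        raise ValueError("Path is None")
    if not rows:
        raise ValueError("Data frame is empty")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "data.json"), "w") as f:
        json.dump(rows, f)

def load_checkpoint_sketch(path):
    # Return the rows plus True on success, or None plus False when
    # no checkpoint exists, matching the documented bool return.
    file_path = os.path.join(path, "data.json")
    if not os.path.isfile(file_path):
        return None, False
    with open(file_path) as f:
        return json.load(f), True
```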
- set_label_id(label_id)[source]
Set the label id for the data frame
- Parameters:
label_id (int) – The label id to set
- split_by_client()[source]
Split the data frame by client ID
- Raises:
ValueError – If the client ID column is not in the data frame
- Returns:
The list of data objects with the split data
- Return type:
list[pe.data.Data]
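Splitting by client ID amounts to grouping rows by a client column and raising when that column is absent. The sketch below assumes a column named "client_id" and plain row dicts; both are illustrative assumptions, not the library's actual column name or data representation:

```python
def split_by_client_sketch(rows, client_column="client_id"):
    # Raise if the client ID column is missing from the data.
    if any(client_column not in row for row in rows):
        raise ValueError(f"Column {client_column} is not in the data frame")
    # Group rows into one list per distinct client ID, preserving
    # first-seen order of the clients.
    groups = {}
    for row in rows:
        groups.setdefault(row[client_column], []).append(row)
    return list(groups.values())
```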
- split_by_index()[source]
Split the data frame by index
- Returns:
The list of data objects with the split data
- Return type:
list[pe.data.Data]
- truncate(num_samples)[source]
Truncate the data frame to a certain number of samples
- Parameters:
num_samples (int) – The number of samples to truncate the data frame to
- Returns:
A new pe.data.Data object with the truncated data frame
- Return type:
pe.data.Data
- class pe.data.DigiFace1M(root_dir, res=32, num_processes=50, chunksize=1000)[source]
Bases:
Data
The DigiFace1M dataset (https://github.com/microsoft/DigiFace1M/).
- __init__(root_dir, res=32, num_processes=50, chunksize=1000)[source]
Constructor.
- Parameters:
root_dir (str) – The root directory of the dataset
res (int, optional) – The resolution of the images, defaults to 32
num_processes (int, optional) – The number of processes to use for parallel processing, defaults to 50
chunksize (int, optional) – The chunk size to use for parallel processing, defaults to 1000
- Raises:
ValueError – If the number of images in root_dir is not 1,219,995
- class pe.data.OpenReview(root_dir='data', split='train', **kwargs)[source]
Bases:
TextCSV
The OpenReview dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).
- DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_test.csv', 'direct'), 'train': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_train.csv', 'direct'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/openreview/iclr23_reviews_val.csv', 'direct')}
The download information for the OpenReview dataset.
- __init__(root_dir='data', split='train', **kwargs)[source]
Constructor.
- Parameters:
root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”
- _download(download_info, data_path, processed_data_path)[source]
Download the dataset.
- Parameters:
download_info (pe.data.text.openreview.DownloadInfo) – The download information
data_path (str) – The path to the raw data
processed_data_path (str) – The path to the processed data
- Raises:
ValueError – If the download type is unknown
- class pe.data.PubMed(root_dir='data', split='train', **kwargs)[source]
Bases:
TextCSV
The PubMed dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).
- DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/pubmed/test.csv', 'direct'), 'train': ('https://drive.google.com/uc?id=12-zV93MQNPvM_ORUoahZ2n4odkkOXD-r', 'gdown'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/pubmed/dev.csv', 'direct')}
The download information for the PubMed dataset.
- __init__(root_dir='data', split='train', **kwargs)[source]
Constructor.
- Parameters:
root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”
- _download(download_info, data_path)[source]
Download the dataset.
- Parameters:
download_info (pe.data.text.pubmed.DownloadInfo) – The download information
data_path (str) – The path to the raw data
- Raises:
ValueError – If the download type is unknown
- class pe.data.TextCSV(csv_path, label_columns=[], text_column='text', num_samples=None)[source]
Bases:
Data
The text dataset in CSV format.
- __init__(csv_path, label_columns=[], text_column='text', num_samples=None)[source]
Constructor.
- Parameters:
csv_path (str) – The path to the CSV file
label_columns (list, optional) – The names of the columns that contain the labels, defaults to []
text_column (str, optional) – The name of the column that contains the text data, defaults to “text”
num_samples (int, optional) – The number of samples to load from the CSV file. If None, load all samples. Defaults to None
- Raises:
ValueError – If the label columns or text column does not exist in the CSV file
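The column validation described above can be sketched with the stdlib csv module. This is a simplified stand-in (returning plain dicts rather than a Data object), and load_text_csv_sketch is a hypothetical name, not the library implementation:

```python
import csv

def load_text_csv_sketch(csv_file, label_columns=(), text_column="text", num_samples=None):
    # Read rows and validate that every requested column exists.
    reader = csv.DictReader(csv_file)
    missing = [c for c in [*label_columns, text_column] if c not in reader.fieldnames]
    if missing:
        raise ValueError(f"Columns {missing} do not exist in the CSV file")
    rows = [
        {"text": row[text_column], "labels": {c: row[c] for c in label_columns}}
        for row in reader
    ]
    # num_samples=None loads everything; otherwise keep the first N rows.
    return rows if num_samples is None else rows[:num_samples]
```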
- class pe.data.Yelp(root_dir='data', split='train', **kwargs)[source]
Bases:
TextCSV
The Yelp dataset in the ICML 2024 Spotlight paper, “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).
- DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/test.csv', 'direct'), 'train': ('https://drive.google.com/uc?id=1epLuBxCk5MGnm1GiIfLcTcr-tKgjCrc2', 'gdown'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/dev.csv', 'direct')}
The download information for the Yelp dataset.
- __init__(root_dir='data', split='train', **kwargs)[source]
Constructor.
- Parameters:
root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”
split (str, optional) – The split of the dataset. It should be either “train”, “val”, or “test”, defaults to “train”
- _download(download_info, data_path, processed_data_path)[source]
Download the dataset.
- Parameters:
download_info (pe.data.text.yelp.DownloadInfo) – The download information
data_path (str) – The path to the raw data
processed_data_path (str) – The path to the processed data
- Raises:
ValueError – If the download type is unknown
- pe.data.load_image_folder(path, image_size, class_cond=True, num_images=-1, num_workers=10, batch_size=1000)[source]
Load an image dataset from a folder that contains image files. The folder can be nested arbitrarily. The image file names must be in the format “{class_name without ‘_’}_{suffix in any string}.ext”, where “ext” can be “jpg”, “jpeg”, “png”, or “gif”. The class name is extracted from the part of the file name before the first “_”. If class_cond is False, the class names are ignored and all images are treated as a single class with class name “None”.
- Parameters:
path (str) – The path to the root folder that contains the image files
image_size (int) – The size of the images. Images will be resized to this size
class_cond (bool, optional) – Whether to treat the loaded dataset as class conditional, defaults to True
num_images (int, optional) – The number of images to load. If -1, load all images. Defaults to -1
num_workers (int, optional) – The number of workers to use for loading the images, defaults to 10
batch_size (int, optional) – The batch size to use for loading the images, defaults to 1000
- Returns:
The loaded data
- Return type:
pe.data.Data
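The file-name convention above determines each image's class. A minimal sketch of that extraction rule (a hypothetical helper, not the loader itself):

```python
import os

def class_name_from_path(file_path, class_cond=True):
    # The class name is the part of the file name before the first "_";
    # when class_cond is False, every image gets the class name "None".
    if not class_cond:
        return "None"
    file_name = os.path.basename(file_path)
    return os.path.splitext(file_name)[0].split("_", 1)[0]
```

For example, a file named “cat_0001.jpg” anywhere under the root folder would be assigned the class “cat”.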
Subpackages
- pe.data.image package
- pe.data.text package