Data

API reference: pe.data package.

pe.data.Data is the base class for holding synthetic or private samples along with their metadata. Components mostly communicate with one another through objects of this class. pe.data.Data has two key attributes:

data_frame: A pandas DataFrame that holds the samples. Each row is a sample. The columns hold the parts of the sample (e.g., the image, the text, the label) as well as other information about it (e.g., its embedding produced by Embeddings).
metadata: An OmegaConf object that holds the metadata of the samples, such as the Private Evolution iteration at which the samples were generated and the label names of the classes.

Available Datasets

For convenience, some well-known datasets are already packaged as pe.data.Data classes:

Image datasets
- pe.data.Cifar10: The CIFAR10 dataset.
- pe.data.Camelyon17: The Camelyon17 dataset.
- pe.data.Cat: The Cat dataset.
- pe.data.MNIST: The MNIST dataset.
- pe.data.CelebA: The CelebA dataset.
- pe.data.DigiFace1M: The DigiFace1M dataset.
- In addition, you can load a custom image dataset from a (possibly nested) directory of image files using pe.data.load_image_folder().
Text datasets
- pe.data.Yelp: The Yelp dataset.
- pe.data.OpenReview: The OpenReview dataset.
- pe.data.PubMed: The PubMed dataset.
- In addition, you can easily load a custom text dataset from a CSV file using pe.data.TextCSV.

Using Your Own Datasets

To apply Private Evolution to your own private dataset, create a pe.data.Data object that holds it, passing two parameters, data_frame and metadata, to the constructor:

data_frame: A pandas DataFrame that holds the samples. Each row in the DataFrame is a sample. The following columns must be included:
- pe.constant.data.LABEL_ID_COLUMN_NAME: The label (class) ID of the sample. The label IDs must be in {0, 1, …, K-1} if there are K classes. If you are targeting unconditional generation, the values of this column can just be zeros.
The data_frame can have any number of additional columns that hold the data of the samples, as long as the modules you are using (e.g., APIs, Callbacks) can recognize them.
metadata: A dictionary that holds the metadata of the samples. The following keys must be included:
- label_info: A list of dictionaries that hold the information of the classes. The length of the list must be equal to K. Each dictionary must include the key name, whose value is the name of the class.
In addition, you can include any other keys that hold the metadata of the samples if needed.
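
The requirements above can be sketched as follows. This is a minimal illustration, not the library's own code: the "text" column, the class names, and the validate_pe_inputs helper are hypothetical, and the fallback value for LABEL_ID_COLUMN_NAME is a placeholder used only when pe is not installed. The pe.data.Data call assumes the constructor accepts data_frame and metadata as keyword arguments, as described above.

```python
import pandas as pd

try:
    from pe.data import Data
    from pe.constant.data import LABEL_ID_COLUMN_NAME
except ImportError:
    # Fallback so the sketch runs without the library installed.
    # The string below is a placeholder, NOT the library's actual value.
    Data = None
    LABEL_ID_COLUMN_NAME = "label_id"

# Hypothetical two-class text dataset (K = 2).
label_names = ["negative", "positive"]
data_frame = pd.DataFrame(
    {
        "text": ["too slow", "great food", "would return"],  # hypothetical column
        LABEL_ID_COLUMN_NAME: [0, 1, 1],
    }
)
metadata = {"label_info": [{"name": name} for name in label_names]}


def validate_pe_inputs(data_frame, metadata, label_column=LABEL_ID_COLUMN_NAME):
    """Hypothetical helper: check the documented requirements and return K."""
    if label_column not in data_frame.columns:
        raise ValueError(f"missing required column: {label_column}")
    label_info = metadata.get("label_info")
    if not isinstance(label_info, list) or len(label_info) == 0:
        raise ValueError("metadata['label_info'] must be a non-empty list")
    if any("name" not in info for info in label_info):
        raise ValueError("every label_info entry needs a 'name' key")
    num_classes = len(label_info)
    # Label IDs must lie in {0, 1, ..., K-1}.
    if not data_frame[label_column].isin(set(range(num_classes))).all():
        raise ValueError(f"label IDs must be in 0..{num_classes - 1}")
    return num_classes


num_classes = validate_pe_inputs(data_frame, metadata)

if Data is not None:
    # Assumed keyword arguments, following the parameter names given above.
    private_data = Data(data_frame=data_frame, metadata=metadata)
```

For unconditional generation, the same sketch applies with a single entry in label_info and a label column filled with zeros.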