Data ==== API reference: :doc:`/api/pe.data`. :py:class:`pe.data.Data` is the base class for holding the synthetic samples or the private samples, along with their metadata. Different components are mostly communicated through objects of this class. :py:class:`pe.data.Data` has two key attributes: * ``data_frame``: A pandas_ DataFrame that holds the samples. Each row in the DataFrame is a sample, and each column is part of the sample (e.g., the image, the text, the label) and other information of the sample (e.g., its embedding produced by :doc:`embedding`). * ``metadata``: A OmegaConf_ that holds the metadata of the samples, such as the **Private Evolution** iteration number when the samples are generated, and the label names of the classes. Available Datasets ------------------ For convenience, some well-known datasets are already packaged as :py:class:`pe.data.Data` classes: * Image datasets * :py:class:`pe.data.Cifar10`: The `CIFAR10 dataset`_. * :py:class:`pe.data.Camelyon17`: The `Camelyon17 dataset`_. * :py:class:`pe.data.Cat`: The `Cat dataset`_. * :py:class:`pe.data.MNIST`: The `MNIST dataset`_. * :py:class:`pe.data.CelebA`: The `CelebA dataset`_. * :py:class:`pe.data.DigiFace1M`: The `DigiFace1M dataset`_. * In addition, you can easily load a custom image dataset from a (nested) directory with the image files using :py:meth:`pe.data.load_image_folder`. * Text datasets * :py:class:`pe.data.Yelp`: The `Yelp dataset`_. * :py:class:`pe.data.OpenReview`: The `OpenReview dataset`_. * :py:class:`pe.data.PubMed`: The `PubMed dataset`_. * In addition, you can easily load a custom text dataset from a CSV file using :py:class:`pe.data.TextCSV`. Using Your Own Datasets ----------------------- To apply **Private Evolution** to your own private dataset, you need to create a :py:class:`pe.data.Data` object that holds your dataset, with two parameters, ``data_frame`` and ``metadata``, passed to the constructor: * ``data_frame``: A pandas_ DataFrame that holds the samples. Each row in the DataFrame is a sample. The following columns must be included: * :py:attr:`pe.constant.data.LABEL_ID_COLUMN_NAME`: The label (class) ID of the sample. The label IDs must be in {0, 1, ..., K-1} if there are K classes. If you are targeting unconditional generation, the values of this column can just be zeros. The ``data_frame`` can have any numbers of additional columns that hold the data of the samples, as long as the modules you are using (e.g., :doc:`api`, :doc:`Callbacks `) can recognize them. * ``metadata``: A dictionary that holds the metadata of the samples. The following keys must be included: * ``label_info``: A list of dictionaries that hold the information of the classes. The length of the list must be equal to K. Inside each dictionary, a key ``name`` must be included, which refers to the name of the class. In addition, you can include any other keys that hold the metadata of the samples if needed. .. _OmegaConf: https://omegaconf.readthedocs.io/en/latest/ .. _pandas: https://pandas.pydata.org/ .. _Cat dataset: https://www.kaggle.com/datasets/fjxmlzn/cat-cookie-doudou .. _CIFAR10 dataset: https://www.cs.toronto.edu/~kriz/cifar.html .. _Camelyon17 dataset: https://camelyon17.grand-challenge.org/ .. _MNIST dataset: https://pytorch.org/vision/main/generated/torchvision.datasets.MNIST.html .. _CelebA dataset: https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html .. _DigiFace1M dataset: https://github.com/microsoft/DigiFace1M .. _Yelp dataset: https://github.com/AI-secure/aug-pe/tree/main/data .. _OpenReview dataset: https://github.com/AI-secure/aug-pe/tree/main/data .. _PubMed dataset: https://github.com/AI-secure/aug-pe/tree/main/data