pe.data.text.yelp module

namedtuple pe.data.text.yelp.DownloadInfo(url, type)

Bases: namedtuple()

DownloadInfo(url, type)

Fields:
  1.  url – Alias for field number 0

  2.  type – Alias for field number 1

class pe.data.text.yelp.Yelp(root_dir='data', split='train', **kwargs)

Bases: TextCSV

The Yelp dataset used in the ICML 2024 Spotlight paper “Differentially Private Synthetic Data via Foundation Model APIs 2: Text” (https://arxiv.org/abs/2403.01749).

DOWNLOAD_INFO_DICT = {'test': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/test.csv', 'direct'), 'train': ('https://drive.google.com/uc?id=1epLuBxCk5MGnm1GiIfLcTcr-tKgjCrc2', 'gdown'), 'val': ('https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/dev.csv', 'direct')}

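The download information for the Yelp dataset. Each split name (“train”, “val”, “test”) maps to a (url, type) pair, where type is either “direct” or “gdown”.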
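As an illustration of how this mapping can be read, the sketch below copies the documented dictionary and wraps each entry in a stand-in namedtuple with the same fields as DownloadInfo. The helper name get_download_info is hypothetical and is not part of the pe package; the stand-in DownloadInfo only mirrors the documented signature.

```python
from collections import namedtuple

# Stand-in with the documented fields; the real class is pe.data.text.yelp.DownloadInfo.
DownloadInfo = namedtuple("DownloadInfo", ["url", "type"])

_AUG_PE = "https://raw.githubusercontent.com/AI-secure/aug-pe/bca21c90921bd1151aa7627e676c906165e205a0/data/yelp/"

# Values copied from the documented DOWNLOAD_INFO_DICT.
DOWNLOAD_INFO_DICT = {
    "test": (_AUG_PE + "test.csv", "direct"),
    "train": ("https://drive.google.com/uc?id=1epLuBxCk5MGnm1GiIfLcTcr-tKgjCrc2", "gdown"),
    "val": (_AUG_PE + "dev.csv", "direct"),
}


def get_download_info(split):
    """Hypothetical helper: return the entry for `split` as a DownloadInfo."""
    # Unpacking works whether the stored value is a plain tuple or a namedtuple.
    return DownloadInfo(*DOWNLOAD_INFO_DICT[split])


info = get_download_info("val")
# Fields are reachable by name or by position (field 0 is url, field 1 is type).
same_url = info.url == info[0]
types_by_split = {s: get_download_info(s).type for s in sorted(DOWNLOAD_INFO_DICT)}
```

Note that the “val” split points at a file named dev.csv in the source repository, while only the “train” split is fetched through Google Drive.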

__init__(root_dir='data', split='train', **kwargs)

Constructor.

Parameters:
  • root_dir (str, optional) – The root directory of the dataset. If the dataset is not there, it will be downloaded automatically. Defaults to “data”.

  • split (str, optional) – The split of the dataset. Must be one of “train”, “val”, or “test”. Defaults to “train”.
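A minimal usage sketch follows, assuming the pe package is installed. The wrapper load_yelp and the up-front split check are hypothetical additions for illustration; whether the constructor itself validates the split is not stated here. Only the root_dir and split arguments come from the documented signature.

```python
VALID_SPLITS = ("train", "val", "test")  # the three documented splits


def load_yelp(split="train", root_dir="data"):
    """Hypothetical wrapper around the documented Yelp constructor."""
    # Check the split first so a typo fails before any download is attempted.
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    # Deferred import: only needed once the arguments are known to be valid.
    from pe.data.text.yelp import Yelp
    return Yelp(root_dir=root_dir, split=split)


# Example (may download the files into ./data on first use):
# val_data = load_yelp("val")
```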

_download(download_info, data_path, processed_data_path)

Download the dataset.

Parameters:
  • download_info (pe.data.text.yelp.DownloadInfo) – The download information

  • data_path (str) – The path to the raw data

  • processed_data_path (str) – The path to the processed data

Raises:

ValueError – If the download type is unknown
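The dispatch on the download type can be sketched as follows. This is not the library's implementation: the function name download_sketch is hypothetical, the use of urllib.request for “direct” and the third-party gdown package for “gdown” is an assumption based on the type names, and the processing step that produces processed_data_path is omitted because it is not described here.

```python
import urllib.request
from collections import namedtuple

# Stand-in with the documented fields; the real class is pe.data.text.yelp.DownloadInfo.
DownloadInfo = namedtuple("DownloadInfo", ["url", "type"])


def download_sketch(download_info, data_path):
    """Fetch download_info.url into data_path, dispatching on download_info.type."""
    if download_info.type == "direct":
        # Plain HTTP(S) download of the raw file.
        urllib.request.urlretrieve(download_info.url, data_path)
    elif download_info.type == "gdown":
        # Google Drive link; gdown is a third-party package (assumed installed).
        import gdown
        gdown.download(download_info.url, data_path)
    else:
        # Mirrors the documented behavior: unknown types raise ValueError.
        raise ValueError(f"Unknown download type: {download_info.type}")
```

Calling download_sketch with a type other than “direct” or “gdown” raises ValueError before any network access happens.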