Creating NLP-based Data#

In this notebook, we will use a dataset provider abstraction that interfaces with Hugging Face’s datasets library, which provides access to a large number of NLP datasets covering tasks such as text classification, question answering, and language modeling, among others.

Loading the Data#

The first step is to create an instance of the HfHubDatasetProvider, which pre-loads the dataset and offers three methods to retrieve its splits: get_train_dataset(), get_val_dataset() and get_test_dataset().

Additionally, a set of optional arguments can be passed to its constructor according to the user’s needs (see the sketch after this list):

  • dataset_config_name: Name of the dataset configuration.

  • data_dir: Path to the data directory.

  • data_files: Path(s) to the data file(s).

  • cache_dir: Path to the read/write cache directory.

  • revision: Version of the dataset to load.
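
For example, arguments such as cache_dir and revision can be supplied when building the provider. The snippet below is only a sketch: the cache path and revision value are placeholders, not values used in this notebook.

from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider

# Placeholder values: point `cache_dir` and `revision` at your own cache and dataset version
dataset_provider_with_args = HfHubDatasetProvider(
    "glue",
    dataset_config_name="sst2",
    cache_dir="~/.cache/huggingface/datasets",
    revision="main",
)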

[1]:
from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider

dataset_provider = HfHubDatasetProvider("glue", dataset_config_name="sst2")

# When loading `train_dataset`, we will override the split argument to only load 1%
# of the data and speed up its encoding
train_dataset = dataset_provider.get_train_dataset(split="train[:1%]")
val_dataset = dataset_provider.get_val_dataset()
print(train_dataset, val_dataset)
Found cached dataset glue (C:/Users/gderosa/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Found cached dataset glue (C:/Users/gderosa/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 673
}) Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 872
})
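
A test split can be retrieved with the third method, get_test_dataset(). The call below assumes the configured dataset exposes a test split; for GLUE/SST-2 that split exists but is unlabeled, so it is only inspected here.

# Retrieve the test split (unlabeled for GLUE/SST-2)
test_dataset = dataset_provider.get_test_dataset()
print(test_dataset)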

Encoding the Data#

After loading the data, one might need to encode it with a tokenizer to use it in an NLP-based task. Again, Archai offers a set of functions that ease the process.

Inside the archai.datasets.nlp.hf_dataset_provider_utils module, the user can find different tokenization functions (a minimal sketch of the mapping pattern they follow appears after the list), such as:

  • tokenize_dataset: Tokenize a list of examples using a specified tokenizer.

  • tokenize_contiguous_dataset: Tokenize a list of examples using a specified tokenizer and with contiguous-length batches (no truncation or padding).

  • tokenize_nsp_dataset: Tokenize a list of examples using a specified tokenizer and with next-sentence prediction (NSP).
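
To make the pattern behind these helpers concrete, the snippet below hand-rolls a minimal tokenization function and passes it to Dataset.map(), which is exactly how the Archai helpers are used in the next cell. This is an illustrative sketch of the mapping pattern (the function name and its keyword arguments are made up for the example), not Archai’s implementation.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def minimal_tokenize(examples, tokenizer=None, column="sentence"):
    # Batched mapping function: receives a dict of lists and returns the new columns
    return tokenizer(examples[column], truncation=True, padding=True)

encoded = train_dataset.map(minimal_tokenize, batched=True, fn_kwargs={"tokenizer": tokenizer})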

[2]:
from transformers import AutoTokenizer
from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

encoded_train_dataset = train_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer, "mapping_column_name": ["sentence"]})
encoded_val_dataset = val_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer, "mapping_column_name": ["sentence"]})
print(encoded_train_dataset, encoded_val_dataset)
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-c989f437f7c0d7ad.arrow
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-31197ec623723cd1.arrow
Dataset({
    features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
    num_rows: 673
}) Dataset({
    features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
    num_rows: 872
})
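
As a quick sanity check, an encoded example can be decoded back to text with the standard Hugging Face tokenizer API. This is an optional inspection step, not part of the Archai helpers.

# Decode the first encoded training example back to text
print(tokenizer.decode(encoded_train_dataset[0]["input_ids"]))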