Creating NLP-based Data#
In this notebook, we will use a dataset provider-based abstraction that interfaces with Hugging Face's datasets library, which provides access to a large number of NLP datasets, including text classification, question answering, and language modeling, among others.
Loading the Data#
The first step is to create an instance of the HfHubDatasetProvider, which pre-loads the dataset and offers three methods to retrieve its splits: get_train_dataset(), get_val_dataset(), and get_test_dataset().
Additionally, a set of optional arguments can be passed to its constructor according to the user's needs:

- dataset_config_name: Name of the dataset configuration.
- data_dir: Path to the data directory.
- data_files: Path(s) to the data file(s).
- cache_dir: Path to the read/write cache directory.
- revision: Version of the dataset to load.
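As a minimal sketch, these optional arguments are passed as keywords to the constructor; the cache directory and revision values below are placeholders for illustration only:

from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider

# Illustrative only: redirect the cache and pin a dataset revision
# (the path and revision tag are placeholder values)
provider_with_options = HfHubDatasetProvider(
    "glue",
    dataset_config_name="sst2",
    cache_dir="/tmp/hf_datasets_cache",  # read/write cache directory
    revision="main",                     # version of the dataset to load
)

In the cell below, only the dataset name and configuration are needed: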
[1]:
from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
dataset_provider = HfHubDatasetProvider("glue", dataset_config_name="sst2")
# When loading `train_dataset`, we will override the split argument to only load 1%
# of the data and speed up its encoding
train_dataset = dataset_provider.get_train_dataset(split="train[:1%]")
val_dataset = dataset_provider.get_val_dataset()
print(train_dataset, val_dataset)
Found cached dataset glue (C:/Users/gderosa/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Found cached dataset glue (C:/Users/gderosa/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 673
}) Dataset({
features: ['sentence', 'label', 'idx'],
num_rows: 872
})
Encoding the Data#
After loading the data, one might need to encode it with a tokenizer before applying it to an NLP task. Again, Archai offers a set of functions that ease the process.
Inside the archai.datasets.nlp.hf_dataset_provider_utils module, the user can find different tokenization functions, such as:

- tokenize_dataset: Tokenize a list of examples using a specified tokenizer.
- tokenize_contiguous_dataset: Tokenize a list of examples using a specified tokenizer and with contiguous-length batches (no truncation nor padding).
- tokenize_nsp_dataset: Tokenize a list of examples using a specified tokenizer and with next-sentence prediction (NSP).

The example below applies tokenize_dataset; a sketch of the contiguous variant is shown after its output.
[2]:
from transformers import AutoTokenizer
from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 does not define a padding token, so we re-use the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
encoded_train_dataset = train_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer, "mapping_column_name": ["sentence"]})
encoded_val_dataset = val_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer, "mapping_column_name": ["sentence"]})
print(encoded_train_dataset, encoded_val_dataset)
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-c989f437f7c0d7ad.arrow
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\glue\sst2\1.0.0\dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad\cache-31197ec623723cd1.arrow
Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
num_rows: 673
}) Dataset({
features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
num_rows: 872
})
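Finally, tokenize_contiguous_dataset and tokenize_nsp_dataset follow the same map-based pattern. The snippet below is a minimal sketch of the contiguous variant, assuming it accepts the tokenizer, mapping_column_name, and a maximum sequence length keyword (named model_max_length here); verify the actual signature in archai.datasets.nlp.hf_dataset_provider_utils before use. Because contiguous packing changes the number of rows, the original columns are removed during the map:

from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_contiguous_dataset

# Sketch: pack tokens into fixed-length contiguous blocks for language modeling
# (keyword names are assumptions; check the module's signature)
encoded_lm_dataset = train_dataset.map(
    tokenize_contiguous_dataset,
    batched=True,
    remove_columns=train_dataset.column_names,  # row count changes, so drop the original columns
    fn_kwargs={"tokenizer": tokenizer, "mapping_column_name": ["sentence"], "model_max_length": 512},
)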