Creating NLP-based Data#

In this notebook, we will use a dataset provider-based abstraction that interfaces with NVIDIA/DeepLearningExamples to load and encode pre-defined or custom data.

Loading and Encoding the Data#

The first step is to create an instance of the NvidiaDatasetProvider, which pre-loads the dataset and offers three methods to retrieve its splits: get_train_dataset(), get_val_dataset() and get_test_dataset(). This is useful both for loading pre-defined datasets and for loading custom (OLX-prefixed) data, which is composed of raw text files, such as train.txt, valid.txt and test.txt.

Additionally, the NvidiaDatasetProvider already encodes and caches the data with a built-in tokenizer. The following arguments can be adjusted as needed (a usage sketch follows the list):

  • dataset_dir: Dataset folder.

  • cache_dir: Path to the cache folder.

  • vocab_type: Type of vocabulary/tokenizer.

  • vocab_size: Vocabulary size.

  • refresh_cache: Whether cache should be refreshed.
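For instance, the cache location and the tokenizer configuration can be set explicitly at construction time. The snippet below is a minimal sketch and is not executed in this notebook; the argument values are illustrative, vocab_type="gpt2" matches the value reported in the logs further down, and the set of supported vocabulary types and the effect of vocab_size depend on the installed Archai version.

from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider

# Sketch: customizing the encoding step (values are illustrative).
provider = NvidiaDatasetProvider(
    "olx_tmp",                       # dataset name (OLX prefix for custom data)
    dataset_dir="dataroot/olx_tmp",  # folder containing train.txt / valid.txt / test.txt
    cache_dir="cache",               # where the encoded data is cached
    vocab_type="gpt2",               # tokenizer/vocabulary type
    vocab_size=10000,                # illustrative vocabulary size
    refresh_cache=True,              # rebuild the cache from scratch
)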

[1]:
import os
from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider

# In this example, we will create a dummy dataset with 3 splits
os.makedirs("dataroot/olx_tmp", exist_ok=True)
with open("dataroot/olx_tmp/train.txt", "w") as f:
    f.write("train")
with open("dataroot/olx_tmp/valid.txt", "w") as f:
    f.write("valid")
with open("dataroot/olx_tmp/test.txt", "w") as f:
    f.write("test")

dataset_provider = NvidiaDatasetProvider("olx_tmp", dataset_dir="dataroot/olx_tmp", refresh_cache=True)

train_dataset = dataset_provider.get_train_dataset()
val_dataset = dataset_provider.get_val_dataset()
test_dataset = dataset_provider.get_test_dataset()
print(train_dataset, val_dataset, test_dataset)
2023-03-21 15:12:37,792 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO —  Refreshing cache ...
2023-03-21 15:12:37,793 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO —  Clearing and rebuilding cache ...
2023-03-21 15:12:37,794 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO —  Corpus: dataset = olx_tmp | vocab_type = gpt2 | vocab_size = None
2023-03-21 15:12:37,796 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO —  Training vocabulary ...
2023-03-21 15:12:37,797 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — INFO —  Training tokenizer with size = 50257 at c:\Users\gderosa\Projects\archai\docs\getting_started\notebooks\nlp\cache\olx_tmp\gpt2\None\vocab\bbpe_tokenizer.json ...
2023-03-21 15:12:37,798 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — INFO —  Training tokenizer ...
2023-03-21 15:12:37,827 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — DEBUG —  Tokenizer length: 264
2023-03-21 15:12:37,828 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — DEBUG —  Tokenizer file path: c:\Users\gderosa\Projects\archai\docs\getting_started\notebooks\nlp\cache\olx_tmp\gpt2\None\vocab\bbpe_tokenizer.json
2023-03-21 15:12:37,830 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO —  Vocabulary trained.
2023-03-21 15:12:37,831 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO —  Encoding file: dataroot/olx_tmp\train.txt
2023-03-21 15:12:37,835 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO —  Encoding file: dataroot/olx_tmp\valid.txt
2023-03-21 15:12:37,841 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO —  Encoding file: dataroot/olx_tmp\test.txt
2023-03-21 15:12:37,843 - archai.datasets.nlp.nvidia_dataset_provider_utils — DEBUG —  Size: train = 7 | valid = 7 | test = 6
tensor([200, 222,  85,  83,  66,  74,  79]) tensor([200, 222,  87,  66,  77,  74,  69]) tensor([200, 222,  85,  70,  84,  85])
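Once the cache has been built, later runs can reuse it instead of re-tokenizing the raw files. A minimal sketch, assuming the same dataset name, dataset_dir and default cache location as in the cell above:

# Reuse the previously built cache (no re-tokenization).
cached_provider = NvidiaDatasetProvider("olx_tmp", dataset_dir="dataroot/olx_tmp", refresh_cache=False)
train_dataset = cached_provider.get_train_dataset()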