Creating NLP-based Data#
In this notebook, we will use a dataset provider abstraction that interfaces with NVIDIA/DeepLearningExamples to load and encode pre-defined or custom data.
Loading and Encoding the Data#
The first step is to create an instance of the NvidiaDatasetProvider, which pre-loads the dataset and offers three methods to retrieve its splits: get_train_dataset(), get_val_dataset() and get_test_dataset(). This is useful both for loading pre-defined datasets and for loading custom data (datasets named with the olx_ prefix), which is composed of raw text files such as train.txt, valid.txt and test.txt.
Additionally, the NvidiaDatasetProvider already encodes and caches the data with a built-in tokenizer. The following arguments can be changed as needed (a configuration sketch follows the list):
dataset_dir: Dataset folder.
cache_dir: Path to the cache folder.
vocab_type: Type of vocabulary/tokenizer.
vocab_size: Vocabulary size.
refresh_cache: Whether the cache should be refreshed.
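As a rough sketch, here is how those arguments could be passed explicitly when constructing the provider. The values mirror the ones used and logged below (gpt2 vocabulary, vocab_size of None, a local cache folder); they are illustrative, not required:

from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider

# Illustrative configuration; the dataset name is positional and the remaining
# keyword arguments correspond to the list above.
provider = NvidiaDatasetProvider(
    "olx_tmp",                       # custom dataset (olx_ prefix)
    dataset_dir="dataroot/olx_tmp",  # folder with train.txt / valid.txt / test.txt
    cache_dir="cache",               # where the encoded corpus is cached
    vocab_type="gpt2",               # type of vocabulary/tokenizer
    vocab_size=None,                 # None falls back to the tokenizer's default size
    refresh_cache=True,              # clear and rebuild the cache
)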
[1]:
import os

from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider

# In this example, we will create a dummy dataset with 3 splits
os.makedirs("dataroot/olx_tmp", exist_ok=True)
with open("dataroot/olx_tmp/train.txt", "w") as f:
    f.write("train")
with open("dataroot/olx_tmp/valid.txt", "w") as f:
    f.write("valid")
with open("dataroot/olx_tmp/test.txt", "w") as f:
    f.write("test")

# Load and encode the splits (the provider trains/caches the tokenizer on first use)
dataset_provider = NvidiaDatasetProvider("olx_tmp", dataset_dir="dataroot/olx_tmp", refresh_cache=True)

train_dataset = dataset_provider.get_train_dataset()
val_dataset = dataset_provider.get_val_dataset()
test_dataset = dataset_provider.get_test_dataset()
print(train_dataset, val_dataset, test_dataset)
2023-03-21 15:12:37,792 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO — Refreshing cache ...
2023-03-21 15:12:37,793 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO — Clearing and rebuilding cache ...
2023-03-21 15:12:37,794 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO — Corpus: dataset = olx_tmp | vocab_type = gpt2 | vocab_size = None
2023-03-21 15:12:37,796 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO — Training vocabulary ...
2023-03-21 15:12:37,797 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — INFO — Training tokenizer with size = 50257 at c:\Users\gderosa\Projects\archai\docs\getting_started\notebooks\nlp\cache\olx_tmp\gpt2\None\vocab\bbpe_tokenizer.json ...
2023-03-21 15:12:37,798 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — INFO — Training tokenizer ...
2023-03-21 15:12:37,827 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — DEBUG — Tokenizer length: 264
2023-03-21 15:12:37,828 - archai.datasets.nlp.tokenizer_utils.bbpe_tokenizer — DEBUG — Tokenizer file path: c:\Users\gderosa\Projects\archai\docs\getting_started\notebooks\nlp\cache\olx_tmp\gpt2\None\vocab\bbpe_tokenizer.json
2023-03-21 15:12:37,830 - archai.datasets.nlp.nvidia_dataset_provider_utils — INFO — Vocabulary trained.
2023-03-21 15:12:37,831 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO — Encoding file: dataroot/olx_tmp\train.txt
2023-03-21 15:12:37,835 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO — Encoding file: dataroot/olx_tmp\valid.txt
2023-03-21 15:12:37,841 - archai.datasets.nlp.tokenizer_utils.tokenizer_base — INFO — Encoding file: dataroot/olx_tmp\test.txt
2023-03-21 15:12:37,843 - archai.datasets.nlp.nvidia_dataset_provider_utils — DEBUG — Size: train = 7 | valid = 7 | test = 6
tensor([200, 222, 85, 83, 66, 74, 79]) tensor([200, 222, 87, 66, 77, 74, 69]) tensor([200, 222, 85, 70, 84, 85])
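Each split is returned as a flat tensor of token ids covering the whole text file. As a minimal follow-up sketch (plain PyTorch, not an Archai API), such a tensor can be sliced into fixed-length input/target pairs for next-token prediction:

import torch

def make_lm_pairs(tokens: torch.Tensor, seq_len: int):
    # Slice a 1-D token tensor into (input, target) windows shifted by one position.
    n = (tokens.numel() - 1) // seq_len
    inputs = tokens[: n * seq_len].view(n, seq_len)
    targets = tokens[1 : n * seq_len + 1].view(n, seq_len)
    return inputs, targets

# Hypothetical usage with the tiny dummy split above (seq_len kept small on purpose)
inputs, targets = make_lm_pairs(train_dataset, seq_len=3)
print(inputs.shape, targets.shape)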