Natural Language Processing#

Hugging Face#

Dataset Provider#

class archai.datasets.nlp.hf_dataset_provider.HfHubDatasetProvider(dataset_name: str, dataset_config_name: str | None = None, data_dir: str | None = None, data_files: str | List[str] | Dict[str, str | List[str]] | None = None, cache_dir: str | None = None, revision: str | Version | None = None)[source]#

Hugging Face Hub dataset provider.

get_dataset(split: str | Split | None = None, refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | DatasetDict | IterableDataset | IterableDatasetDict[source]#

Get a dataset split from the Hugging Face Hub (all available splits are loaded when split is None).

get_train_dataset(split: str | Split | None = 'train', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset[source]#

Get a training dataset.

Returns:

An instance of a training dataset.

get_val_dataset(split: str | Split | None = 'validation', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset[source]#

Get a validation dataset.

Returns:

An instance of a validation dataset, or the training dataset if a validation dataset is not available.

get_test_dataset(split: str | Split | None = 'test', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset[source]#

Get a testing dataset.

Returns:

An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
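
Example

A minimal usage sketch; the dataset and configuration names are illustrative:

>>> from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
>>> provider = HfHubDatasetProvider(
...     "wikitext", dataset_config_name="wikitext-103-raw-v1"
... )
>>> train_dataset = provider.get_train_dataset()
>>> val_dataset = provider.get_val_dataset()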

class archai.datasets.nlp.hf_dataset_provider.HfDiskDatasetProvider(data_dir: str, keep_in_memory: bool | None = False)[source]#

Hugging Face disk-saved dataset provider.

get_train_dataset() Dataset[source]#

Get a training dataset.

Returns:

An instance of a training dataset.

get_val_dataset() Dataset[source]#

Get a validation dataset.

Returns:

An instance of a validation dataset, or the training dataset if a validation dataset is not available.

get_test_dataset() Dataset[source]#

Get a testing dataset.

Returns:

An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
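
Example

A minimal sketch, assuming a dataset previously saved to disk with the datasets library's save_to_disk (the path is illustrative):

>>> from archai.datasets.nlp.hf_dataset_provider import HfDiskDatasetProvider
>>> provider = HfDiskDatasetProvider("path/to/saved_dataset")
>>> train_dataset = provider.get_train_dataset()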

Dataset Provider (Utilities)#

archai.datasets.nlp.hf_dataset_provider_utils.should_refresh_cache(refresh: bool) DownloadMode[source]#

Determine whether to refresh the cached dataset.

This function maps the refresh flag to a DownloadMode value, indicating whether the cached dataset should be refreshed (re-downloaded or re-created) or reused if it exists.

Parameters:

refresh – If True, the cache will be refreshed. If False, the existing cache will be used if it exists.

Returns:

An enumerator indicating whether the cache should be refreshed or not.
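
Example

A minimal sketch; the returned value can be forwarded to the datasets library's download_mode argument:

>>> from archai.datasets.nlp.hf_dataset_provider_utils import should_refresh_cache
>>> download_mode = should_refresh_cache(True)   # re-download/re-create the dataset
>>> download_mode = should_refresh_cache(False)  # reuse the existing cache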

archai.datasets.nlp.hf_dataset_provider_utils.tokenize_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, use_eos_token: bool | None = False, truncate: bool | str | None = True, padding: bool | str | None = 'max_length') Dict[str, Any][source]#

Tokenize a list of examples using a specified tokenizer.

Parameters:
  • examples – A list of examples to be tokenized.

  • tokenizer – The tokenizer to use.

  • mapping_column_name – The columns in examples that should be tokenized.

  • use_eos_token – Whether to append the EOS token to each example.

  • truncate – Whether truncation should be applied.

  • padding – Whether padding should be applied.

Returns:

Tokenized examples.
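
Example

A minimal sketch; "bert-base-uncased" is an illustrative tokenizer that defines the padding token required by the default padding='max_length':

>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> examples = {"text": ["The first example.", "The second example."]}
>>> tokenized = tokenize_dataset(
...     examples, tokenizer=tokenizer, mapping_column_name=["text"]
... )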

archai.datasets.nlp.hf_dataset_provider_utils.tokenize_concatenated_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, use_eos_token: bool | None = False, dtype: dtype | None = None) Dict[str, Any][source]#

Tokenize a list of examples using a specified tokenizer with concatenated batches (no truncation or padding).

Parameters:
  • examples – A list of examples to be tokenized.

  • tokenizer – The tokenizer to use.

  • mapping_column_name – The columns in examples that should be tokenized.

  • use_eos_token – Whether to append the EOS token to each example.

  • dtype – Numpy data type of the tokenized examples.

Returns:

Concatenated tokenized examples.

archai.datasets.nlp.hf_dataset_provider_utils.tokenize_contiguous_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, model_max_length: int | None = 1024) Dict[str, Any][source]#

Tokenize a list of examples using a specified tokenizer with contiguous-length batches (no truncation or padding).

Parameters:
  • examples – A list of examples to be tokenized.

  • tokenizer – The tokenizer to use.

  • mapping_column_name – The columns in examples that should be tokenized.

  • model_max_length – Maximum length of sequences.

Returns:

Contiguous-length tokenized examples.
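
Example

A minimal sketch of chunking a dataset into fixed-length blocks through the datasets library's map (dataset and tokenizer names are illustrative); remove_columns is needed because chunking changes the number of rows:

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_contiguous_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
>>> dataset = dataset.map(
...     tokenize_contiguous_dataset,
...     batched=True,
...     remove_columns=dataset.column_names,
...     fn_kwargs={
...         "tokenizer": tokenizer,
...         "mapping_column_name": ["text"],
...         "model_max_length": 1024,
...     },
... )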

archai.datasets.nlp.hf_dataset_provider_utils.tokenize_nsp_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, truncate: bool | str | None = True, padding: bool | str | None = 'max_length') Dict[str, Any][source]#

Tokenize a list of examples using a specified tokenizer for next-sentence prediction (NSP).

Parameters:
  • examples – A list of examples to be tokenized.

  • tokenizer – The tokenizer to use.

  • mapping_column_name – The columns in examples that should be tokenized.

  • truncate – Whether truncation should be applied.

  • padding – Whether padding should be applied.

Returns:

Tokenized examples with NSP labels.

archai.datasets.nlp.hf_dataset_provider_utils.encode_dataset(dataset: Dataset | DatasetDict | IterableDataset | IterableDatasetDict, tokenizer: AutoTokenizer, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: str | List[str] | None = 'text', batched: bool | None = True, batch_size: int | None = 1000, writer_batch_size: int | None = 1000, num_proc: int | None = None, format_column_name: str | List[str] | None = None) DatasetDict | IterableDatasetDict[source]#

Encode a dataset using a tokenizer.

Parameters:
  • dataset – The dataset to be encoded.

  • tokenizer – The tokenizer to use for encoding.

  • mapping_fn – A function that maps the dataset. If not provided, the default tokenize_dataset function will be used.

  • mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.

  • mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.

  • batched – Whether the mapping should be done in batches or not.

  • batch_size – The number of examples per batch when mapping in batches.

  • writer_batch_size – The number of examples per write operation to cache.

  • num_proc – The number of processes to use for multi-processing.

  • format_column_name – The columns that should be available on the resulting dataset. If str, only one column will be available. If List[str], multiple columns will be available.

Returns:

The encoded dataset.
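
Example

A minimal sketch that encodes the "text" column of a Hub dataset with the default tokenize_dataset mapping (names are illustrative):

>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.hf_dataset_provider_utils import encode_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
>>> encoded = encode_dataset(dataset, tokenizer, mapping_column_name="text")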

Fast Dataset Provider#

class archai.datasets.nlp.fast_hf_dataset_provider.FastHfDatasetProvider(train_file: str, validation_file: str, test_file: str, tokenizer: AutoTokenizer | None = None)[source]#

Fast Hugging Face-based dataset provider.

classmethod from_disk(dataset_file_path: str, tokenizer: AutoTokenizer | None = None, tokenizer_name: str | None = None, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: List[str] | None = None, validation_split: float | None = 0.0, shuffle: bool | None = True, seed: int | None = 42, num_workers: int | None = 1, use_eos_token: bool | None = True, use_shared_memory: bool | None = True, cache_dir: str | None = 'cache') FastHfDatasetProvider[source]#

Load a dataset provider by loading and encoding data from disk.

Parameters:
  • dataset_file_path – Path to the dataset file stored on disk.

  • tokenizer – Instance of tokenizer to use.

  • tokenizer_name – Name of the tokenizer, if tokenizer has not been passed.

  • mapping_fn – A function that maps the dataset. If not provided, the default tokenize_concatenated_dataset function will be used.

  • mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.

  • mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.

  • validation_split – Fraction of the dataset to use for validation.

  • shuffle – Whether to shuffle the dataset.

  • seed – Random seed.

  • num_workers – Number of workers to use for encoding.

  • use_eos_token – Whether to use EOS token to separate sequences.

  • use_shared_memory – Whether to use shared memory for caching.

  • cache_dir – Root path to the cache directory.

Returns:

Dataset provider.

classmethod from_hub(dataset_name: str, dataset_config_name: str | None = None, data_dir: str | None = None, data_files: List[str] | Dict[str, str | List[str]] | None = None, tokenizer: AutoTokenizer | None = None, tokenizer_name: str | None = None, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: List[str] | None = None, validation_split: float | None = 0.0, shuffle: bool | None = True, seed: int | None = 42, num_workers: int | None = 1, use_eos_token: bool | None = True, use_shared_memory: bool | None = True, cache_dir: str | None = 'cache') FastHfDatasetProvider[source]#

Load a dataset provider by downloading and encoding data from Hugging Face Hub.

Parameters:
  • dataset_name – Name of the dataset.

  • dataset_config_name – Name of the dataset configuration.

  • data_dir – Path to the data directory.

  • data_files – Path to the source data file(s).

  • tokenizer – Instance of tokenizer to use.

  • tokenizer_name – Name of the tokenizer, if tokenizer has not been passed.

  • mapping_fn – A function that maps the dataset. If not provided, the default tokenize_concatenated_dataset function will be used.

  • mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.

  • mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.

  • validation_split – Fraction of the dataset to use for validation.

  • shuffle – Whether to shuffle the dataset.

  • seed – Random seed.

  • num_workers – Number of workers to use for encoding.

  • use_eos_token – Whether to use EOS token to separate sequences.

  • use_shared_memory – Whether to use shared memory for caching.

  • cache_dir – Root path to the cache directory.

Returns:

Dataset provider.

classmethod from_cache(cache_dir: str) FastHfDatasetProvider[source]#

Load a dataset provider from a cache directory.

Parameters:

cache_dir – Path to the cache directory.

Returns:

Dataset provider.

get_train_dataset(seq_len: int | None = 1) FastHfDataset[source]#

Get a training dataset.

Returns:

An instance of a training dataset.

get_val_dataset(seq_len: int | None = 1) FastHfDataset[source]#

Get a validation dataset.

Returns:

An instance of a validation dataset, or the training dataset if a validation dataset is not available.

get_test_dataset(seq_len: int | None = 1) FastHfDataset[source]#

Get a testing dataset.

Returns:

An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
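
Example

A minimal sketch that encodes WikiText-103 from the Hub and retrieves a fixed-length training dataset (names are illustrative). Since the encoded data is written to cache_dir, a later run can skip encoding with from_cache:

>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.fast_hf_dataset_provider import FastHfDatasetProvider
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> provider = FastHfDatasetProvider.from_hub(
...     "wikitext", dataset_config_name="wikitext-103-raw-v1", tokenizer=tokenizer
... )
>>> train_dataset = provider.get_train_dataset(seq_len=1024)
>>> provider = FastHfDatasetProvider.from_cache("cache")  # subsequent runs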

class archai.datasets.nlp.fast_hf_dataset_provider.FastDataCollatorForLanguageModeling(tokenizer: PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: int | None = None, tf_experimental_compile: bool = False, return_tensors: str = 'pt', use_shifted_labels: bool = False)[source]#

Language modeling data collator compatible with FastHfDataset.

Parameters:

use_shifted_labels – Whether to use the original labels (shifted) or the non-shifted labels.

use_shifted_labels: bool = False#
torch_call(examples: List[List[int] | Any | Dict[str, Any]]) Dict[str, Any][source]#

Collate a list of examples into a batch of tensors.
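
Example

A minimal sketch of pairing the collator with a PyTorch DataLoader for causal language modeling (mlm=False), assuming train_dataset is a FastHfDataset obtained from a provider:

>>> from torch.utils.data import DataLoader
>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.fast_hf_dataset_provider import FastDataCollatorForLanguageModeling
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> collator = FastDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
>>> loader = DataLoader(train_dataset, batch_size=8, collate_fn=collator)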

Fast Dataset Provider (Utilities)#

class archai.datasets.nlp.fast_hf_dataset_provider_utils.FastHfDataset(input_ids: Tensor, seq_len: int | None = 1)[source]#

Fast Hugging Face dataset.

class archai.datasets.nlp.fast_hf_dataset_provider_utils.SHMArray(input_array: ndarray, shm: SharedMemory | None = None)[source]#

Numpy array compatible with SharedMemory from multiprocessing.shared_memory.

Reference:

https://numpy.org/doc/stable/user/basics.subclassing.html#slightly-more-realistic-example-attribute-added-to-existing-array

archai.datasets.nlp.fast_hf_dataset_provider_utils.process_with_shared_memory(dataset_dict: DatasetDict, dtype: dtype, num_proc: int | None = 1) Dict[str, SHMArray][source]#

Process the dataset with shared memory.

Parameters:
  • dataset_dict – Dataset dictionary.

  • dtype – Numpy data type.

  • num_proc – Number of processes.

Returns:

Dictionary of datasets processed into shared memory.

archai.datasets.nlp.fast_hf_dataset_provider_utils.process_with_memory_map_files(dataset_dict: DatasetDict, cache_dir: str, dtype: dtype, num_proc: int | None = 1) Dict[str, ndarray][source]#

Process the dataset with memory-mapped files.

Parameters:
  • dataset_dict – Dataset dictionary.

  • cache_dir – Cache directory.

  • dtype – Numpy data type.

  • num_proc – Number of processes.

Returns:

Dictionary of datasets processed into memory-mapped files.

archai.datasets.nlp.fast_hf_dataset_provider_utils.xor(p: Any, q: Any) bool[source]#

Implement the logical XOR operator.

Parameters:
  • p – Any instance that may act as True or False.

  • q – Any instance that may act as True or False.

Returns:

The logical XOR of p and q.
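
Example

A minimal sketch; any truthy/falsy values are accepted:

>>> from archai.datasets.nlp.fast_hf_dataset_provider_utils import xor
>>> xor(True, False)
True
>>> xor("tokenizer", "tokenizer_name")  # two truthy values
False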

NVIDIA#

Data Loader (Utilities)#

class archai.datasets.nlp.nvidia_data_loader_utils.LMOrderedIterator(input_ids: LongTensor, bsz: int, bptt: int, device: device | None = None, mem_len: int | None = 0, ext_len: int | None = 0, warmup: bool | None = True)[source]#

Iterator that provides contiguous batches of input tokens without padding.

roll(seed: int) None[source]#

Roll the data according to a random seed.

This method shuffles the input sequence for each batch in the iterator by rolling/shifting the data according to the specified seed. This is useful for creating diverse training data and preventing overfitting.

Parameters:

seed – Seed used to roll/shift the data.

get_batch(i: int, bptt: int | None = None) Tuple[LongTensor, LongTensor, int, bool][source]#

Get a batch of bptt size.

Parameters:
  • i – Identifier of batch.

  • bptt – Sequence length.

Returns:

Tuple of inputs, labels, sequence length and whether batch is from warmup.
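
Example

A minimal construction sketch with dummy token ids (vocabulary size and dimensions are illustrative):

>>> import torch
>>> from archai.datasets.nlp.nvidia_data_loader_utils import LMOrderedIterator
>>> input_ids = torch.randint(0, 1000, (10000,))
>>> iterator = LMOrderedIterator(input_ids, bsz=8, bptt=32)
>>> inputs, labels, seq_len, warmup = iterator.get_batch(0)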

get_fixlen_iter(start: int | None = 0) Generator[Tuple, None, None][source]#

Return a generator for generating fixed-length batches.

This method returns a generator that yields fixed-length batches of the specified size, starting from the specified starting point. The batches are contiguous in the original sequence.

Parameters:

start – Starting point for the generator.

Yields:

Fixed-length batches.

Example

>>> for batch in iterator.get_fixlen_iter():
...     # Process the batch.
...     pass

get_varlen_iter(start: int | None = 0, std: float | None = 5.0, min_len: int | None = 5, max_std: float | None = 3.0) Generator[Tuple, None, None][source]#

Return a generator for generating variable-length batches.

This method returns a generator that yields variable-length batches of data, starting from the specified starting point. The length of each batch is determined by a Gaussian distribution with the specified mean and standard deviation.

Parameters:
  • start – Starting point for the generator.

  • std – Standard deviation.

  • min_len – Minimum length.

  • max_std – Max standard deviation.

Yields:

Variable-length batches.

Example

>>> for batch in iterator.get_varlen_iter():
...     # Process the batch.
...     pass

class archai.datasets.nlp.nvidia_data_loader_utils.LMMultiFileIterator(paths: List[str], vocab: TokenizerBase, bsz: int, bptt: int, device: str | None = 'cpu', mem_len: int | None = 0, ext_len: int | None = 0, n_chunks: int | None = 16, shuffle: bool | None = False)[source]#

Multi-file, non-ordered iterator, i.e., tokens come from different files but remain contiguous.

roll(seed: int | None = 0) None[source]#

Provided for backward compatibility with the LMOrderedIterator API.

get_sequences(path: str) LongTensor[source]#

Get a tensor of sequences from an input file.

Parameters:

path – A path to the input file.

Returns:

Tensor with encoded inputs.

stream_iterator(iterator: Iterator) Generator[Tuple, None, None][source]#

Create a streaming-based iterator.

Parameters:

iterator – Iterator with chunks of sequences.

Yields:

Stream-based batch.

Dataset Provider#

class archai.datasets.nlp.nvidia_dataset_provider.NvidiaDatasetProvider(dataset_name: str | None = 'wt103', dataset_dir: str | None = '', cache_dir: str | None = 'cache', vocab_type: str | None = 'gpt2', vocab_size: int | None = None, refresh_cache: bool | None = False)[source]#

NVIDIA dataset provider.

get_train_dataset() List[int][source]#

Get a training dataset.

Returns:

An instance of a training dataset.

get_val_dataset() List[int][source]#

Get a validation dataset.

Returns:

An instance of a validation dataset, or the training dataset if a validation dataset is not available.

get_test_dataset() List[int][source]#

Get a testing dataset.

Returns:

An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
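
Example

A minimal sketch; dataset_dir is assumed to contain the raw WikiText-103 files:

>>> from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider
>>> provider = NvidiaDatasetProvider(
...     dataset_name="wt103", dataset_dir="dataroot/wikitext-103"
... )
>>> train_data = provider.get_train_dataset()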

Dataset Provider (Utilities)#

class archai.datasets.nlp.nvidia_dataset_provider_utils.Corpus(dataset_name: str, dataset_dir: str, cache_dir: str, vocab_type: str, vocab_size: int | None = None, refresh_cache: bool | None = False)[source]#

Create and train the vocabulary/tokenizer, load the dataset and encode the data.

train_and_encode() None[source]#

Train the vocabulary/tokenizer and encode the corpus.

load() bool[source]#

Load a pre-trained corpus.

Returns:

Whether the pre-trained corpus was successfully loaded.

save_cache() None[source]#

Save the cache.
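
Example

A minimal sketch of the load-or-create cycle (paths are illustrative):

>>> from archai.datasets.nlp.nvidia_dataset_provider_utils import Corpus
>>> corpus = Corpus("wt103", "dataroot/wikitext-103", "cache", "gpt2")
>>> if not corpus.load():
...     corpus.train_and_encode()
...     corpus.save_cache()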