Natural Language Processing#
Hugging Face#
Dataset Provider#
- class archai.datasets.nlp.hf_dataset_provider.HfHubDatasetProvider(dataset_name: str, dataset_config_name: str | None = None, data_dir: str | None = None, data_files: str | List[str] | Dict[str, str | List[str]] | None = None, cache_dir: str | None = None, revision: str | Version | None = None)[source]#
Hugging Face Hub dataset provider.
- get_dataset(split: str | Split | None = None, refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | DatasetDict | IterableDataset | IterableDatasetDict [source]#
- get_train_dataset(split: str | Split | None = 'train', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset [source]#
Get a training dataset.
- Returns:
An instance of a training dataset.
- get_val_dataset(split: str | Split | None = 'validation', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset [source]#
Get a validation dataset.
- Returns:
An instance of a validation dataset, or the training dataset if a validation dataset is not available.
- get_test_dataset(split: str | Split | None = 'test', refresh_cache: bool | None = False, keep_in_memory: bool | None = False, streaming: bool | None = False) Dataset | IterableDataset [source]#
Get a testing dataset.
- Returns:
An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
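A minimal usage sketch (the dataset and configuration names are illustrative, not a fixed part of the API):
>>> from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
>>> provider = HfHubDatasetProvider("wikitext", dataset_config_name="wikitext-103-raw-v1")
>>> train_dataset = provider.get_train_dataset()
>>> val_dataset = provider.get_val_dataset()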
- class archai.datasets.nlp.hf_dataset_provider.HfDiskDatasetProvider(data_dir: str, keep_in_memory: bool | None = False)[source]#
Hugging Face disk-saved dataset provider.
- get_train_dataset() Dataset [source]#
Get a training dataset.
- Returns:
An instance of a training dataset.
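A minimal sketch, assuming a dataset previously saved with datasets.Dataset.save_to_disk (the path is hypothetical):
>>> from archai.datasets.nlp.hf_dataset_provider import HfDiskDatasetProvider
>>> provider = HfDiskDatasetProvider("path/to/saved_dataset")
>>> train_dataset = provider.get_train_dataset()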
Dataset Provider (Utilities)#
- archai.datasets.nlp.hf_dataset_provider_utils.should_refresh_cache(refresh: bool) DownloadMode [source]#
Determine whether to refresh the cached dataset.
This function determines whether the cached dataset should be refreshed by re-downloading or re-creating it based on the value of the refresh parameter.
- Parameters:
refresh – If True, the cache will be refreshed. If False, the existing cache will be used if it exists.
- Returns:
An enumerator indicating whether the cache should be refreshed or not.
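A minimal sketch of the expected behavior (the returned members come from datasets.DownloadMode):
>>> from archai.datasets.nlp.hf_dataset_provider_utils import should_refresh_cache
>>> mode = should_refresh_cache(True)   # refresh the cache by re-downloading/re-creating it
>>> mode = should_refresh_cache(False)  # reuse the existing cache if it exists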
- archai.datasets.nlp.hf_dataset_provider_utils.tokenize_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, use_eos_token: bool | None = False, truncate: bool | str | None = True, padding: bool | str | None = 'max_length') Dict[str, Any] [source]#
Tokenize a list of examples using a specified tokenizer.
- Parameters:
examples – A list of examples to be tokenized.
tokenizer – The tokenizer to use.
mapping_column_name – The columns in examples that should be tokenized.
use_eos_token – Whether to append the EOS token to each example.
truncate – Whether truncation should be applied.
padding – Whether padding should be applied.
- Returns:
Tokenized examples.
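A minimal sketch, assuming a tokenizer that already defines a padding token (the model name is illustrative):
>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> examples = {"text": ["First example.", "Second example."]}
>>> tokenized = tokenize_dataset(examples, tokenizer=tokenizer, mapping_column_name=["text"])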
- archai.datasets.nlp.hf_dataset_provider_utils.tokenize_concatenated_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, use_eos_token: bool | None = False, dtype: dtype | None = None) Dict[str, Any] [source]#
Tokenize a list of examples using a specified tokenizer and concatenate them into batches (no truncation or padding).
- Parameters:
examples – A list of examples to be tokenized.
tokenizer – The tokenizer to use.
mapping_column_name – The columns in examples that should be tokenized.
use_eos_token – Whether to append the EOS token to each example.
dtype – Numpy data type of the tokenized examples.
- Returns:
Concatenated tokenized examples.
- archai.datasets.nlp.hf_dataset_provider_utils.tokenize_contiguous_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, model_max_length: int | None = 1024) Dict[str, Any] [source]#
Tokenize a list of examples using a specified tokenizer into contiguous-length batches (no truncation or padding).
- Parameters:
examples – A list of examples to be tokenized.
tokenizer – The tokenizer to use.
mapping_column_name – The columns in examples that should be tokenized.
model_max_length – Maximum length of sequences.
- Returns:
Contiguous-length tokenized examples.
- archai.datasets.nlp.hf_dataset_provider_utils.tokenize_nsp_dataset(examples: Dict[str, List[str]], tokenizer: AutoTokenizer | None = None, mapping_column_name: List[str] | None = None, truncate: bool | str | None = True, padding: bool | str | None = 'max_length') Dict[str, Any] [source]#
Tokenize a list of examples using a specified tokenizer and with next-sentence prediction (NSP).
- Parameters:
examples – A list of examples to be tokenized.
tokenizer – The tokenizer to use.
mapping_column_name – The columns in examples that should be tokenized.
truncate – Whether truncation should be applied.
padding – Whether padding should be applied.
- Returns:
Tokenized examples with NSP labels.
- archai.datasets.nlp.hf_dataset_provider_utils.encode_dataset(dataset: Dataset | DatasetDict | IterableDataset | IterableDatasetDict, tokenizer: AutoTokenizer, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: str | List[str] | None = 'text', batched: bool | None = True, batch_size: int | None = 1000, writer_batch_size: int | None = 1000, num_proc: int | None = None, format_column_name: str | List[str] | None = None) DatasetDict | IterableDatasetDict [source]#
Encode a dataset using a tokenizer.
- Parameters:
dataset – The dataset to be encoded.
tokenizer – The tokenizer to use for encoding.
mapping_fn – A function that maps the dataset. If not provided, the default tokenize_dataset function will be used.
mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.
mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.
batched – Whether the mapping should be done in batches or not.
batch_size – The number of examples per batch when mapping in batches.
writer_batch_size – The number of examples per write operation to cache.
num_proc – The number of processes to use for multi-processing.
format_column_name – The columns that should be available on the resulting dataset. If str, only one column will be available. If List[str], multiple columns will be available.
- Returns:
The encoded dataset.
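A usage sketch tying the pieces above together (dataset and model names are illustrative). The default mapping is tokenize_dataset; a different strategy, such as tokenize_contiguous_dataset, can be plugged in through mapping_fn:
>>> from transformers import AutoTokenizer
>>> from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
>>> from archai.datasets.nlp.hf_dataset_provider_utils import encode_dataset, tokenize_contiguous_dataset
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.pad_token = tokenizer.eos_token  # GPT-2 defines no padding token by default
>>> dataset = HfHubDatasetProvider("wikitext", dataset_config_name="wikitext-2-raw-v1").get_train_dataset()
>>> encoded = encode_dataset(dataset, tokenizer, mapping_column_name="text")
>>> contiguous = encode_dataset(
...     dataset,
...     tokenizer,
...     mapping_fn=tokenize_contiguous_dataset,
...     mapping_fn_kwargs={"model_max_length": 1024},
...     mapping_column_name="text",
... )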
Fast Dataset Provider#
- class archai.datasets.nlp.fast_hf_dataset_provider.FastHfDatasetProvider(train_file: str, validation_file: str, test_file: str, tokenizer: AutoTokenizer | None = None)[source]#
Fast Hugging Face-based dataset provider.
- classmethod from_disk(dataset_file_path: str, tokenizer: AutoTokenizer | None = None, tokenizer_name: str | None = None, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: List[str] | None = None, validation_split: float | None = 0.0, shuffle: bool | None = True, seed: int | None = 42, num_workers: int | None = 1, use_eos_token: bool | None = True, use_shared_memory: bool | None = True, cache_dir: str | None = 'cache') FastHfDatasetProvider [source]#
Load a dataset provider by loading and encoding data from disk.
- Parameters:
dataset_file_path – Path to the dataset file stored on disk.
tokenizer – Instance of tokenizer to use.
tokenizer_name – Name of the tokenizer, if tokenizer has not been passed.
mapping_fn – A function that maps the dataset. If not provided, the default tokenize_concatenated_dataset function will be used.
mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.
mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.
validation_split – Fraction of the dataset to use for validation.
shuffle – Whether to shuffle the dataset.
seed – Random seed.
num_workers – Number of workers to use for encoding.
use_eos_token – Whether to use EOS token to separate sequences.
use_shared_memory – Whether to use shared memory for caching.
cache_dir – Root path to the cache directory.
- Returns:
Dataset provider.
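A minimal sketch, assuming a raw dataset file already stored on disk (the path is hypothetical):
>>> from archai.datasets.nlp.fast_hf_dataset_provider import FastHfDatasetProvider
>>> provider = FastHfDatasetProvider.from_disk(
...     "data/corpus", tokenizer_name="gpt2", validation_split=0.1
... )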
- classmethod from_hub(dataset_name: str, dataset_config_name: str | None = None, data_dir: str | None = None, data_files: List[str] | Dict[str, str | List[str]] | None = None, tokenizer: AutoTokenizer | None = None, tokenizer_name: str | None = None, mapping_fn: Callable[[Any], Dict[str, Any]] | None = None, mapping_fn_kwargs: Dict[str, Any] | None = None, mapping_column_name: List[str] | None = None, validation_split: float | None = 0.0, shuffle: bool | None = True, seed: int | None = 42, num_workers: int | None = 1, use_eos_token: bool | None = True, use_shared_memory: bool | None = True, cache_dir: str | None = 'cache') FastHfDatasetProvider [source]#
Load a dataset provider by downloading and encoding data from Hugging Face Hub.
- Parameters:
dataset_name – Name of the dataset.
dataset_config_name – Name of the dataset configuration.
data_dir – Path to the data directory.
data_files – Path to the source data file(s).
tokenizer – Instance of tokenizer to use.
tokenizer_name – Name of the tokenizer, if tokenizer has not been passed.
mapping_fn – A function that maps the dataset. If not provided, the default tokenize_concatenated_dataset function will be used.
mapping_fn_kwargs – Keyword arguments to pass to mapping_fn.
mapping_column_name – The columns in the dataset to be tokenized. If str, only one column will be tokenized. If List[str], multiple columns will be tokenized.
validation_split – Fraction of the dataset to use for validation.
shuffle – Whether to shuffle the dataset.
seed – Random seed.
num_workers – Number of workers to use for encoding.
use_eos_token – Whether to use EOS token to separate sequences.
use_shared_memory – Whether to use shared memory for caching.
cache_dir – Root path to the cache directory.
- Returns:
Dataset provider.
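A minimal sketch (dataset and tokenizer names are illustrative):
>>> from archai.datasets.nlp.fast_hf_dataset_provider import FastHfDatasetProvider
>>> provider = FastHfDatasetProvider.from_hub(
...     "wikitext", dataset_config_name="wikitext-103-raw-v1", tokenizer_name="gpt2"
... )
>>> train_dataset = provider.get_train_dataset(seq_len=1024)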
- classmethod from_cache(cache_dir: str) FastHfDatasetProvider [source]#
Load a dataset provider from a cache directory.
- Parameters:
cache_dir – Path to the cache directory.
- Returns:
Dataset provider.
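Since encoded data is cached, subsequent runs can skip downloading and encoding entirely (assuming the default cache root used above):
>>> provider = FastHfDatasetProvider.from_cache("cache")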
- get_train_dataset(seq_len: int | None = 1) FastHfDataset [source]#
Get a training dataset.
- Returns:
An instance of a training dataset.
- get_val_dataset(seq_len: int | None = 1) FastHfDataset [source]#
Get a validation dataset.
- Returns:
An instance of a validation dataset, or the training dataset if a validation dataset is not available.
- get_test_dataset(seq_len: int | None = 1) FastHfDataset [source]#
Get a testing dataset.
- Returns:
An instance of a testing dataset, or the training/validation dataset if a testing dataset is not available.
- class archai.datasets.nlp.fast_hf_dataset_provider.FastDataCollatorForLanguageModeling(tokenizer: PreTrainedTokenizerBase, mlm: bool = True, mlm_probability: float = 0.15, pad_to_multiple_of: int | None = None, tf_experimental_compile: bool = False, return_tensors: str = 'pt', use_shifted_labels: bool = False)[source]#
Language modeling data collator compatible with FastHfDataset.
- Parameters:
use_shifted_labels – Whether to use the original (shifted) labels or the non-shifted labels.
- use_shifted_labels: bool = False#
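A usage sketch wiring the collator into a PyTorch DataLoader, assuming the tokenizer and train_dataset from the examples above; mlm=False selects causal language modeling:
>>> from torch.utils.data import DataLoader
>>> from archai.datasets.nlp.fast_hf_dataset_provider import FastDataCollatorForLanguageModeling
>>> collator = FastDataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, use_shifted_labels=True)
>>> loader = DataLoader(train_dataset, batch_size=8, collate_fn=collator)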
Fast Dataset Provider (Utilities)#
- class archai.datasets.nlp.fast_hf_dataset_provider_utils.FastHfDataset(input_ids: Tensor, seq_len: int | None = 1)[source]#
Fast Hugging Face dataset.
- class archai.datasets.nlp.fast_hf_dataset_provider_utils.SHMArray(input_array: ndarray, shm: SharedMemory | None = None)[source]#
Numpy array compatible with SharedMemory from multiprocessing.shared_memory.
- archai.datasets.nlp.fast_hf_dataset_provider_utils.process_with_shared_memory(dataset_dict: DatasetDict, dtype: dtype, num_proc: int | None = 1) Dict[str, SHMArray] [source]#
Process the dataset with shared memory.
- Parameters:
dataset_dict – Dataset dictionary.
dtype – Numpy data type.
num_proc – Number of processes.
- Returns:
Dictionary of datasets backed by shared memory.
- archai.datasets.nlp.fast_hf_dataset_provider_utils.process_with_memory_map_files(dataset_dict: DatasetDict, cache_dir: str, dtype: dtype, num_proc: int | None = 1) Dict[str, ndarray] [source]#
Process the dataset with memory map files.
- Parameters:
dataset_dict – Dataset dictionary.
cache_dir – Cache directory.
dtype – Numpy data type.
num_proc – Number of processes.
- Returns:
Dictionary of datasets backed by memory map files.
NVIDIA#
Data Loader (Utilities)#
- class archai.datasets.nlp.nvidia_data_loader_utils.LMOrderedIterator(input_ids: LongTensor, bsz: int, bptt: int, device: device | None = None, mem_len: int | None = 0, ext_len: int | None = 0, warmup: bool | None = True)[source]#
Iterator that provides contiguous batches of input tokens without padding.
- roll(seed: int) None [source]#
Roll the data according to a random seed.
This method shuffles the input sequence for each batch in the iterator by rolling/shifting the data according to the specified seed. This is useful for creating diverse training data and preventing overfitting.
- Parameters:
seed – Seed used to roll/shift the data.
- get_batch(i: int, bptt: int | None = None) Tuple[LongTensor, LongTensor, int, bool] [source]#
Get a batch of length bptt.
- Parameters:
i – Index of the batch.
bptt – Sequence length.
- Returns:
Tuple of inputs, labels, sequence length and whether batch is from warmup.
- get_fixlen_iter(start: int | None = 0) Generator[Tuple, None, None] [source]#
Return a generator for generating fixed-length batches.
This method returns a generator that yields fixed-length batches of bptt size, starting from the specified starting point. The batches are contiguous in the original sequence.
- Parameters:
start – Starting point for the generator.
- Yields:
Fixed-length batches.
Example
>>> for batch in iterator.get_fixlen_iter():
...     # Process the batch.
...     pass
- get_varlen_iter(start: int | None = 0, std: float | None = 5.0, min_len: int | None = 5, max_std: float | None = 3.0) Generator[Tuple, None, None] [source]#
Return a generator for generating variable-length batches.
This method returns a generator that yields variable-length batches of data, starting from the specified starting point. The length of each batch is drawn from a Gaussian distribution centered on bptt, with the specified standard deviation.
- Parameters:
start – Starting point for the generator.
std – Standard deviation.
min_len – Minimum length.
max_std – Max standard deviation.
- Yields:
Variable-length batches.
Example
>>> for batch in iterator.get_varlen_iter():
...     # Process the batch.
...     pass
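A minimal construction sketch with random token identifiers (all shapes and sizes are illustrative):
>>> import torch
>>> from archai.datasets.nlp.nvidia_data_loader_utils import LMOrderedIterator
>>> input_ids = torch.randint(0, 10000, (100000,))
>>> iterator = LMOrderedIterator(input_ids, bsz=32, bptt=128)
>>> inputs, labels, seq_len, warmup = iterator.get_batch(0)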
- class archai.datasets.nlp.nvidia_data_loader_utils.LMMultiFileIterator(paths: List[str], vocab: TokenizerBase, bsz: int, bptt: int, device: str | None = 'cpu', mem_len: int | None = 0, ext_len: int | None = 0, n_chunks: int | None = 16, shuffle: bool | None = False)[source]#
Multi-file, non-ordered iterator, i.e., tokens come from different files but remain contiguous.
Dataset Provider#
- class archai.datasets.nlp.nvidia_dataset_provider.NvidiaDatasetProvider(dataset_name: str | None = 'wt103', dataset_dir: str | None = '', cache_dir: str | None = 'cache', vocab_type: str | None = 'gpt2', vocab_size: int | None = None, refresh_cache: bool | None = False)[source]#
NVIDIA dataset provider.
- get_train_dataset() List[int] [source]#
Get a training dataset.
- Returns:
An instance of a training dataset.
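A minimal sketch (the dataset directory is hypothetical and must follow the layout the provider expects):
>>> from archai.datasets.nlp.nvidia_dataset_provider import NvidiaDatasetProvider
>>> provider = NvidiaDatasetProvider("wt103", dataset_dir="dataroot/wikitext-103", cache_dir="cache")
>>> train_dataset = provider.get_train_dataset()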
Dataset Provider (Utilities)#
- class archai.datasets.nlp.nvidia_dataset_provider_utils.Corpus(dataset_name: str, dataset_dir: str, cache_dir: str, vocab_type: str, vocab_size: int | None = None, refresh_cache: bool | None = False)[source]#
Create and train the vocabulary/tokenizer, load the dataset, and encode the data.
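A minimal construction sketch mirroring the provider above (all argument values are illustrative):
>>> from archai.datasets.nlp.nvidia_dataset_provider_utils import Corpus
>>> corpus = Corpus("wt103", "dataroot/wikitext-103", "cache", "gpt2")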