Training NLP-based Models with Hugging Face#

Training an NLP-based model involves several steps, including loading the data, encoding the data, defining the model architecture, and conducting the actual training process.

Archai implements abstract base classes that define the expected behavior of key components, such as datasets (DatasetProvider) and trainers (TrainerBase). Additionally, we offer boilerplate classes for the most common frameworks, for example, a DatasetProvider compatible with huggingface/datasets and a TrainerBase compatible with huggingface/transformers.
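
For datasets that are not covered by the boilerplate classes, a custom provider can follow the same pattern. The sketch below is only illustrative: the import path and the method names (get_train_dataset, get_val_dataset, get_test_dataset) are assumptions based on the provider used in the next section.

from archai.api.dataset_provider import DatasetProvider  # assumed import path

class MyDatasetProvider(DatasetProvider):
    """Hypothetical provider that returns custom train/validation/test splits."""

    def get_train_dataset(self):
        # Return the training split (assumed abstract method)
        raise NotImplementedError

    def get_val_dataset(self):
        # Return the validation split (assumed abstract method)
        raise NotImplementedError

    def get_test_dataset(self):
        # Return the test split (assumed abstract method)
        raise NotImplementedError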

Loading and Encoding the Data#

When using a dataset provider, such as Hugging Face’s datasets library, the data loading process is simplified, as the provider takes care of downloading and pre-processing the required dataset. Next, the data needs to be encoded, typically by converting text data into numerical representations that can be fed into the model.

This step is accomplished in the same way as in the previous notebook:

[1]:
from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from archai.datasets.nlp.hf_dataset_provider import HfHubDatasetProvider
from archai.datasets.nlp.hf_dataset_provider_utils import tokenize_dataset

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono", model_max_length=1024)
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
dataset_provider = HfHubDatasetProvider("wikitext", dataset_config_name="wikitext-103-raw-v1")

# When loading `train_dataset`, we override the `split` argument to load only 1%
# of the data and speed up its encoding
train_dataset = dataset_provider.get_train_dataset(split="train[:1%]")
encoded_train_dataset = train_dataset.map(tokenize_dataset, batched=True, fn_kwargs={"tokenizer": tokenizer})
Found cached dataset wikitext (C:/Users/gderosa/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Loading cached processed dataset at C:\Users\gderosa\.cache\huggingface\datasets\wikitext\wikitext-103-raw-v1\1.0.0\a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126\cache-04d7ff93d438ade6.arrow
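
Optionally, the encoded dataset can be inspected before training. This is a minimal sketch that only reuses the `encoded_train_dataset` object created above; the exact column names depend on `tokenize_dataset`, but an `input_ids` column is typically present after encoding.

# Optional sanity check of the encoded split
print(encoded_train_dataset)
print(encoded_train_dataset[0]["input_ids"][:10])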

Defining the Model#

Once the data is encoded, we can define any NLP-based model. In this example, we will use a CodeGen architecture from huggingface/transformers.

[2]:
from transformers import CodeGenConfig, CodeGenForCausalLM

config = CodeGenConfig(
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    rotary_dim=16,
    bos_token_id=0,
    eos_token_id=0,
    vocab_size=50295,
)
model = CodeGenForCausalLM(config=config)
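
Since the configuration above initializes the model from scratch, a quick parameter count can confirm its size before training. This minimal sketch only relies on standard PyTorch attributes of the `model` defined above.

# Count the trainable parameters of the randomly initialized model
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Number of trainable parameters: {num_params / 1e6:.1f}M")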

Running the Trainer#

The final step is to use the Hugging Face trainer abstraction (HfTrainer) to conduct the training process, which involves optimizing the model’s parameters with a pre-defined optimization algorithm and loss function, updating them based on the training data. This process is repeated until the model converges to a satisfactory level of performance.

[3]:
from transformers import TrainingArguments
from archai.trainers.nlp.hf_trainer import HfTrainer

training_args = TrainingArguments(
    "hf-codegen",
    evaluation_strategy="no",
    logging_steps=1,
    per_device_train_batch_size=1,
    learning_rate=0.01,
    weight_decay=0.1,
    max_steps=1,
)
trainer = HfTrainer(
    model=model,
    args=training_args,
    data_collator=collator,
    train_dataset=encoded_train_dataset,
)

trainer.train()
c:\Users\gderosa\Anaconda3\envs\archai\lib\site-packages\transformers\optimization.py:395: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  FutureWarning,
You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 10.8762, 'learning_rate': 0.0, 'epoch': 0.0}
{'train_runtime': 25.7651, 'train_samples_per_second': 0.039, 'train_steps_per_second': 0.039, 'train_loss': 10.876193046569824, 'epoch': 0.0}
[3]:
TrainOutput(global_step=1, training_loss=10.876193046569824, metrics={'train_runtime': 25.7651, 'train_samples_per_second': 0.039, 'train_steps_per_second': 0.039, 'train_loss': 10.876193046569824, 'epoch': 0.0})
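
After training completes, the model can be persisted for later use. The snippet below assumes that HfTrainer exposes the standard Hugging Face Trainer saving utilities; the output path is illustrative.

# Persist the trained weights and the tokenizer (output path is illustrative)
trainer.save_model("hf-codegen/final-model")
tokenizer.save_pretrained("hf-codegen/final-model")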