Distillation
With the pymarlin library, distillation can be done in a standalone manner or as an extension to your original training Scenario. In this example, we will go through how the GLUE Task setup was extended to also perform distillation.
Data Preprocessing is the same as here. The main implementation is in the ModuleInterface, which we chose to call DistillRecipe (inheriting from the GLUE Recipe). The key methods of DistillRecipe that we want to override are:
1. Set up the teacher and student models and related items, such as configs, as needed. Here, we have the option to modify the student config depending on the desired changes to the depth or width of the model (see the config sketch after this list).
2. Modify train_step to set the teacher in eval mode, get the teacher outputs, get the student outputs, and compute a custom loss. The loss can be a combination of logits, labels, or various intermediate representations such as hidden_states and attentions. You have the flexibility to determine your distillation logic; an example loss is sketched below.
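For the model setup, here is a minimal sketch of deriving a smaller student from the teacher. It assumes Hugging Face transformers models and a bert-base-uncased checkpoint; the model classes, config fields, and the way they are wired into your ModuleInterface are illustrative, not pymarlin's actual implementation.

```python
# Sketch only: build a student config by shrinking the teacher's depth.
from transformers import AutoConfig, AutoModelForSequenceClassification

teacher_name = "bert-base-uncased"  # hypothetical checkpoint for illustration

teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name, num_labels=2)

student_config = AutoConfig.from_pretrained(teacher_name, num_labels=2)
student_config.num_hidden_layers = 6   # reduce depth (the teacher has 12 layers)
# student_config.hidden_size = 384     # reducing width also requires resizing weights

# Randomly initialized student with the smaller config; in practice you would
# usually copy a subset of the teacher's layers into it before training.
student = AutoModelForSequenceClassification.from_config(student_config)
```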
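The train_step logic could look roughly like the following sketch. The function name, the temperature and alpha hyperparameters, and the assumption that the batch is a dict accepted by both models (including labels) are all assumptions for illustration, not pymarlin's actual API.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, batch, temperature=2.0, alpha=0.5):
    """Illustrative train_step body: soft-label KD loss blended with the hard-label loss."""
    teacher.eval()                              # teacher stays frozen in eval mode
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits

    student_out = student(**batch)              # batch contains "labels", so .loss is the CE term
    student_logits = student_out.logits

    # KL divergence between temperature-softened teacher and student distributions
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Blend the distillation loss with the usual supervised loss; terms built from
    # hidden_states or attentions could be added here as well.
    return alpha * kd_loss + (1.0 - alpha) * student_out.loss
```

Scaling the soft-label term by temperature squared keeps its gradient magnitude comparable to the hard-label term, following the standard knowledge-distillation recipe.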
Additionally, on_end_train can be used to clean up any changes made to the final student model config and save it to the output directory along with the student model; a sketch follows.
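A minimal sketch of that cleanup-and-save step is shown below. The flags being reset and the output filenames are assumptions for illustration, not pymarlin's actual behavior.

```python
import os
import torch

def save_student(student, student_config, output_dir):
    # Undo flags that were only needed to expose intermediate outputs during
    # distillation (hypothetical cleanup; adjust to whatever your setup changed).
    student_config.output_hidden_states = False
    student_config.output_attentions = False

    os.makedirs(output_dir, exist_ok=True)
    student_config.save_pretrained(output_dir)   # writes config.json
    torch.save(student.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))
```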
That's it! If you already have a scenario set up, it's as easy as overriding just 2 methods.