Distillation
With the pymarlin library, distillation can be done in a standalone manner or as an extension to your original training scenario. In this example, we will go through how the GLUE task setup was extended to also perform distillation.
Data preprocessing is the same as in the original GLUE task setup. The main implementation is in the `ModuleInterface`, which we chose to call `DistillRecipe` (inheriting from the GLUE Recipe).
The key methods of `DistillRecipe` that we want to override are:

1. Setting up the teacher and student models and related items, such as their configs, as needed. Here we have the option to modify the student config depending on the desired changes to the depth or width of the model.
2. Modifying `train_step` to set the teacher in eval mode, get the teacher outputs, get the student outputs, and compute a custom loss. The loss can be a combination of `logits`, `labels`, or various intermediate representations such as `hidden_states` and `attentions`; you have the flexibility to determine your own distillation logic (see the sketch after this list).

As an example, `on_end_train` can be used to clean up any changes made to the final student model config and save it to the output directory along with the student model.
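Putting these pieces together, here is a minimal sketch of what a `DistillRecipe` could look like. It is not the exact pymarlin example: the `GlueRecipe` base class and its import path, the `train_step(global_step, batch, device)` and `on_end_train(global_step)` signatures, and the constructor arguments are assumptions made for illustration, and the teacher/student models come from Hugging Face `transformers`. Adapt the names and signatures to your own scenario.

```python
import os

import torch
import torch.nn.functional as F
from transformers import AutoConfig, AutoModelForSequenceClassification

# Hypothetical import: use the Recipe class from your original GLUE scenario.
from glue_recipe import GlueRecipe


class DistillRecipe(GlueRecipe):
    def __init__(self, teacher_path, student_path, num_labels, output_dir,
                 temperature=2.0, alpha=0.5, num_student_layers=6, **kwargs):
        super().__init__(**kwargs)
        self.output_dir = output_dir
        self.temperature = temperature  # softens the logits for the distillation loss
        self.alpha = alpha              # weight between hard-label loss and KD loss

        # Teacher: the full-size fine-tuned model.
        self.teacher = AutoModelForSequenceClassification.from_pretrained(
            teacher_path, num_labels=num_labels)

        # Student: modify the config to shrink depth (or width) before instantiating.
        student_config = AutoConfig.from_pretrained(student_path, num_labels=num_labels)
        student_config.num_hidden_layers = num_student_layers
        self.model = AutoModelForSequenceClassification.from_config(student_config)

    def train_step(self, global_step, batch, device):
        # Assumed signature and return value, mirroring the GLUE Recipe's train_step.
        batch = {k: v.to(device) for k, v in batch.items()}
        labels = batch.pop("labels")

        # Teacher runs in eval mode with gradients disabled.
        self.teacher.to(device)  # no-op after the first call
        self.teacher.eval()
        with torch.no_grad():
            teacher_logits = self.teacher(**batch).logits

        student_out = self.model(**batch, labels=labels)
        t = self.temperature

        # Soft-label KD loss: KL divergence between temperature-softened
        # distributions, combined with the usual cross-entropy on hard labels.
        kd_loss = F.kl_div(
            F.log_softmax(student_out.logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean") * (t * t)
        loss = self.alpha * student_out.loss + (1.0 - self.alpha) * kd_loss
        return loss

    def on_end_train(self, global_step):
        # Clean up any temporary config changes and save the distilled student.
        self.model.config.output_hidden_states = False
        os.makedirs(self.output_dir, exist_ok=True)
        self.model.save_pretrained(self.output_dir)
```

The temperature-softened KL term above is the classic soft-label distillation loss; swapping in losses over `hidden_states` or `attentions` only changes the body of `train_step`. If your optimizer setup uses `self.parameters()`, make sure it covers only the student's parameters so the teacher stays frozen.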
That's it! If you already have a scenario set up, it's as easy as overriding just two methods.