core.trainer_backend
Trainer Backend module.
Currently we support single-process (sp), native AMP (sp-amp), Apex AMP (sp-amp-apex), and their DistributedDataParallel counterparts (ddp, ddp-amp, ddp-amp-apex).
These TrainerBackends cover the most common scenarios out of the box.
Alternatively, a user can provide a custom TrainerBackend.
build_trainer_backend#
Factory for trainer_backends
Arguments:
- `trainer_backend_name` _str_ - TrainerBackend name. Possible choices are currently: sp, sp-amp, sp-amp-apex, ddp, ddp-amp, ddp-amp-apex
- `args` _sequence_ - TrainerBackend positional arguments
- `kwargs` _dict_ - TrainerBackend keyword arguments
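A minimal sketch of how such a name-to-class factory can work. The backend classes and registry below are illustrative stand-ins, not the library's actual implementation:

```python
class TrainerBackend:
    """Abstract base for trainer backends (stand-in for the real class)."""

class SingleProcess(TrainerBackend):
    def __init__(self, *args, **kwargs):
        self.args, self.kwargs = args, kwargs

class SingleProcessAmp(SingleProcess):
    pass

# Hypothetical registry mapping names from the docs to backend classes.
_BACKENDS = {
    "sp": SingleProcess,
    "sp-amp": SingleProcessAmp,
}

def build_trainer_backend(trainer_backend_name, *args, **kwargs):
    """Look up the backend class by name and instantiate it."""
    try:
        cls = _BACKENDS[trainer_backend_name]
    except KeyError:
        raise ValueError(f"Unknown trainer backend: {trainer_backend_name!r}")
    return cls(*args, **kwargs)

backend = build_trainer_backend("sp-amp")
print(type(backend).__name__)  # SingleProcessAmp
```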
TrainerBackendArguments Objects#
Trainer Backend Arguments dataclass.
TrainerBackend Objects#
Trainer Backend abstract class.
OutputCollector Objects#
Responsible for collecting step outputs and storing them in memory across calls. Concatenates tensors from all steps along the first dimension.
collect#
Coalesces train_step and val_step outputs; all tensors are concatenated along dimension 0.
- If the input is a torch.Tensor of dimension batch_size x y .., all_outputs will be List[torch.Tensor of dimension total_samples_till_now x y]
- If the input is a torch.Tensor of dimension 1x1, all_outputs will be List[torch.Tensor of dimension total_samples_till_now x 1]
- If the input is List[torch.Tensor], all_outputs will be List[torch.Tensor] with all tensors concatenated along dimension 0
Arguments:
- `outputs` _Union[torch.Tensor, Iterable[torch.Tensor]]_ - train_step or val_step outputs
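The behavior described above can be sketched as follows; this is a simplified stand-in for the real collector, covering the tensor and list-of-tensors cases:

```python
import torch

class OutputCollector:
    """Sketch: accumulate step outputs, concatenating along dimension 0."""

    def __init__(self):
        self.all_outputs = []

    def collect(self, outputs):
        # Normalize a single tensor to a one-element list.
        if isinstance(outputs, torch.Tensor):
            outputs = [outputs]
        for i, t in enumerate(outputs):
            t = t.detach()
            if t.dim() == 0:
                # Scalars become shape (1,) so they can be concatenated.
                t = t.reshape(1)
            if i >= len(self.all_outputs):
                self.all_outputs.append(t)
            else:
                self.all_outputs[i] = torch.cat([self.all_outputs[i], t], dim=0)

collector = OutputCollector()
collector.collect(torch.zeros(4, 3))   # step 1: batch of 4
collector.collect(torch.ones(2, 3))    # step 2: batch of 2
print(collector.all_outputs[0].shape)  # torch.Size([6, 3])
```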
SingleProcess Objects#
Single Process Trainer Backend
__init__#
Single process trainer_backend
process_global_step#
Clip gradients and step the optimizer and scheduler.
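A global step of this shape typically looks like the sketch below (the exact clipping threshold and scheduler are assumptions for illustration):

```python
import torch

# Toy model/optimizer/scheduler to demonstrate a global optimization step.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)
max_grad_norm = 1.0  # assumed clipping threshold

def process_global_step():
    # Clip gradients, then step optimizer and scheduler, then reset grads.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
process_global_step()
print(scheduler.get_last_lr())  # [0.05]
```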
get_state#
Get the current state of the trainer_backend, used for checkpointing.
Returns:
- `state_dict` _dict_ - Dictionary of variables or objects to checkpoint.
update_state#
Update the trainer_backend from a checkpointed state.
Arguments:
- `state` _dict_ - Output of `get_state()` during checkpointing
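A minimal round-trip sketch of this checkpointing contract; the exact fields stored by the real backend are an assumption here:

```python
# Hypothetical backend storing only its completed global step.
class SingleProcess:
    def __init__(self):
        self.global_step = 0

    def get_state(self):
        # Anything needed to resume training goes in this dict.
        return {"global_step_completed": self.global_step}

    def update_state(self, state):
        # Restore from a previously checkpointed state dict.
        if state:
            self.global_step = state["global_step_completed"]

backend = SingleProcess()
backend.global_step = 42
restored = SingleProcess()
restored.update_state(backend.get_state())
print(restored.global_step)  # 42
```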
SingleProcessDpSgd Objects#
Backend that supports differential privacy using the Opacus library (https://opacus.ai/api/privacy_engine.html).
SingleProcessAmp Objects#
SingleProcess + Native PyTorch AMP Trainer Backend
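A hedged sketch of what one native-AMP training step looks like with `autocast` and `GradScaler`; the model and data are toy stand-ins, and AMP is disabled automatically on CPU-only machines so the code still runs:

```python
import torch

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(4, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 4, device=device)
with torch.autocast(device_type=device, enabled=use_amp):
    # Forward pass runs in reduced precision when AMP is enabled.
    loss = model(x).pow(2).mean()

scaler.scale(loss).backward()  # scale loss to avoid fp16 gradient underflow
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjust the scale factor for the next step
optimizer.zero_grad()
print(bool(torch.isfinite(loss)))  # True
```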
SingleProcessApexAmp Objects#
SingleProcess + Apex AMP Trainer Backend
AbstractTrainerBackendDecorator Objects#
Abstract class implementing the decorator design pattern.
DDPTrainerBackend Objects#
Distributed Data Parallel TrainerBackend.
Wraps ModuleInterface model with DistributedDataParallel which handles gradient averaging across processes.
.. note: Assumes initialized model parameters are consistent across processes, e.g. by using the same random seed in each process at the point of model initialization.
setup_distributed_env#
Setup the process group for distributed training.
cleanup#
Destroy the process group used for distributed training.
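Setup and teardown of the process group can be sketched as below. A `file://` rendezvous is used so the example runs as a single process; real DDP launchers typically set `MASTER_ADDR`/`MASTER_PORT` and use an `env://` or `tcp://` init method, and the exact signatures here are illustrative:

```python
import os
import tempfile
import torch.distributed as dist

def setup_distributed_env(rank, world_size, init_method):
    # "gloo" works on CPU; multi-GPU training would typically use "nccl".
    dist.init_process_group(
        backend="gloo",
        init_method=init_method,
        rank=rank,
        world_size=world_size,
    )

def cleanup():
    # Destroy the process group used for distributed training.
    dist.destroy_process_group()

rendezvous = os.path.join(tempfile.mkdtemp(), "pg_init")
setup_distributed_env(rank=0, world_size=1, init_method=f"file://{rendezvous}")
print(dist.get_world_size())  # 1
cleanup()
```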
gather_tensors_on_cpu#
Gather tensors and move to cpu at configurable frequency.
Move tensor to CUDA device, apply all-gather and move back to CPU.
If distributed_training_args.gather_frequency is set, tensors are
moved to CUDA in chunks of that size.
Arguments:
- `x` _torch.Tensor_ - Tensor to be gathered.
Returns:
Gathered tensor on the CPU.
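The chunked gather can be sketched as follows. This is a simplification: tensors stay on CPU so the example runs without a GPU (in the GPU case each chunk would be moved to CUDA before `all_gather` and back afterwards), and with more than one rank the chunk-wise concatenation interleaves rank outputs:

```python
import os
import tempfile
import torch
import torch.distributed as dist

def gather_tensors_on_cpu(x, gather_frequency=None):
    """Gather x from all ranks in chunks of gather_frequency rows."""
    world_size = dist.get_world_size()
    chunk = gather_frequency or x.shape[0]
    gathered = []
    for start in range(0, x.shape[0], chunk):
        part = x[start:start + chunk].contiguous()
        # all_gather fills one buffer per rank with that rank's chunk.
        bufs = [torch.empty_like(part) for _ in range(world_size)]
        dist.all_gather(bufs, part)
        gathered.append(torch.cat(bufs, dim=0).cpu())
    return torch.cat(gathered, dim=0)

# Single-process demo using a file:// rendezvous.
rendezvous = os.path.join(tempfile.mkdtemp(), "pg_init")
dist.init_process_group("gloo", init_method=f"file://{rendezvous}",
                        rank=0, world_size=1)
out = gather_tensors_on_cpu(torch.arange(10.0), gather_frequency=4)
print(out.shape)  # torch.Size([10])
dist.destroy_process_group()
```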
DPDDPTrainerBackend Objects#
Distributed Data Parallel TrainerBackend with Differential Privacy.
Wraps ModuleInterface model with DifferentiallyPrivateDistributedDataParallel which handles gradient averaging across processes, along with virtual stepping.
.. note: Assumes initialized model parameters are consistent across processes, e.g. by using the same random seed in each process at the point of model initialization.