Model Benchmarks

PyTorch Model Benchmarks



Run training or inference tasks with single or half precision for deep learning models, including the following categories:

  • GPT: gpt2-small, gpt2-medium, gpt2-large and gpt2-xl
  • LLAMA: llama2-7b, llama2-13b, llama2-70b
  • BERT: bert-base and bert-large
  • LSTM
  • CNN, listed in torchvision.models, including:
    • resnet: resnet18, resnet34, resnet50, resnet101, resnet152
    • resnext: resnext50_32x4d, resnext101_32x8d
    • wide_resnet: wide_resnet50_2, wide_resnet101_2
    • densenet: densenet121, densenet169, densenet201, densenet161
    • vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
    • mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
    • mobilenet: mobilenet_v2
    • shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
    • squeezenet: squeezenet1_0, squeezenet1_1
    • others: alexnet, googlenet, inception_v3

For inference, supported percentiles include 50th, 90th, 95th, 99th, and 99.9th.
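As a rough illustration of what a percentile step time means, the sketch below computes the listed percentiles from a list of per-step latencies using the nearest-rank method. The choice of nearest-rank, the `percentile` helper, and the sample data are assumptions for illustration; the benchmark may compute percentiles differently (for example with interpolation).

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Multiply before dividing to limit floating-point rounding in the rank.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical inference step times in milliseconds (one slow outlier).
step_times_ms = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 100.0]

# The percentiles listed in the text above.
summary = {p: percentile(step_times_ms, p) for p in (50, 90, 95, 99, 99.9)}
for p, value in summary.items():
    print(f"p{p}: {value} ms")
```

With only ten samples, the tail percentiles (95th and above) all resolve to the single outlier, which is why tail latency is usually measured over many steps.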

New: Support fp8_hybrid and fp8_e4m3 precision for BERT models.


| Name | Unit | Description |
|------|------|-------------|
| model-benchmarks/pytorch-${model_name}/${precision}_train_step_time | time (ms) | The average training step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_train_throughput | throughput (samples/s) | The average training throughput with fp32/fp16 precision per GPU. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time | time (ms) | The average inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput | throughput (samples/s) | The average inference throughput with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time_${percentile} | time (ms) | The nth percentile inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput_${percentile} | throughput (samples/s) | The nth percentile inference throughput with fp32/fp16 precision. |
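The metric names above follow a `${placeholder}` pattern that happens to match the syntax of Python's `string.Template`, so they can be expanded directly when looking up a result. The sketch below is only an illustration: the helper name `metric_name` and the concrete placeholder values (`resnet50`, `fp16`, `99`) are assumptions, and the exact strings the benchmark emits for `${precision}` and `${percentile}` should be checked against real output.

```python
from string import Template

# Name patterns copied from the metric list above.
TRAIN_STEP_TIME = Template(
    "model-benchmarks/pytorch-${model_name}/${precision}_train_step_time"
)
INFERENCE_STEP_TIME_PCT = Template(
    "model-benchmarks/pytorch-${model_name}/${precision}"
    "_inference_step_time_${percentile}"
)


def metric_name(template, **fields):
    """Fill in every placeholder; raises KeyError if one is missing."""
    return template.substitute(**fields)


# Placeholder values below are illustrative assumptions.
train_name = metric_name(TRAIN_STEP_TIME, model_name="resnet50", precision="fp16")
pct_name = metric_name(
    INFERENCE_STEP_TIME_PCT, model_name="resnet50", precision="fp16", percentile="99"
)
print(train_name)
print(pct_name)
```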

Megatron Model Benchmarks



Run GPT pretraining tasks in float32, float16, or bfloat16 precision with Megatron-LM or Megatron-DeepSpeed.

Tip: batch_size in this benchmark is the global batch size; the batch size on each GPU instance is micro_batch_size.
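To make the relationship between the two batch sizes concrete, the sketch below derives how many micro-batches each data-parallel replica would process per training step. It assumes the usual Megatron-style relation, global batch size = micro batch size × data-parallel size × number of micro-batches; the `data_parallel_size` parameter and the function name are not part of the text above and are introduced here for illustration only.

```python
def num_micro_batches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each data-parallel replica processes per training step,
    assuming global = micro * data_parallel * num_micro_batches."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass


# Illustrative values: batch_size=2048 (global), micro_batch_size=2,
# 8 data-parallel replicas.
steps = num_micro_batches(
    global_batch_size=2048, micro_batch_size=2, data_parallel_size=8
)
print(steps)
```

Under this assumption, a global batch size that is not a multiple of `micro_batch_size * data_parallel_size` is rejected rather than silently rounded.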


| Name | Unit | Description |
|------|------|-------------|
| megatron-gpt/${precision}_train_step_time | time (ms) | The average training step time per iteration. |
| megatron-gpt/${precision}_train_throughput | throughput (samples/s) | The average training throughput per iteration. |
| megatron-gpt/${precision}_train_tflops | tflops/s | The average training tflops per second per iteration. |
| megatron-gpt/${precision}_train_mem_allocated | GB | The average GPU memory allocated per iteration. |
| megatron-gpt/${precision}_train_max_mem_allocated | GB | The average maximum GPU memory allocated per iteration. |
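When post-processing results, it can help to split a `megatron-gpt/...` metric name back into its precision and metric parts. The sketch below does this with a regular expression built from the naming pattern above. The `parse_megatron_metric` helper, the precision label `bf16`, and the sample values are hypothetical; the pattern also assumes the precision label contains no underscore.

```python
import re

# megatron-gpt/${precision}_train_<metric>, per the naming pattern above.
# Assumes the precision label itself contains no "_" or "/".
MEGATRON_METRIC = re.compile(
    r"^megatron-gpt/(?P<precision>[^_/]+)_(?P<metric>train_.+)$"
)


def parse_megatron_metric(name):
    """Return (precision, metric), or None if the name does not match."""
    match = MEGATRON_METRIC.match(name)
    if match is None:
        return None
    return match.group("precision"), match.group("metric")


# Hypothetical results; names and numbers are for illustration only.
results = {
    "megatron-gpt/bf16_train_step_time": 512.3,
    "megatron-gpt/bf16_train_max_mem_allocated": 61.7,
    "model-benchmarks/pytorch-resnet50/fp16_train_step_time": 45.1,
}

# Keep only the Megatron metrics, keyed by (precision, metric).
parsed = {}
for name, value in results.items():
    key = parse_megatron_metric(name)
    if key is not None:
        parsed[key] = value
print(parsed)
```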