

PyTorch Model Benchmarks#

model-benchmarks#

Introduction#

Run training or inference tasks in single or half precision for deep learning models in the following categories:

  • GPT: gpt2-small, gpt2-medium, gpt2-large and gpt2-xl
  • LLAMA: llama2-7b, llama2-13b, llama2-70b
  • MoE: mixtral-8x7b, mixtral-8x22b
  • BERT: bert-base and bert-large
  • LSTM
  • CNN, listed in torchvision.models, including:
    • resnet: resnet18, resnet34, resnet50, resnet101, resnet152
    • resnext: resnext50_32x4d, resnext101_32x8d
    • wide_resnet: wide_resnet50_2, wide_resnet101_2
    • densenet: densenet121, densenet169, densenet201, densenet161
    • vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
    • mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
    • mobilenet: mobilenet_v2
    • shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
    • squeezenet: squeezenet1_0, squeezenet1_1
    • others: alexnet, googlenet, inception_v3

For inference, the supported percentiles are 50th, 90th, 95th, 99th, and 99.9th.
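As an illustration, a tail-latency percentile over recorded step times can be computed with the nearest-rank method (a sketch only; not necessarily the exact estimator SuperBench uses):

```python
def percentile(step_times, p):
    """Nearest-rank percentile of a list of step times (illustrative sketch)."""
    s = sorted(step_times)
    # nearest-rank index for the p-th percentile, clamped to a valid range
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

latencies_ms = list(range(1, 101))   # 100 synthetic step times in ms
p50 = percentile(latencies_ms, 50)   # median
p999 = percentile(latencies_ms, 99.9)  # 99.9th-percentile tail latency
```

With 100 samples the 99.9th percentile resolves to the slowest observed step, which is why tail percentiles need many iterations to be meaningful.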

New: fp8_hybrid and fp8_e4m3 precisions are now supported for BERT models.

New: Deterministic training support. SuperBench now supports deterministic training to ensure reproducibility across runs, using fixed seeds and deterministic algorithms. To enable deterministic training, use the following flags:

  • Flags:
    • --enable_determinism: Enables deterministic computation for reproducible results.
    • --deterministic_seed <seed>: Sets the seed for reproducibility (default: 42).
    • --check_frequency <steps>: How often to record deterministic metrics (default: 100).
  • Environment variables (set automatically by SuperBench when --enable_determinism is used):
    • CUBLAS_WORKSPACE_CONFIG=:4096:8: Ensures deterministic behavior in cuBLAS. This can be overridden by setting it manually before running SuperBench.
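In a PyTorch script, the combined effect of these settings can be sketched as follows (a simplified illustration, not SuperBench's actual implementation; `enable_determinism` is a hypothetical helper):

```python
import os
import random

def enable_determinism(seed: int = 42) -> None:
    """Sketch of what --enable_determinism arranges (assumed, simplified)."""
    # SuperBench sets this automatically; setdefault() lets a manually
    # exported value take precedence, as described above.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)  # seed the Python-level RNG
    # A real PyTorch run would additionally call (real torch APIs):
    #   torch.manual_seed(seed)
    #   torch.cuda.manual_seed_all(seed)
    #   torch.use_deterministic_algorithms(True)

enable_determinism(seed=42)
```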

Comparing Deterministic Results

To compare deterministic results between runs, use the standard result analysis workflow:

  1. Run the benchmark with the `--enable_determinism` flag.
  2. Generate a baseline: `sb result generate-baseline --data-file results.jsonl --summary-rule-file rules.yaml`
  3. Compare future runs: `sb result diagnosis --data-file new-results.jsonl --rule-file diagnosis-rule.yaml --baseline-file baseline.json`

This allows configurable tolerance for floating-point differences via YAML rules.
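Because each diagnosis `criteria` field is a Python lambda expressed as a string, a tolerance can be introduced by relaxing the comparison applied to a metric's deviation from baseline (the `tolerant` variant below is hypothetical, not a shipped rule):

```python
# criteria string as used in the deterministic_rule snippet in this document
strict = eval("lambda x: x != 0")
# hypothetical relaxed criteria that tolerates tiny floating-point drift
tolerant = eval("lambda x: abs(x) > 1e-6")

drift = 1e-9                     # tiny difference between run and baseline
strict_flags = strict(drift)     # strict rule flags any nonzero deviation
tolerant_flags = tolerant(drift) # relaxed rule treats it as within tolerance
```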

Configuration Parameter Validation

When determinism is enabled, benchmark configuration parameters (batch_size, num_steps, deterministic_seed, etc.) are automatically recorded in the results file as `deterministic_config_*` metrics. The diagnosis rules enforce exact matching of these parameters between runs to ensure valid comparisons:

If any configuration parameter differs between runs, the diagnosis will flag it as a failure, ensuring you only compare runs with identical configurations.
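The exact-match check can be pictured like this (illustrative logic only; `configs_match` is not a SuperBench API, and the metric names follow the `deterministic_config_*` convention described above):

```python
def configs_match(run_a: dict, run_b: dict) -> bool:
    """Return True only if all deterministic_config_* metrics are identical."""
    keys = {k for k in list(run_a) + list(run_b)
            if k.startswith("deterministic_config_")}
    return all(run_a.get(k) == run_b.get(k) for k in keys)

baseline = {"deterministic_config_batch_size": 32,
            "deterministic_config_deterministic_seed": 42}
same = dict(baseline)                                        # identical config
changed = dict(baseline, deterministic_config_batch_size=64)  # mismatched config
```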

Summary Rule Snippet for Determinism

Include the following rule in your summary rule file (used with `sb result summary` or `sb result generate-baseline --summary-rule-file`) to surface deterministic metrics in the results summary:

```yaml
superbench:
  rules:
    model-benchmarks-deterministic:
      statistics:
        - mean
      categories: Deterministic
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*
        - model-benchmarks:.*/deterministic_step.*
        - model-benchmarks:.*/deterministic_config_.*
        - model-benchmarks:.*/return_code.*
```

This groups all deterministic outputs (loss fingerprints, activation means, check counts, step numbers, configuration parameters, and return codes) under the Deterministic category.

Diagnosis Rule Snippet for Determinism

Include the following rules in your diagnosis rule file (used with `sb result diagnosis` or `sb result generate-baseline --diagnosis-rule-file`) to detect Silent Data Corruption (SDC) and validate configuration consistency:

```yaml
superbench:
  rules:
    deterministic_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Fingerprint
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*
    deterministic_config_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Config
      metrics:
        - model-benchmarks:.*/deterministic_config_batch_size.*
        - model-benchmarks:.*/deterministic_config_num_steps.*
        - model-benchmarks:.*/deterministic_config_num_warmup.*
        - model-benchmarks:.*/deterministic_config_deterministic_seed.*
        - model-benchmarks:.*/deterministic_config_check_frequency.*
        - model-benchmarks:.*/deterministic_config_seq_len.*
        - model-benchmarks:.*/deterministic_config_hidden_size.*
        - model-benchmarks:.*/deterministic_config_num_classes.*
        - model-benchmarks:.*/deterministic_config_input_size.*
        - model-benchmarks:.*/deterministic_config_num_layers.*
        - model-benchmarks:.*/deterministic_config_num_hidden_layers.*
        - model-benchmarks:.*/deterministic_config_num_attention_heads.*
        - model-benchmarks:.*/deterministic_config_intermediate_size.*
    deterministic_failure_rule:
      function: failure_check
      criteria: "lambda x: x != 0"
      categories: SDC-Failed
      metrics:
        - model-benchmarks:.*/return_code
```
  • SDC-Fingerprint (deterministic_rule): Flags any node where loss, activation mean, or check count has any variance from baseline (x != 0), indicating a potential SDC issue.
  • SDC-Config (deterministic_config_rule): Ensures all determinism configuration parameters (seed, batch size, sequence length, hidden size, etc.) are identical across nodes; any mismatch means the comparison is invalid.
  • SDC-Failed (deterministic_failure_rule): Uses failure_check to catch nodes where the determinism benchmark failed to run or returned a non-zero exit code.

For complete rule files covering all benchmark categories (micro-benchmarks, NCCL, GPU copy bandwidth, NVBandwidth, etc.), refer to the rule file documentation in Result Summary and Data Diagnosis.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| model-benchmarks/pytorch-${model_name}/${precision}_train_step_time | time (ms) | The average training step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_train_throughput | throughput (samples/s) | The average training throughput per GPU with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time | time (ms) | The average inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput | throughput (samples/s) | The average inference throughput with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time_${percentile} | time (ms) | The ${percentile} percentile inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput_${percentile} | throughput (samples/s) | The ${percentile} percentile inference throughput with fp32/fp16 precision. |
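For intuition, the step-time and throughput metrics are related by the per-step batch size (an assumed relation for illustration; the batch size value below is an example, not a benchmark default):

```python
def throughput_samples_per_s(batch_size: int, step_time_ms: float) -> float:
    """Samples per second implied by a batch size and an average step time."""
    # step_time_ms is in milliseconds, so convert to seconds
    return batch_size * 1000.0 / step_time_ms

# Example: 32 samples processed every 100 ms
tp = throughput_samples_per_s(32, 100.0)
```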

Megatron Model Benchmarks#

megatron-gpt#

Introduction#

Run GPT pretraining tasks in float32, float16, or bfloat16 precision with Megatron-LM or Megatron-DeepSpeed.

Tip: `batch_size` in this benchmark is the global batch size; the batch size on each GPU instance is `micro_batch_size`.
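The standard Megatron-LM relation between the two can be sketched as follows (gradient accumulation included; the numbers are example values):

```python
def global_batch_size(micro_batch_size: int,
                      data_parallel_size: int,
                      grad_accum_steps: int) -> int:
    """Global batch = micro batch x data-parallel ranks x accumulation steps."""
    return micro_batch_size * data_parallel_size * grad_accum_steps

# Example: micro batch 4 on each of 8 data-parallel ranks,
# with 2 gradient-accumulation steps per optimizer update
gbs = global_batch_size(4, 8, 2)
```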

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| megatron-gpt/${precision}_train_step_time | time (ms) | The average training step time per iteration. |
| megatron-gpt/${precision}_train_throughput | throughput (samples/s) | The average training throughput per iteration. |
| megatron-gpt/${precision}_train_tflops | tflops/s | The average training TFLOPS per iteration. |
| megatron-gpt/${precision}_train_mem_allocated | GB | The average GPU memory allocated per iteration. |
| megatron-gpt/${precision}_train_max_mem_allocated | GB | The average maximum GPU memory allocated per iteration. |