# Model Benchmarks

## PyTorch Model Benchmarks

### model-benchmarks

#### Introduction
Run training or inference tasks with single or half precision for deep learning models, including the following categories:
- GPT: gpt2-small, gpt2-medium, gpt2-large and gpt2-xl
- LLAMA: llama2-7b, llama2-13b, llama2-70b
- MoE: mixtral-8x7b, mixtral-8x22b
- BERT: bert-base and bert-large
- LSTM
- CNN, as listed in `torchvision.models`, including:
  - resnet: resnet18, resnet34, resnet50, resnet101, resnet152
  - resnext: resnext50_32x4d, resnext101_32x8d
  - wide_resnet: wide_resnet50_2, wide_resnet101_2
  - densenet: densenet121, densenet169, densenet201, densenet161
  - vgg: vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
  - mnasnet: mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3
  - mobilenet: mobilenet_v2
  - shufflenet: shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0
  - squeezenet: squeezenet1_0, squeezenet1_1
  - others: alexnet, googlenet, inception_v3
For inference, supported percentiles include 50th, 90th, 95th, 99th, and 99.9th.
New: `fp8_hybrid` and `fp8_e4m3` precisions are supported for BERT models.
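The models above are selected through the standard SuperBench YAML configuration. A minimal sketch, assuming the usual `superbench.benchmarks` layout; the model list and parameter values below are illustrative, not recommendations:

```yaml
# Illustrative model-benchmarks config entry (values are examples only).
superbench:
  benchmarks:
    model-benchmarks:
      models:
        - resnet50
        - bert-base
      parameters:
        duration: 0        # run a fixed number of steps rather than a time budget
        num_warmup: 16
        num_steps: 128
        batch_size: 32
        precision:
          - float32
          - float16
```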
#### Deterministic Training Support

New: SuperBench now supports deterministic training to ensure reproducibility across runs, using fixed seeds and deterministic algorithms. To enable deterministic training, use the following flags:
Flags:

- `--enable_determinism`: Enables deterministic computation for reproducible results.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility (default: 42).
- `--check_frequency <steps>`: How often to record deterministic metrics (default: 100).
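In a config-driven run, these flags can also be expressed as benchmark parameters. A sketch, assuming the flags map one-to-one onto YAML parameter keys (this mapping is an assumption, not confirmed by the source):

```yaml
# Hypothetical mapping of the determinism flags onto benchmark parameters.
superbench:
  benchmarks:
    model-benchmarks:
      models:
        - bert-base
      parameters:
        enable_determinism: true
        deterministic_seed: 42
        check_frequency: 100
```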
Environment Variables (set automatically by SuperBench when `--enable_determinism` is used):

- `CUBLAS_WORKSPACE_CONFIG=:4096:8`: Ensures deterministic behavior in cuBLAS. This can be overridden by setting it manually before running SuperBench.
#### Comparing Deterministic Results
To compare deterministic results between runs, use the standard result analysis workflow:
1. Run the benchmark with the `--enable_determinism` flag.
2. Generate a baseline: `sb result generate-baseline --data-file results.jsonl --summary-rule-file rules.yaml`
3. Compare future runs: `sb result diagnosis --data-file new-results.jsonl --rule-file diagnosis-rule.yaml --baseline-file baseline.json`
This allows configurable tolerance for floating-point differences via YAML rules.
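For example, the exact-match criterion can be loosened to a small tolerance in a diagnosis rule. The rule name and threshold below are illustrative assumptions, not values from the source:

```yaml
# Hypothetical rule allowing tiny floating-point drift instead of exact equality.
superbench:
  rules:
    deterministic_loss_tolerance:           # illustrative rule name
      function: variance
      criteria: "lambda x: abs(x) > 1e-6"   # tolerance threshold is an example
      categories: SDC-Fingerprint
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
```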
#### Configuration Parameter Validation
When determinism is enabled, benchmark configuration parameters (`batch_size`, `num_steps`, `deterministic_seed`, etc.) are automatically recorded in the results file as `deterministic_config_*` metrics. The diagnosis rules enforce exact matching of these parameters between runs to ensure valid comparisons:
If any configuration parameter differs between runs, the diagnosis will flag it as a failure, ensuring you only compare runs with identical configurations.
#### Summary Rule Snippet for Determinism
Include the following rule in your summary rule file (used with `sb result summary` or `sb result generate-baseline --summary-rule-file`) to surface deterministic metrics in the results summary:
```yaml
superbench:
  rules:
    model-benchmarks-deterministic:
      statistics:
        - mean
      categories: Deterministic
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*
        - model-benchmarks:.*/deterministic_step.*
        - model-benchmarks:.*/deterministic_config_.*
        - model-benchmarks:.*/return_code.*
```

This groups all deterministic outputs (loss fingerprints, activation means, check counts, step numbers, configuration parameters, and return codes) under the Deterministic category.
#### Diagnosis Rule Snippet for Determinism
Include the following rules in your diagnosis rule file (used with `sb result diagnosis` or `sb result generate-baseline --diagnosis-rule-file`) to detect Silent Data Corruption (SDC) and validate configuration consistency:
```yaml
superbench:
  rules:
    deterministic_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Fingerprint
      metrics:
        - model-benchmarks:.*/deterministic_loss.*
        - model-benchmarks:.*/deterministic_act_mean.*
        - model-benchmarks:.*/deterministic_check_count.*
    deterministic_config_rule:
      function: variance
      criteria: "lambda x: x != 0"
      categories: SDC-Config
      metrics:
        - model-benchmarks:.*/deterministic_config_batch_size.*
        - model-benchmarks:.*/deterministic_config_num_steps.*
        - model-benchmarks:.*/deterministic_config_num_warmup.*
        - model-benchmarks:.*/deterministic_config_deterministic_seed.*
        - model-benchmarks:.*/deterministic_config_check_frequency.*
        - model-benchmarks:.*/deterministic_config_seq_len.*
        - model-benchmarks:.*/deterministic_config_hidden_size.*
        - model-benchmarks:.*/deterministic_config_num_classes.*
        - model-benchmarks:.*/deterministic_config_input_size.*
        - model-benchmarks:.*/deterministic_config_num_layers.*
        - model-benchmarks:.*/deterministic_config_num_hidden_layers.*
        - model-benchmarks:.*/deterministic_config_num_attention_heads.*
        - model-benchmarks:.*/deterministic_config_intermediate_size.*
    deterministic_failure_rule:
      function: failure_check
      criteria: "lambda x: x != 0"
      categories: SDC-Failed
      metrics:
        - model-benchmarks:.*/return_code
```

- SDC-Fingerprint (`deterministic_rule`): Flags any node where loss, activation mean, or check count shows any variance from the baseline (`x != 0`), indicating a potential SDC issue.
- SDC-Config (`deterministic_config_rule`): Ensures all determinism configuration parameters (seed, batch size, sequence length, hidden size, etc.) are identical across nodes; any mismatch means the comparison is invalid.
- SDC-Failed (`deterministic_failure_rule`): Uses `failure_check` to catch nodes where the determinism benchmark failed to run or returned a non-zero exit code.
For complete rule files covering all benchmark categories (micro-benchmarks, NCCL, GPU copy bandwidth, NVBandwidth, etc.), refer to the rule file documentation in Result Summary and Data Diagnosis.
#### Metrics
| Name | Unit | Description |
|---|---|---|
| model-benchmarks/pytorch-${model_name}/${precision}_train_step_time | time (ms) | The average training step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_train_throughput | throughput (samples/s) | The average training throughput with fp32/fp16 precision per GPU. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time | time (ms) | The average inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput | throughput (samples/s) | The average inference throughput with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_step_time_${percentile} | time (ms) | The nth percentile inference step time with fp32/fp16 precision. |
| model-benchmarks/pytorch-${model_name}/${precision}_inference_throughput_${percentile} | throughput (samples/s) | The nth percentile inference throughput with fp32/fp16 precision. |
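The metric names above follow a `model-benchmarks/pytorch-${model_name}/${precision}_...` pattern, so they can be matched programmatically when post-processing results. A minimal stdlib-only sketch, assuming one JSON object per line in the results file with metric names as keys (the exact record schema is an assumption):

```python
import json
import re

# One line of a hypothetical results.jsonl record (values are illustrative).
line = ('{"model-benchmarks/pytorch-resnet50/fp32_train_throughput": 1510.2,'
        ' "model-benchmarks/pytorch-resnet50/fp32_train_step_time": 21.2}')
record = json.loads(line)

# Match the (model, precision) pair out of each training-throughput metric name.
pattern = re.compile(
    r"model-benchmarks/pytorch-(?P<model>[^/]+)/(?P<precision>\w+?)_train_throughput$"
)

throughputs = {}
for name, value in record.items():
    m = pattern.match(name)
    if m:
        throughputs[(m.group("model"), m.group("precision"))] = value

print(throughputs)  # {('resnet50', 'fp32'): 1510.2}
```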
## Megatron Model Benchmarks

### megatron-gpt

#### Introduction
Run GPT pretraining tasks with float32, float16, or bfloat16 precision using Megatron-LM or Megatron-DeepSpeed.

Tip: `batch_size` in this benchmark is the global batch size; the batch size on each GPU instance is `micro_batch_size`.
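The two batch sizes are related by the usual Megatron-LM convention, sketched below with illustrative values (the specific numbers are examples, not defaults):

```python
# Megatron-LM batch-size relationship:
# global batch size = micro batch size * data-parallel size * gradient-accumulation steps.
micro_batch_size = 2      # per-GPU batch size (illustrative)
data_parallel_size = 8    # number of data-parallel replicas (illustrative)
grad_accum_steps = 4      # gradient-accumulation steps (illustrative)

global_batch_size = micro_batch_size * data_parallel_size * grad_accum_steps
print(global_batch_size)  # 64
```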
#### Metrics
| Name | Unit | Description |
|---|---|---|
| megatron-gpt/${precision}_train_step_time | time (ms) | The average training step time per iteration. |
| megatron-gpt/${precision}_train_throughput | throughput (samples/s) | The average training throughput per iteration. |
| megatron-gpt/${precision}_train_tflops | throughput (TFLOPS) | The average training throughput in TFLOPS per iteration. |
| megatron-gpt/${precision}_train_mem_allocated | GB | The average GPU memory allocated per iteration. |
| megatron-gpt/${precision}_train_max_mem_allocated | GB | The average maximum GPU memory allocated per iteration. |