Micro Benchmarks

Computation Benchmarks#

kernel-launch#

Introduction#

Measure GPU kernel launch latency, defined as the time from the beginning of the launch API call to the beginning of kernel execution.
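
The two metrics below can be approximated with a simple PyTorch loop; this is only an illustrative sketch to build intuition (the benchmark's own CUDA implementation may differ), and the near-empty kernel used here is an assumption:

```python
import time

import torch

# Illustrative only: approximate per-launch overhead with a near-empty kernel.
# wall_time  ~ CPU time spent per asynchronous launch call.
# event_time ~ GPU time per kernel as seen by CUDA events.
x = torch.zeros(1, device='cuda')
x.add_(0)                      # warm-up launch
torch.cuda.synchronize()

iters = 2000
start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

t0 = time.perf_counter()
start_evt.record()
for _ in range(iters):
    x.add_(0)                  # near-empty kernel; launch overhead dominates
end_evt.record()
torch.cuda.synchronize()

wall_ms = (time.perf_counter() - t0) * 1e3 / iters
event_ms = start_evt.elapsed_time(end_evt) / iters
print(f'wall_time ~ {wall_ms:.4f} ms, event_time ~ {event_ms:.4f} ms')
```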

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| kernel-launch/event_time | time (ms) | Launch latency measured in GPU time. |
| kernel-launch/wall_time | time (ms) | Launch latency measured in CPU time. |

gemm-flops#

Introduction#

Measure the GPU GEMM FLOPS for different float and int data types, with or without Tensor Core (XDLOPS), performed by NVIDIA cutlass or AMD rocblas-bench.
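As a rough cross-check of what these metrics mean, peak FLOPS can be estimated from a timed GEMM using the 2·m·n·k operation count. The sketch below uses torch.matmul rather than the cutlass/rocblas-bench kernels the benchmark actually runs, so treat it as an approximation only:

```python
import time

import torch

# Estimate achieved fp16 GEMM TFLOPS from a timed torch.matmul.
m = n = k = 8192
a = torch.randn(m, k, device='cuda', dtype=torch.float16)
b = torch.randn(k, n, device='cuda', dtype=torch.float16)
a @ b                                   # warm-up
torch.cuda.synchronize()

iters = 50
t0 = time.perf_counter()
for _ in range(iters):
    a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

# A GEMM performs 2*m*n*k floating-point operations.
tflops = 2 * m * n * k * iters / elapsed / 1e12
print(f'fp16 GEMM ~ {tflops:.1f} TFLOPS')
```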

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gemm-flops/fp64_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS. |
| gemm-flops/fp32_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS. |
| gemm-flops/fp16_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS. |
| gemm-flops/fp64_tc_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/tf32_tc_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/fp16_tc_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/bf16_tc_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/int8_tc_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/int4_tc_iops | IOPS (GIOPS) | GEMM int4 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/fp32_xdlops_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with AMD XDLOPS. |
| gemm-flops/fp16_xdlops_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/bf16_xdlops_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/int8_xdlops_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with AMD XDLOPS. |

matmul#

Introduction#

Large-scale matmul operation using torch.matmul on a single GPU.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |

cublaslt-gemm / hipblaslt-gemm#

Introduction#

Measure GEMM performance using cublasLtMatmul or hipblaslt-bench.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
| hipblaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |

cublas-function#

Introduction#

Measure the performance of the most commonly used NVIDIA cuBLAS functions, with parameters taken from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuBLAS are as follows:

  • cublasSgemm
  • cublasSgemmStridedBatched
  • cublasGemmStridedBatchedEx
  • cublasGemmEx
  • cublasCgemm3mStridedBatched
  • cublasCgemm

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cuBLAS function with the given parameters. |
| cublas-function/name_${function_name}_${parameters}_correctness | | Whether the results of executing the cuBLAS function with the given parameters pass the correctness check, when the correctness check is enabled. |
| cublas-function/name_${function_name}_${parameters}_error | | The error ratio of the results of executing the cuBLAS function with the given parameters, when the correctness check is enabled. |

cudnn-function#

Introduction#

Measure the performance of the most commonly used NVIDIA cuDNN functions, with parameters taken from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuDNN are as follows:

  • cudnnConvolutionBackwardFilter
  • cudnnConvolutionBackwardData
  • cudnnConvolutionForward

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cuDNN function with the given parameters. |

tensorrt-inference#

Introduction#

Measure the inference performance of PyTorch/ONNX models on NVIDIA GPUs with TensorRT.

Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19, lstm, bert-base, bert-large, gpt2-small

Large models such as gpt2-large are not currently supported, because models larger than 2 GB (the maximum protobuf size) cannot be exported to a single ONNX file.
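
For context on how a torchvision model reaches TensorRT, a minimal export sketch is shown below; the exact export path and options used by the benchmark may differ, and the file name resnet50.onnx is just an example. The resulting ONNX file can then be fed to TensorRT (for instance via trtexec) to build and time an engine:

```python
import torch
import torchvision

# Export a torchvision model to ONNX so a TensorRT engine can be built from it.
model = torchvision.models.resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, 'resnet50.onnx',
    opset_version=13,
    input_names=['input'], output_names=['output'],
)
```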

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_mean | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
| tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The 99th percentile duration from when the H2D of a query is called to when the D2H of the same query is completed. |

ort-inference#

Introduction#

Measure the inference performance of torchvision models using ONNX Runtime. Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19

The supported percentiles are 50, 90, 95, 99, and 99.9.
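
A minimal latency probe with ONNX Runtime looks roughly like the sketch below; it assumes the resnet50.onnx file from the export example above, and the benchmark itself adds warm-up, precision handling, and its own percentile reporting:

```python
import time

import numpy as np
import onnxruntime as ort

# Time one batch of inference repeatedly and report mean and p99 latency.
sess = ort.InferenceSession(
    'resnet50.onnx',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
feed = {sess.get_inputs()[0].name: x}

latencies_ms = []
for _ in range(100):
    t0 = time.perf_counter()
    sess.run(None, feed)
    latencies_ms.append((time.perf_counter() - t0) * 1e3)

print(f'mean {np.mean(latencies_ms):.2f} ms, '
      f'p99 {np.percentile(latencies_ms, 99):.2f} ms')
```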

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ort-inference/{precision}_{model}_time | time (ms) | The mean latency to execute one batch of inference. |
| ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |

gpu-burn#

Introduction#

Multi-GPU CUDA stress test for GPU compute and memory utilization, performed by gpu-burn. Supports double-precision data types and the use of Tensor Cores.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gpu-burn/time | time (s) | The runtime of the gpu-burn test. |
| gpu-burn/gpu_[0-9]_pass | yes/no | The result of the gpu-burn test for each GPU (1: yes, 0: no). |
| gpu-burn/abort | yes/no | Whether the gpu-burn test aborted before returning GPU results (1: yes, 0: no). |

cpu-hpl#

Introduction#

HPL (High-Performance Linpack) evaluates compute bandwidth by solving dense linear systems in double-precision arithmetic. Performed by the High-Performance Linpack Benchmark for Distributed-Memory Computers.
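
The throughput metric follows the standard HPL convention of dividing the canonical LU operation count by the elapsed time. As a back-of-the-envelope check (the example problem size and runtime below are made up, not measurements):

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Canonical HPL rating: (2/3 * N^3 + 2 * N^2) operations over elapsed time."""
    return (2.0 / 3.0 * n ** 3 + 2.0 * n ** 2) / seconds / 1e9

# Hypothetical example: N = 100000 solved in 1800 s.
print(f'{hpl_gflops(100_000, 1800.0):.1f} GFLOPS')
```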

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-hpl/tests_pass | | HPL completed running and the correctness test has passed (1: pass, 0: fail). |
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth. |
| cpu-hpl/time | time (s) | Time elapsed during the HPL run. |

cpu-stream#

Introduction#

Measure memory bandwidth and computation rate for simple vector kernels, performed by the University of Virginia STREAM benchmark.
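
For intuition, the triad kernel and its bandwidth accounting look like the sketch below; this NumPy stand-in is single-threaded and only illustrative, while the actual benchmark is the multi-threaded C STREAM code:

```python
import time

import numpy as np

# Triad: a = b + s * c, which moves three 8-byte double arrays per element.
n = 20_000_000
b = np.random.rand(n)
c = np.random.rand(n)
s = 3.0

t0 = time.perf_counter()
a = b + s * c
elapsed = time.perf_counter() - t0

mb_per_s = 3 * 8 * n / elapsed / 1e6
print(f'triad ~ {mb_per_s:.0f} MB/s')
```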

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-stream/threads | | Number of threads used for the test, determined by core count. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_throughput | bandwidth (MB/s) | Memory throughput of the designated kernel operation. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_avg | time (s) | Average elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_min | time (s) | Minimum elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_max | time (s) | Maximum elapsed time over all iterations. |

Communication Benchmarks#

cpu-memory-bw-latency#

Introduction#

Measure the memory copy bandwidth and latency across different CPU NUMA nodes, performed by the Intel MLC tool.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw | bandwidth (MB/s) | Former NUMA to latter NUMA memory bandwidth. |
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_lat | time (ns) | Former NUMA to latter NUMA memory latency. |
| cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, full read. |
| cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |

mem-bw#

Introduction#

Measure the memory copy bandwidth across PCI-e and memory copy bandwidth between GPUs, performed by NVIDIA or AMD bandwidth test tool.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| mem-bw/h2d_bw | bandwidth (GB/s) | Host to device copy bandwidth. |
| mem-bw/d2h_bw | bandwidth (GB/s) | Device to host copy bandwidth. |
| mem-bw/d2d_bw | bandwidth (GB/s) | Device to device copy bandwidth. |

gpu-copy-bw#

Introduction#

Measure the memory copy bandwidth performed by GPU SM/DMA engine, including device-to-host, host-to-device, and device-to-device copies. For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in hipDeviceMallocUncached (previously hipDeviceMallocFinegrained) mode to maximize performance.
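
A rough host-to-device probe over the copy engine, in the spirit of the cpu_to_gpu*_by_dma metrics, might look like the following. The benchmark's own CUDA/HIP implementation, SM-copy kernels, and peer-access setup are more involved, so this is only a sketch with an assumed 256 MiB buffer:

```python
import time

import torch

# Unidirectional H2D copy bandwidth using pinned host memory.
size = 256 * 1024 ** 2                       # 256 MiB buffer
host = torch.empty(size, dtype=torch.uint8, pin_memory=True)
dev = torch.empty(size, dtype=torch.uint8, device='cuda')
dev.copy_(host, non_blocking=True)           # warm-up
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dev.copy_(host, non_blocking=True)
torch.cuda.synchronize()

gb_per_s = iters * size / (time.perf_counter() - t0) / 1e9
print(f'cpu_to_gpu ~ {gb_per_s:.1f} GB/s')
```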

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu_to_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_cpu_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_by_(sm|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_(read|write)_by_(sm|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu_and_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_cpu_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | Same as above, but generated by --dtoh --bidirectional. |
| gpu[0-9]+_and_gpu[0-9]+_by_(sm|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_gpu[0-9]+_(read|write)_by_(sm|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing all peer GPUs' memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu[0-9]+_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |

ib-loopback#

Introduction#

Measure the InfiniBand loopback verbs bandwidth, performed by OFED performance tests.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ib-loopback/ib_write_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback write bandwidth with the given message size. |
| ib-loopback/ib_read_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback read bandwidth with the given message size. |
| ib-loopback/ib_send_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback send bandwidth with the given message size. |

nccl-bw / rccl-bw#

Introduction#

Measure the performance of NCCL/RCCL operations under multi-node traffic patterns, performed by nccl-tests or rccl-tests. The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall. Both in-place and out-of-place measurements are supported.

The following traffic patterns are supported:

  • all-nodes: validate the NCCL/RCCL performance across all VM nodes simultaneously.
  • pair-wise: validate the NCCL/RCCL performance across VM pairs, covering all possible combinations in parallel.
  • k-batch: validate the NCCL/RCCL performance across VM groups with a specified batch scale.
  • topo-aware: validate the NCCL/RCCL performance across VM pairs with different distances/hops as a quick test.
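
The algbw and busbw metrics below are related by the usual nccl-tests scaling factors; the small helper below (illustrative only, with n denoting the number of ranks) makes that relationship explicit:

```python
def bus_bandwidth(operation: str, algbw_gbps: float, n: int) -> float:
    """Convert algorithm bandwidth to bus bandwidth using nccl-tests factors."""
    factor = {
        'allreduce': 2 * (n - 1) / n,
        'allgather': (n - 1) / n,
        'reducescatter': (n - 1) / n,
        'alltoall': (n - 1) / n,
        'broadcast': 1.0,
        'reduce': 1.0,
    }[operation]
    return algbw_gbps * factor

# Example: allreduce at 80 GB/s algbw on 8 ranks -> 140 GB/s busbw.
print(bus_bandwidth('allreduce', 80.0, 8))
```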

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| nccl-bw/${operation}_${msg_size}_time | time (us) | NCCL operation latency with the given message size. |
| nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with the given message size. |
| nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with the given message size. |
| rccl-bw/${operation}_${msg_size}_time | time (us) | RCCL operation latency with the given message size. |
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with the given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with the given message size. |

If MPI mode is enabled and a traffic pattern is specified, the metric pattern changes to nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time, where:

  • serial_index represents the index of the host group in the serial dimension of the pattern.
  • parallel_index represents the index of the host list in the parallel dimension of the pattern.

tcp-connectivity#

Introduction#

Test the TCP connectivity between the current node and the nodes in the hostfile, performed by tcping.
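
Conceptually, each probe is just a timed TCP connect. A bare-bones version is sketched below; the helper name, the default port 22, and the host name node-001 are all hypothetical:

```python
import socket
import time

def tcp_connect_ms(host: str, port: int = 22, timeout: float = 1.0):
    """Return the TCP connect latency in ms, or None if the connection fails."""
    t0 = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - t0) * 1e3
    except OSError:
        return None

# Example: probe a host several times and summarize.
samples = [t for t in (tcp_connect_ms('node-001') for _ in range(10)) if t is not None]
if samples:
    print(f'min {min(samples):.2f} ms, max {max(samples):.2f} ms, '
          f'avg {sum(samples) / len(samples):.2f} ms')
```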

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| tcp-connectivity/${hostname/ip}_successed_count | count | Number of successful TCP connections between the current node and the other node. |
| tcp-connectivity/${hostname/ip}_failed_count | count | Number of failed TCP connections between the current node and the other node. |
| tcp-connectivity/${hostname/ip}_success_rate | | Success rate (successful/total) of TCP connections between the current node and the other node. |
| tcp-connectivity/${hostname/ip}_time_min | time (ms) | Minimum latency of TCP connections between the current node and the other node. |
| tcp-connectivity/${hostname/ip}_time_max | time (ms) | Maximum latency of TCP connections between the current node and the other node. |
| tcp-connectivity/${hostname/ip}_time_avg | time (ms) | Average latency of TCP connections between the current node and the other node. |

gpcnet-network-test / gpcnet-network-load-test#

Introduction#

Distributed test that measures global network performance and congestion, performed by GPCNET.

gpcnet-network-test: full-system network tests using random ring, natural ring, alltoall, and allreduce patterns; requires at least 2 nodes.

gpcnet-network-load-test: selected full-system network tests run alongside four congestors to measure network congestion or contention; requires at least 10 nodes.

  • Supported network tests: RR Two-sided Lat (8 B), RR Get Lat (8 B), RR Two-sided BW (131072 B), RR Put BW (131072 B), RR Two-sided BW+Sync (131072 B), Nat Two-sided BW (131072 B), Multiple Allreduce (8 B), Multiple Alltoall (4096 B)
  • Supported congestors: Alltoall (4096 B), Two-sided Incast (4096 B), Put Incast (4096 B), Get Bcast (4096 B)

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'random ring communication pattern two-sided latency' network test. |
| gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test. |
| gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'multiple allreduce bandwidth' network test. |
| gpcnet-network-test/rr_get_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'RR Get Lat (8 B)' network test. |
| gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'RR Two-sided BW (131072 B)' network test. |
| gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'Nat Two-sided BW (131072 B)' network test. |
| gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) obtained across all nodes for the 'Multiple Alltoall (4096 B)' network test. |
| gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |
| gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |
| gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |

ib-traffic#

Introduction#

Measure the InfiniBand performance under multi-node traffic patterns.

The direction between client and server can be 'cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'.

The traffic pattern is defined in a config file, which is pre-defined for one-to-many, many-to-one, and all-to-all patterns. Each row in the config is one round, and all pairs of nodes in a row run the ib command simultaneously.
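
To illustrate how such rounds can be scheduled so that every node pair appears exactly once and no node is used twice within a round, a round-robin (circle method) sketch is shown below. This illustrates only the scheduling idea; it is not the tool's config generator and does not reproduce its file syntax:

```python
def pairwise_rounds(num_nodes: int):
    """Enumerate rounds of disjoint node pairs covering every pair exactly once."""
    nodes = list(range(num_nodes)) + ([None] if num_nodes % 2 else [])
    rounds = []
    for _ in range(len(nodes) - 1):
        pairs = [
            (nodes[i], nodes[-1 - i])
            for i in range(len(nodes) // 2)
            if None not in (nodes[i], nodes[-1 - i])
        ]
        rounds.append(pairs)
        # Keep the first node fixed and rotate the rest (circle method).
        nodes = [nodes[0], nodes[-1]] + nodes[1:-1]
    return rounds

# Example: 4 nodes -> 3 rounds, each with 2 simultaneous pairs.
for line in pairwise_rounds(4):
    print(line)
```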

Besides the above three patterns, ib-traffic also supports a topology-aware traffic pattern. To run ib-traffic with the topology-aware pattern, the user needs to specify 3 required (and 2 optional) parameters in the YAML config file:

  • --pattern: topo-aware
  • --ibstat: path to ibstat output
  • --ibnetdiscover: path to ibnetdiscover output
  • --min_dist: minimum distance of VM pairs (optional, default 2)
  • --max_dist: maximum distance of VM pairs (optional, default 6)

Each row in the config file contains all VM pairs with a fixed distance (number of hops). By default, the 1st, 2nd, and 3rd rows contain all VM pairs with topology distances of 2, 4, and 6, respectively.

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| ib-traffic/ib_write_bw_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | bandwidth (GB/s) | The maximum bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and the client. |
| ib-traffic/ib_write_lat_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | time (us) | The maximum latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and the client. |

Computation-communication Benchmarks#

computation-communication-overlap#

Introduction#

Test the performance of a single node when communication and computation overlap.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-computation-communication-overlap/mul_time | time (ms) | Time of communication and mul kernel computation overlap. |
| pytorch-computation-communication-overlap/matmul_time | time (ms) | Time of communication and matmul kernel computation overlap. |

sharding-matmul#

Introduction#

Test the performance of large-scale matmul operations with multiple GPUs (a sketch of the allreduce scheme is shown after the list):

  • allreduce: each GPU calculates part of the matmul, and AllReduce is used to merge all partial results into one tensor.
  • allgather: each GPU calculates part of the matmul, and AllGather + Concat are used to merge all partial results into one tensor.
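
A minimal sketch of the allreduce scheme under torch.distributed is shown below; it assumes the process group is already initialized and that each rank holds its slice of the inner dimension, and it is not the benchmark's own implementation:

```python
import torch
import torch.distributed as dist

def sharded_matmul_allreduce(a_shard: torch.Tensor, b_shard: torch.Tensor) -> torch.Tensor:
    """allreduce scheme: a_shard is (m, k/world), b_shard is (k/world, n).

    Each rank computes a partial product over its slice of k, then the partial
    results are summed across ranks with AllReduce to form the full (m, n) output.
    """
    partial = a_shard @ b_shard
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    return partial
```

The allgather variant instead shards the output dimension, so each rank produces a distinct slice of the result that is merged by AllGather followed by concatenation.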

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-sharding-matmul/allreduce_time | time (ms) | Time of sharding matmul using allreduce. |
| pytorch-sharding-matmul/allgather_time | time (ms) | Time of sharding matmul using allgather. |

dist-inference#

Introduction#

Test the performance of distributed model inference. Both a PyTorch implementation and a C++ implementation are supported.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-dist-inference/step_times | time (ms) | Average time of model inference runs. |
| pytorch-dist-inference/step_times_${percentile} | time (ms) | Tail (50, 90, 95, 99, 99.9) percentile time of model inference runs. |

Storage Benchmarks#

disk-benchmark#

Introduction#

Measure the disk performance through FIO.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| disk-benchmark/${disk_name}_rand_read_write_bs | size (bytes) | Block size of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_read_iops | IOPS | Read IOPS of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.0 | time (ns) | 95.0th percentile read latency of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.0 | time (ns) | 99.0th percentile read latency of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.9 | time (ns) | 99.9th percentile read latency of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_write_iops | IOPS | Write IOPS of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.0 | time (ns) | 95.0th percentile write latency of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.0 | time (ns) | 99.0th percentile write latency of the disk random read-write test. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.9 | time (ns) | 99.9th percentile write latency of the disk random read-write test. |