Micro Benchmarks

Computation Benchmarks#

kernel-launch#

Introduction#

Measure GPU kernel launch latency, which is defined as the time range from the beginning of the launch API call to the beginning of the kernel execution.
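
As a rough illustration of what `wall_time` means (not the benchmark's own implementation), the CPU-side launch overhead can be approximated in PyTorch by timing asynchronous kernel launches; the function name and iteration count below are illustrative assumptions:

```python
import time
import torch

# Hypothetical sketch: approximate CPU-side kernel launch overhead by timing
# asynchronous launch calls. If the launch queue fills up, the measured time
# will also include kernel execution, so this is only an approximation.
def approx_launch_wall_time(iterations: int = 2000) -> float:
    assert torch.cuda.is_available()
    x = torch.ones(1, device="cuda")
    torch.cuda.synchronize()           # make sure no prior work is pending
    start = time.perf_counter()
    for _ in range(iterations):
        x.add_(1)                      # each call enqueues a tiny kernel asynchronously
    elapsed = time.perf_counter() - start
    torch.cuda.synchronize()           # drain the queued kernels afterwards
    return elapsed / iterations * 1e3  # mean launch overhead in ms

if __name__ == "__main__":
    print(f"approx. launch wall time: {approx_launch_wall_time():.4f} ms")
```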

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| kernel-launch/event_time | time (ms) | Launch latency measured in GPU time. |
| kernel-launch/wall_time | time (ms) | Launch latency measured in CPU time. |

gemm-flops#

Introduction#

Measure the GPU GEMM FLOPS for different float and int data types, with or without Tensor Core (XDLOPS), performed by NVIDIA cutlass or AMD rocblas-bench.
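
The reported FLOPS are derived from the standard GEMM operation count of 2 × m × n × k; a minimal sketch of that conversion (matrix sizes and runtime below are illustrative, not the benchmark's defaults):

```python
def gemm_gflops(m: int, n: int, k: int, elapsed_seconds: float) -> float:
    """Convert one GEMM's runtime into GFLOPS using the standard 2*m*n*k count."""
    return 2.0 * m * n * k / elapsed_seconds / 1e9

# Example: an 8192 x 8192 x 8192 GEMM finishing in 10 ms -> ~110,000 GFLOPS (~110 TFLOPS).
print(gemm_gflops(8192, 8192, 8192, 0.010))
```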

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gemm-flops/fp64_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS. |
| gemm-flops/fp32_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS. |
| gemm-flops/fp16_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS. |
| gemm-flops/fp64_tc_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/tf32_tc_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/fp16_tc_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/bf16_tc_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core. |
| gemm-flops/int8_tc_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/int4_tc_iops | IOPS (GIOPS) | GEMM int4 peak IOPS with NVIDIA Tensor Core. |
| gemm-flops/fp32_xdlops_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with AMD XDLOPS. |
| gemm-flops/fp16_xdlops_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/bf16_xdlops_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS. |
| gemm-flops/int8_xdlops_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with AMD XDLOPS. |

matmul#

Introduction#

Measure the performance of a large-scale matmul operation using torch.matmul on a single GPU.
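
A minimal sketch of timing torch.matmul on a single GPU with CUDA events (shapes and iteration counts are illustrative, not the benchmark's exact configuration):

```python
import torch

def time_matmul(n: int = 8192, iters: int = 20) -> float:
    """Return the mean time (ms) of an n x n torch.matmul on a single GPU."""
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.matmul(a, b)                 # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # elapsed_time returns milliseconds

print(f"{time_matmul():.2f} ms per matmul")
```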

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |

cublaslt-gemm / hipblaslt-gemm#

Introduction#

Measure the GEMM performance of cublasLtMatmul (NVIDIA) or hipblaslt-bench (AMD).

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
| hipblaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |

cublas-function#

Introduction#

Measure the performance of the most commonly used NVIDIA cuBLAS functions, with parameters drawn from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuBLAS are as follows:

  • cublasSgemm
  • cublasSgemmStridedBatched
  • cublasGemmStridedBatchedEx
  • cublasGemmEx
  • cublasCgemm3mStridedBatched
  • cublasCgemm

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cuBLAS function with the given parameters. |
| cublas-function/name_${function_name}_${parameters}_correctness | | Whether the results of the cuBLAS function with the given parameters pass the correctness check, if the correctness check is enabled. |
| cublas-function/name_${function_name}_${parameters}_error | | The error ratio of the results of the cuBLAS function with the given parameters, if the correctness check is enabled. |

cudnn-function#

Introduction#

Measure the performance of the most commonly used NVIDIA cuDNN functions, with parameters drawn from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.

The supported functions for cuDNN are as follows:

  • cudnnConvolutionBackwardFilter
  • cudnnConvolutionBackwardData
  • cudnnConvolutionForward

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cuDNN function with the given parameters. |

tensorrt-inference#

Introduction#

Inference PyTorch/ONNX models on NVIDIA GPUs with TensorRT.

Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19, lstm, bert-base, bert-large, gpt2-small

Large models such as gpt2-large are not currently supported, because models larger than 2 GB (the maximum protobuf size) cannot be exported as a single ONNX file.
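
The 2 GB limit applies to the ONNX export step that precedes TensorRT inference. A minimal, hypothetical sketch of exporting one of the supported torchvision models with torch.onnx.export (output path and opset version are illustrative assumptions):

```python
import torch
import torchvision

# Illustrative export of a supported torchvision model to ONNX before running it
# with TensorRT; "resnet50.onnx" and opset 13 are assumptions, not fixed values.
model = torchvision.models.resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])
```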

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_mean | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query. |
| tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
| tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The 99th percentile duration from when the H2D of a query is called to when the D2H of the same query is completed. |

ort-inference#

Introduction#

Measure the inference performance of torchvision models using ONNX Runtime. Currently the following models are supported:

alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19

The supported percentiles are 50, 90, 95, 99, and 99.9.
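
A minimal sketch (not the benchmark's implementation) of how per-batch latency percentiles like those above can be collected with ONNX Runtime; the model file, batch size, and provider are assumptions:

```python
import time
import numpy as np
import onnxruntime as ort

# Illustrative latency-percentile measurement; "resnet50.onnx" and the provider
# list are assumptions, not values fixed by the benchmark.
sess = ort.InferenceSession("resnet50.onnx", providers=["CUDAExecutionProvider"])
feed = {sess.get_inputs()[0].name: np.random.rand(32, 3, 224, 224).astype(np.float32)}

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    sess.run(None, feed)                          # one batch of inference
    latencies.append((time.perf_counter() - t0) * 1e3)  # ms

for p in (50, 90, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies, p):.2f} ms")
```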

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ort-inference/{precision}_{model}_time | time (ms) | The mean latency to execute one batch of inference. |
| ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |

gpu-burn#

Introduction#

Multi-GPU CUDA stress test for GPU compute and memory utilization, performed by gpu-burn. Supports double-precision units and the use of Tensor Cores.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| gpu-burn/time | time (s) | The runtime of the gpu-burn test. |
| gpu-burn/gpu_[0-9]_pass | yes/no | The result of the gpu-burn test for each GPU (1: yes, 0: no). |
| gpu-burn/abort | yes/no | Whether the gpu-burn test aborted before returning GPU results (1: yes, 0: no). |

cpu-hpl#

Introduction#

HPL, the High Performance Linpack benchmark, evaluates compute bandwidth by solving dense linear systems in double-precision arithmetic. It is performed by the High-Performance Linpack Benchmark for Distributed-Memory Computers.
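
The reported GFLOPS follow the standard HPL operation count for an N x N system, (2/3)N³ + 2N² floating-point operations; a minimal sketch of that conversion (the problem size and runtime below are illustrative):

```python
def hpl_gflops(n: int, elapsed_seconds: float) -> float:
    """Convert an HPL run into GFLOPS using the standard (2/3)*N^3 + 2*N^2 count."""
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / elapsed_seconds / 1e9

# Example: N = 100000 solved in 1500 s -> ~444 GFLOPS.
print(hpl_gflops(100_000, 1500.0))
```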

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-hpl/tests_pass | | HPL completed running and the correctness test passed (1: pass, 0: fail). |
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth. |
| cpu-hpl/time | time (s) | Time elapsed during the HPL run. |

cpu-stream#

Introduction#

Measure memory bandwidth and computation rate for simple vector kernels, performed by the University of Virginia STREAM benchmark.
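
The four STREAM kernels, and the byte counts STREAM uses to derive bandwidth from each one, are shown in this minimal NumPy sketch (the benchmark itself runs the compiled C STREAM code; the array size is illustrative):

```python
import numpy as np

N = 10_000_000                       # illustrative array size (8-byte elements)
a, b, c = (np.zeros(N) for _ in range(3))
q = 3.0

# The four STREAM kernels; bytes moved per iteration:
c[:] = a             # copy : 2 * 8 * N bytes (one read, one write)
b[:] = q * c         # scale: 2 * 8 * N bytes
c[:] = a + b         # add  : 3 * 8 * N bytes (two reads, one write)
a[:] = b + q * c     # triad: 3 * 8 * N bytes
```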

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-stream/threads | | Number of threads used for the test, determined by core count. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_throughput | bandwidth (MB/s) | Memory throughput of the designated kernel operation. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_avg | time (s) | Average elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_min | time (s) | Minimum elapsed time over all iterations. |
| cpu-stream/['copy', 'scale', 'add', 'triad']_time_max | time (s) | Maximum elapsed time over all iterations. |

Communication Benchmarks#

cpu-memory-bw-latency#

Introduction#

Measure the memory copy bandwidth and latency across different CPU NUMA nodes, performed by the Intel MLC tool.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw | bandwidth (MB/s) | Former NUMA to latter NUMA memory bandwidth. |
| cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_lat | time (ns) | Former NUMA to latter NUMA memory latency. |
| cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, full read. |
| cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1. |
| cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |

mem-bw#

Introduction#

Measure the memory copy bandwidth across PCI-e and memory copy bandwidth between GPUs, performed by NVIDIA or AMD bandwidth test tool.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| mem-bw/h2d_bw | bandwidth (GB/s) | Host to device copy bandwidth. |
| mem-bw/d2h_bw | bandwidth (GB/s) | Device to host copy bandwidth. |
| mem-bw/d2d_bw | bandwidth (GB/s) | Device to device copy bandwidth. |

gpu-copy-bw#

Introduction#

Measure the memory copy bandwidth performed by GPU SM/DMA engine, including device-to-host, host-to-device, and device-to-device. For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in hipDeviceMallocUncached (previously hipDeviceMallocFinegrained) mode to maximize performance.
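
As a rough illustration of the unidirectional host-to-device case (not the benchmark's SM/DMA implementation), pinned host memory and CUDA events can be used to estimate copy bandwidth; the buffer size and iteration count below are arbitrary:

```python
import torch

def h2d_bandwidth_gbps(size_mb: int = 256, iters: int = 20) -> float:
    """Rough host-to-device copy bandwidth using pinned host memory (GB/s)."""
    n = size_mb * 1024 * 1024
    host = torch.empty(n, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(n, dtype=torch.uint8, device="cuda")
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    dev.copy_(host, non_blocking=True)   # warm-up copy
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        dev.copy_(host, non_blocking=True)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3  # elapsed_time is in ms
    return n * iters / seconds / 1e9

print(f"{h2d_bandwidth_gbps():.1f} GB/s")
```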

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| cpu_to_gpu[0-9]+_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_cpu_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_by_(sm\|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_to_gpu[0-9]+_(read\|write)_by_(sm\|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| cpu_and_gpu[0-9]+_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_cpu_by_(sm\|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | Same as above, but generated by --dtoh --bidirectional. |
| gpu[0-9]+_and_gpu[0-9]+_by_(sm\|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing its own memory using DMA engine or GPU SM. |
| gpu[0-9]+_and_gpu[0-9]+_(read\|write)_by_(sm\|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing a peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
| gpu[0-9]+_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing all peer GPUs' memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu[0-9]+_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
| gpu_all_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |

ib-loopback#

Introduction#

Measure the InfiniBand loopback verbs bandwidth, performed by OFED performance tests.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| ib-loopback/ib_write_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback write bandwidth with given message size. |
| ib-loopback/ib_read_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback read bandwidth with given message size. |
| ib-loopback/ib_send_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback send bandwidth with given message size. |

nccl-bw / rccl-bw#

Introduction#

Measure the performance of NCCL/RCCL operations under multi-node traffic patterns, performed by nccl-tests or rccl-tests. The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall. Both in-place and out-of-place measurements are supported.

The following traffic patterns are supported:

  • all-nodes, validate the NCCL/RCCL performance across all VM nodes simultaneously.
  • pair-wise, validate the NCCL/RCCL performance across VM pairs with all possible combinations in parallel.
  • k-batch, validate the NCCL/RCCL performance across VM groups with a specified batch scale.
  • topo-aware, validate the NCCL/RCCL performance across VM pairs with different distances/hops as a quick test.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| nccl-bw/${operation}_${msg_size}_time | time (us) | NCCL operation latency with given message size. |
| nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with given message size. |
| nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_time | time (us) | RCCL operation latency with given message size. |
| rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
| rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size. |
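
Algorithm bandwidth (algbw) and bus bandwidth (busbw) follow the nccl-tests conventions: algbw is the message size divided by the elapsed time, and busbw applies an operation-specific scaling factor, for example 2(n-1)/n for allreduce with n ranks. A minimal sketch of that conversion (other operations use different factors):

```python
def allreduce_busbw(msg_size_bytes: int, time_us: float, num_ranks: int) -> float:
    """Bus bandwidth (GB/s) for allreduce, per the nccl-tests convention:
    algbw = size / time, busbw = algbw * 2 * (n - 1) / n."""
    algbw = msg_size_bytes / (time_us * 1e-6) / 1e9
    return algbw * 2 * (num_ranks - 1) / num_ranks

# Example: a 1 GB allreduce across 8 ranks finishing in 10 ms
# -> algbw = 100 GB/s, busbw = 175 GB/s.
print(allreduce_busbw(10**9, 10_000, 8))
```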

If MPI mode is enabled and a traffic pattern is specified, the metric name pattern changes to nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time, where:

  • serial_index is the index of the host group in the serial dimension.
  • parallel_index is the index of the host list in the parallel dimension.

tcp-connectivity#

Introduction#

Test the TCP connectivity between the current node and the nodes in the hostfile, performed by tcping.

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| tcp-connectivity/${hostname/ip}_successed_count | count | Number of successful TCP connections between the current node and the given node. |
| tcp-connectivity/${hostname/ip}_failed_count | count | Number of failed TCP connections between the current node and the given node. |
| tcp-connectivity/${hostname/ip}_success_rate | | Success rate (succeeded / total) of TCP connections between the current node and the given node. |
| tcp-connectivity/${hostname/ip}_time_min | time (ms) | Minimum latency of TCP connections between the current node and the given node. |
| tcp-connectivity/${hostname/ip}_time_max | time (ms) | Maximum latency of TCP connections between the current node and the given node. |
| tcp-connectivity/${hostname/ip}_time_avg | time (ms) | Average latency of TCP connections between the current node and the given node. |

gpcnet-network-test / gpcnet-network-load-test#

Introduction#

Distributed test that evaluates global network performance and congestion, performed by GPCNET.

gpcnet-network-test: full-system network tests in random and natural ring, alltoall, and allreduce patterns; requires at least 2 nodes.

gpcnet-network-load-test: selected full-system network tests run alongside four congestors to measure network congestion or contention; requires at least 10 nodes.

  • Supported network tests: RR Two-sided Lat (8 B), RR Get Lat (8 B), RR Two-sided BW (131072 B), RR Put BW (131072 B), RR Two-sided BW+Sync (131072 B), Nat Two-sided BW (131072 B), Multiple Allreduce (8 B), Multiple Alltoall (4096 B)
  • Supported congestors: Alltoall (4096 B), Two-sided Incast (4096 B), Put Incast (4096 B), Get Bcast (4096 B)

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided latency' network test. |
| gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test. |
| gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'multiple allreduce bandwidth' network test. |
| gpcnet-network-test/rr_get_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Get Lat (8 B)' network test. |
| gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Two-sided BW (131072 B)' network test. |
| gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Nat Two-sided BW (131072 B)' network test. |
| gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Alltoall (4096 B)' network test. |
| gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test. |
| gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test. |
| gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test. |

ib-traffic#

Introduction#

Measure the InfiniBand performance under multi nodes' traffic pattern.

The direction between client and server can be 'cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'.

The traffic pattern is defined in a config file, which is pre-defined for one-to-many, many-to-one, and all-to-all patterns. Each row in the config is one round, and all pairs of nodes in a row run the ib command simultaneously.

Besides the above three patterns, ib-traffic also supports a topology-aware traffic pattern. To run ib-traffic with the topo-aware pattern, the user needs to specify 3 required (and 2 optional) parameters in the YAML config file:

  • --pattern  topo-aware
  • --ibstat  path to ibstat output
  • --ibnetdiscover  path to ibnetdiscover output
  • --min_dist  minimum distance of VM pairs (optional, default 2)
  • --max_dist  maximum distance of VM pairs (optional, default 6)

Each row in the generated config file contains all VM pairs with a fixed distance (number of hops). By default, the 1st, 2nd, and 3rd rows contain all VM pairs with a topology distance of 2, 4, and 6, respectively.
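
The exact config-file syntax is tool specific, but the round structure (every node in at most one pair per round) can be illustrated with a simple round-robin pairing sketch. The "server,client;server,client" line format below is a hypothetical illustration, not the required syntax:

```python
# Hypothetical sketch of a pair-wise schedule: each output line stands for one round,
# and every node appears in at most one pair per round (circle / round-robin method).
def pairwise_rounds(num_nodes: int):
    nodes = list(range(num_nodes))
    if num_nodes % 2:
        nodes.append(None)                 # bye slot for odd node counts
    rounds = []
    for _ in range(len(nodes) - 1):
        pairs = [(nodes[i], nodes[-1 - i]) for i in range(len(nodes) // 2)]
        rounds.append(";".join(f"{a},{b}" for a, b in pairs
                               if a is not None and b is not None))
        nodes.insert(1, nodes.pop())       # rotate all nodes except the first
    return rounds

for line in pairwise_rounds(8):
    print(line)                            # e.g. "0,7;1,6;2,5;3,4"
```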

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| ib-traffic/ib_write_bw_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and client. |
| ib-traffic/ib_write_lat_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | time (us) | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and client. |

nvbandwidth#

Introduction#

Measure bandwidth and latency for various memcpy patterns across different links using copy engine (CE) or kernel (SM) copy methods, performed by nvbandwidth.

Metrics#

| Metrics | Unit | Description |
|---------|------|-------------|
| host_to_device_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device CE memcpy using cuMemcpyAsync. |
| host_to_device_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host CE memcpy using cuMemcpyAsync. |
| device_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| host_to_device_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. |
| host_to_device_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. |
| device_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_bidirectional_memcpy_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_ce_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_bidirectional_memcpy_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between each pair of accessible peers. A copy in the opposite direction of the measured copy is run simultaneously but not measured. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_ce_sum_bw | GB/s | Sum of the output matrix. |
| all_to_host_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| all_to_host_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host copy is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| all_to_host_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| host_to_all_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of cuMemcpyAsync between the host and a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| host_to_all_bidirectional_memcpy_ce_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device copy is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic. |
| host_to_all_bidirectional_memcpy_ce_sum_bw | GB/s | Sum of the output matrix. |
| all_to_one_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_ce_sum_bw | GB/s | Sum of the output matrix. |
| all_to_one_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_ce_sum_bw | GB/s | Sum of the output matrix. |
| one_to_all_write_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_ce_sum_bw | GB/s | Sum of the output matrix. |
| one_to_all_read_ce_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_ce_sum_bw | GB/s | Sum of the output matrix. |
| host_to_device_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Host to device SM memcpy using a copy kernel. |
| host_to_device_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| device_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Device to host SM memcpy using a copy kernel. |
| device_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_bidirectional_memcpy_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Read tests launch a copy from the peer device to the target using the target's context. |
| device_to_device_bidirectional_memcpy_read_sm_sum_bw | GB/s | Sum of the output matrix. |
| device_to_device_bidirectional_memcpy_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between each pair of accessible peers. Copies are run in both directions between each pair, and the sum is reported. Write tests launch a copy from the target device to the peer using the target's context. |
| device_to_device_bidirectional_memcpy_write_sm_sum_bw | GB/s | Sum of the output matrix. |
| all_to_host_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between a single device and the host while simultaneously running copies from all other devices to the host. |
| all_to_host_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| all_to_host_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A device to host bandwidth of a copy kernel is measured while a host to device copy is run simultaneously. Only the device to host copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| all_to_host_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| host_to_all_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | Measures bandwidth of a copy kernel between the host and a single device while simultaneously running copies from the host to all other devices. |
| host_to_all_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| host_to_all_bidirectional_memcpy_sm_cpu[0-9]_gpu[0-9]_bw | GB/s | A host to device bandwidth of a copy kernel is measured while a device to host copy is run simultaneously. Only the host to device copy bandwidth is reported. All other devices generate simultaneous host to device and device to host interfering traffic using copy kernels. |
| host_to_all_bidirectional_memcpy_sm_sum_bw | GB/s | Sum of the output matrix. |
| all_to_one_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| all_to_one_write_sm_sum_bw | GB/s | Sum of the output matrix. |
| all_to_one_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from all accessible peers to a single device, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| all_to_one_read_sm_sum_bw | GB/s | Sum of the output matrix. |
| one_to_all_write_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total outbound bandwidth for each device. Write tests launch a copy from the target device to the peer using the target's context. |
| one_to_all_write_sm_sum_bw | GB/s | Sum of the output matrix. |
| one_to_all_read_sm_gpu[0-9]_gpu[0-9]_bw | GB/s | Measures the total bandwidth of copies from a single device to all accessible peers, for each device. Bandwidth is reported as the total inbound bandwidth for each device. Read tests launch a copy from the peer device to the target using the target's context. |
| one_to_all_read_sm_sum_bw | GB/s | Sum of the output matrix. |
| host_device_latency_sm_cpu[0-9]_gpu[0-9]_lat | µs | Host - device SM copy latency using a pointer-chase kernel. |
| host_device_latency_sm_sum_lat | µs | Sum of the output matrix. |
| device_to_device_latency_sm_gpu[0-9]_gpu[0-9]_lat | µs | Measures latency of a pointer dereference operation between each pair of accessible peers. Memory is allocated on a GPU and is accessed by the peer GPU to determine latency. |
| device_to_device_latency_sm_sum_lat | µs | Sum of the output matrix. |

Computation-communication Benchmarks#

computation-communication-overlap#

Introduction#

Test the performance of a single node when communication and computation overlap.
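
The idea can be sketched in PyTorch by issuing a non-blocking collective while running compute kernels, then waiting for both. A minimal, illustrative sketch (not the benchmark's implementation; tensor shapes and iteration counts are assumptions, launch with torchrun):

```python
# Minimal sketch of overlapping a collective with computation (launch via torchrun).
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

comm_buf = torch.randn(64 * 1024 * 1024, device="cuda")
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

work = dist.all_reduce(comm_buf, async_op=True)  # communication proceeds asynchronously
for _ in range(10):
    c = torch.matmul(a, b)                        # computation overlaps with the all-reduce
work.wait()
torch.cuda.synchronize()
dist.destroy_process_group()
```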

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-computation-communication-overlap/mul_time | time (ms) | Time of communication and mul kernel computation overlap. |
| pytorch-computation-communication-overlap/matmul_time | time (ms) | Time of communication and matmul kernel computation overlap. |

sharding-matmul#

Introduction#

Test the performance of large-scale matmul operations with multiple GPUs (see the sketch after this list):

  • allreduce: each GPU computes a partition of the matmul and uses AllReduce to merge the partial results into one tensor.
  • allgather: each GPU computes a partition of the matmul and uses AllGather + Concat to merge the partial results into one tensor.
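
A minimal PyTorch sketch of the allgather variant (sizes are illustrative assumptions, not the benchmark's configuration; launch with torchrun):

```python
# Minimal sketch of an allgather-style sharded matmul; assumes n is divisible
# by the world size. Launch via torchrun.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

n = 4096
x_shard = torch.randn(n // world, n, device="cuda")  # each rank owns a slice of rows
w = torch.randn(n, n, device="cuda")

y_shard = torch.matmul(x_shard, w)                   # local partial result
gathered = [torch.empty_like(y_shard) for _ in range(world)]
dist.all_gather(gathered, y_shard)                   # AllGather + Concat merges the shards
y = torch.cat(gathered, dim=0)

dist.destroy_process_group()
```

For the allreduce variant, each GPU would instead compute a partial product over a slice of the inner dimension and dist.all_reduce would sum the partial results into one tensor.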

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-sharding-matmul/allreduce_time | time (ms) | Time of sharding matmul using allreduce. |
| pytorch-sharding-matmul/allgather_time | time (ms) | Time of sharding matmul using allgather. |

dist-inference#

Introduction#

Test the performance of distributed model inference. Both a PyTorch implementation and a C++ implementation are supported.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| pytorch-dist-inference/step_times | time (ms) | Average time of model inference runs. |
| pytorch-dist-inference/step_times_${percentile} | time (ms) | Tail (50, 90, 95, 99, 99.9) time of model inference runs. |

Storage Benchmarks#

disk-benchmark#

Introduction#

Measure the disk performance through FIO.

Metrics#

| Name | Unit | Description |
|------|------|-------------|
| disk-benchmark/${disk_name}_rand_read_write_bs | size (bytes) | Block size of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_read_iops | IOPS | Read IOPS of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.0 | time (ns) | 95.0th percentile read latency of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.0 | time (ns) | 99.0th percentile read latency of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.9 | time (ns) | 99.9th percentile read latency of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_write_iops | IOPS | Write IOPS of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.0 | time (ns) | 95.0th percentile write latency of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.0 | time (ns) | 99.0th percentile write latency of the random read-write workload. |
| disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.9 | time (ns) | 99.9th percentile write latency of the random read-write workload. |