# Micro Benchmarks
## Computation Benchmarks

### kernel-launch
#### Introduction

Measure GPU kernel launch latency, defined as the time from the beginning of the launch API call to the beginning of the kernel execution.
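For illustration, below is a rough PyTorch sketch of the two timing views (CPU wall clock vs. GPU events); it is not the benchmark's actual implementation, and the kernel, tensor size, and iteration count are arbitrary choices.

```python
# Rough PyTorch sketch of wall-clock vs. event-based launch timing (not the benchmark's implementation).
import time
import torch

x = torch.ones(1, device="cuda")           # tiny tensor so each kernel is trivial
start_evt = torch.cuda.Event(enable_timing=True)
stop_evt = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()

iters = 1000
start_evt.record()
t0 = time.perf_counter()
for _ in range(iters):
    x.add_(1)                              # each call enqueues one small kernel
wall_ms = (time.perf_counter() - t0) * 1e3 / iters   # CPU-side cost per launch
stop_evt.record()
torch.cuda.synchronize()
event_ms = start_evt.elapsed_time(stop_evt) / iters  # GPU-side view per launch

print(f"wall_time ~{wall_ms:.4f} ms/launch, event_time ~{event_ms:.4f} ms/launch")
```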
#### Metrics

Name | Unit | Description |
---|---|---|
kernel-launch/event_time | time (ms) | Launch latency measured in GPU time. |
kernel-launch/wall_time | time (ms) | Launch latency measured in CPU time. |
### gemm-flops
#### Introduction

Measure the GPU GEMM FLOPS for different floating-point and integer data types, with or without Tensor Cores (XDLOPS), performed by NVIDIA cutlass or AMD rocblas-bench.
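As a rough illustration of how GEMM throughput relates to matrix shape and elapsed time, here is a back-of-the-envelope PyTorch sketch; it is not the cutlass/rocblas-bench harness, and the matrix size, data type, and iteration count are arbitrary assumptions.

```python
# Back-of-the-envelope FP16 GEMM throughput with PyTorch (not the cutlass/rocblas-bench harness).
import torch

m = n = k = 8192
a = torch.randn(m, k, device="cuda", dtype=torch.float16)
b = torch.randn(k, n, device="cuda", dtype=torch.float16)

for _ in range(10):                        # warm-up
    torch.matmul(a, b)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)
iters = 50
start.record()
for _ in range(iters):
    torch.matmul(a, b)
stop.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(stop) / 1e3 / iters
print(f"~{2 * m * n * k / seconds / 1e9:.1f} GFLOPS (fp16 GEMM uses 2*m*n*k flops)")
```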
#### Metrics

Name | Unit | Description |
---|---|---|
gemm-flops/fp64_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS. |
gemm-flops/fp32_flops | FLOPS (GFLOPS) | GEMM float32 peak FLOPS. |
gemm-flops/fp16_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS. |
gemm-flops/fp64_tc_flops | FLOPS (GFLOPS) | GEMM float64 peak FLOPS with NVIDIA Tensor Core. |
gemm-flops/tf32_tc_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with NVIDIA Tensor Core. |
gemm-flops/fp16_tc_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with NVIDIA Tensor Core. |
gemm-flops/bf16_tc_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with NVIDIA Tensor Core. |
gemm-flops/int8_tc_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with NVIDIA Tensor Core. |
gemm-flops/int4_tc_iops | IOPS (GIOPS) | GEMM int4 peak IOPS with NVIDIA Tensor Core. |
gemm-flops/fp32_xdlops_flops | FLOPS (GFLOPS) | GEMM tensor-float32 peak FLOPS with AMD XDLOPS. |
gemm-flops/fp16_xdlops_flops | FLOPS (GFLOPS) | GEMM float16 peak FLOPS with AMD XDLOPS. |
gemm-flops/bf16_xdlops_flops | FLOPS (GFLOPS) | GEMM bfloat16 peak FLOPS with AMD XDLOPS. |
gemm-flops/int8_xdlops_iops | IOPS (GIOPS) | GEMM int8 peak IOPS with AMD XDLOPS. |
### matmul
#### Introduction

Large-scale matmul operation using `torch.matmul` with one GPU.
#### Metrics

Name | Unit | Description |
---|---|---|
pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |
### cublaslt-gemm / hipblaslt-gemm
#### Introduction

Measure the GEMM performance of `cublasLtMatmul` or `hipblaslt-bench`.
#### Metrics

Name | Unit | Description |
---|---|---|
cublaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
hipblaslt-gemm/${dtype}_${batch}_${m}_${n}_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
### cublas-function
#### Introduction

Measure the performance of the most common NVIDIA cuBLAS functions, with parameters drawn from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported cuBLAS functions are:
- cublasSgemm
- cublasSgemmStridedBatched
- cublasGemmStridedBatchedEx
- cublasGemmEx
- cublasCgemm3mStridedBatched
- cublasCgemm
#### Metrics

Name | Unit | Description |
---|---|---|
cublas-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cublas function with the parameters. |
cublas-function/name_${function_name}_${parameters}_correctness | | Whether the results of executing the cuBLAS function with the given parameters pass the correctness check, if the correctness check is enabled. |
cublas-function/name_${function_name}_${parameters}_error | | The error ratio of the results of executing the cuBLAS function with the given parameters, if the correctness check is enabled. |
### cudnn-function
#### Introduction

Measure the performance of the most common NVIDIA cuDNN functions, with parameters drawn from the training of models including ResNet, VGG, DenseNet, LSTM, BERT, and GPT-2.
The supported cuDNN functions are:
- cudnnConvolutionBackwardFilter
- cudnnConvolutionBackwardData
- cudnnConvolutionForward
#### Metrics

Name | Unit | Description |
---|---|---|
cudnn-function/name_${function_name}_${parameters}_time | time (us) | The mean time to execute the cudnn function with the parameters. |
### tensorrt-inference
#### Introduction

Run inference on PyTorch/ONNX models on NVIDIA GPUs with TensorRT.
Currently the following models are supported:
alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19, lstm, bert-base, bert-large, gpt2-small.
Large models such as gpt2-large are not supported yet, because models larger than 2 GB (the maximum protobuf size) cannot be exported as a single ONNX file.
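The pipeline starts from an ONNX export of the model; a minimal sketch of that first step is shown below. The model choice and output path are placeholders, and the TensorRT build/run step itself is omitted.

```python
# Minimal sketch of exporting a torchvision model to ONNX, the first step of this pipeline.
# The model choice and output path are placeholders; the TensorRT build/run step is omitted.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50.onnx",
                  input_names=["input"], output_names=["output"])
# The exported resnet50.onnx can then be built and benchmarked with TensorRT.
```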
#### Metrics

Name | Unit | Description |
---|---|---|
tensorrt-inference/${model}_gpu_time_mean | time (ms) | The mean GPU latency to execute the kernels for a query. |
tensorrt-inference/${model}_gpu_time_99 | time (ms) | The 99th percentile GPU latency to execute the kernels for a query. |
tensorrt-inference/${model}_host_time_mean | time (ms) | The mean H2D, GPU, and D2H latency to execute the kernels for a query. |
tensorrt-inference/${model}_host_time_99 | time (ms) | The 99th percentile H2D, GPU, and D2H latency to execute the kernels for a query. |
tensorrt-inference/${model}_end_to_end_time_mean | time (ms) | The mean duration from when the H2D of a query is called to when the D2H of the same query is completed. |
tensorrt-inference/${model}_end_to_end_time_99 | time (ms) | The P99 duration from when the H2D of a query is called to when the D2H of the same query is completed. |
### ort-inference
#### Introduction

Measure the inference performance of torchvision models using ONNX Runtime. Currently the following models are supported:
alexnet, densenet121, densenet169, densenet201, densenet161, googlenet, inception_v3, mnasnet0_5, mnasnet1_0, mobilenet_v2, resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, resnext101_32x8d, wide_resnet50_2, wide_resnet101_2, shufflenet_v2_x0_5, shufflenet_v2_x1_0, squeezenet1_0, squeezenet1_1, vgg11, vgg11_bn, vgg13, vgg13_bn, vgg16, vgg16_bn, vgg19_bn, vgg19
The supported percentiles are 50, 90, 95, 99, and 99.9.
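A minimal sketch of how one batch could be timed with ONNX Runtime is shown below; the model file, input name, batch shape, and iteration count are placeholder assumptions, not the benchmark's actual code.

```python
# Minimal ONNX Runtime latency sketch; model path, input name, and iteration count are placeholders.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("resnet50.onnx", providers=["CUDAExecutionProvider"])
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

latencies = []
for _ in range(100):
    t0 = time.perf_counter()
    sess.run(None, {"input": x})                   # one batch of inference
    latencies.append((time.perf_counter() - t0) * 1e3)

print(f"mean {np.mean(latencies):.2f} ms, p99 {np.percentile(latencies, 99):.2f} ms")
```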
#### Metrics

Name | Unit | Description |
---|---|---|
ort-inference/{precision}_{model}_time | time (ms) | The mean latency to execute one batch of inference. |
ort-inference/{precision}_{model}_time_{percentile} | time (ms) | The {percentile}th percentile latency to execute one batch of inference. |
### gpu-burn
#### Introduction

Multi-GPU CUDA stress test for GPU compute and memory utilization, performed by gpu-burn. Supports using double-precision data types and Tensor Cores.
#### Metrics

Name | Unit | Description |
---|---|---|
gpu-burn/time | time (s) | The runtime for gpu-burn test. |
gpu-burn/gpu_[0-9]_pass | yes/no | The result of the gpu-burn test for each GPU (1: yes, 0: no). |
gpu-burn/abort | yes/no | Whether the gpu-burn test aborted before returning GPU results (1: yes, 0: no). |
### cpu-hpl
#### Introduction

HPL (High Performance Computing Linpack) evaluates floating-point compute performance by solving dense linear systems in double-precision arithmetic, performed by the High-Performance Linpack Benchmark for Distributed-Memory Computers.
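For reference, HPL-style GFLOPS are typically derived from the problem size N and the elapsed time using the dense LU flop count of roughly 2/3·N³ + 2·N²; the numbers in the sketch below are made-up placeholders, not measured results.

```python
# How HPL-style GFLOPS are typically derived from problem size N and elapsed time.
# Both values below are made-up placeholders, not measured results.
N = 100_000           # hypothetical HPL problem size
elapsed_s = 3600.0    # hypothetical wall time of the run

flops = (2.0 / 3.0) * N**3 + 2.0 * N**2   # approximate flop count of the dense LU solve
print(f"~{flops / elapsed_s / 1e9:.1f} GFLOPS")
```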
#### Metrics

Name | Unit | Description |
---|---|---|
cpu-hpl/tests_pass | | Whether HPL completed and the correctness test passed (1: pass, 0: fail). |
cpu-hpl/throughput | FLOPS (GFLOPS) | Compute throughput. |
cpu-hpl/time | time (s) | Time elapsed during HPL run. |
### cpu-stream
#### Introduction

Measure memory bandwidth and computation rate for simple vector kernels, performed by the University of Virginia STREAM benchmark.
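For illustration, the triad kernel and its bandwidth accounting look roughly like the NumPy sketch below; the real benchmark is the STREAM C implementation, and the array size here is arbitrary.

```python
# Illustrative NumPy triad kernel (a = b + scalar * c); the real benchmark is the STREAM C code.
import time
import numpy as np

n = 20_000_000                    # arbitrary array size
b = np.random.rand(n)
c = np.random.rand(n)
scalar = 3.0

t0 = time.perf_counter()
a = b + scalar * c
elapsed = time.perf_counter() - t0

bytes_moved = 3 * n * 8           # read b, read c, write a (float64)
print(f"triad ~{bytes_moved / elapsed / 1e6:.0f} MB/s")
```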
#### Metrics

Name | Unit | Description |
---|---|---|
cpu-stream/threads | | Number of threads used for the test, determined by the core count. |
cpu-stream/['copy', 'scale', 'add', 'triad']_throughput | bandwidth (MB/s) | Memory throughput of the designated kernel operation. |
cpu-stream/['copy', 'scale', 'add', 'triad']_time_avg | time (s) | Average elapsed time over all iterations. |
cpu-stream/['copy', 'scale', 'add', 'triad']_time_min | time (s) | Minimum elapsed time over all iterations. |
cpu-stream/['copy', 'scale', 'add', 'triad']_time_max | time (s) | Maximum elapsed time over all iterations. |
## Communication Benchmarks

### cpu-memory-bw-latency
#### Introduction

Measure the memory copy bandwidth and latency across different CPU NUMA nodes, performed by the Intel MLC tool.
#### Metrics

Name | Unit | Description |
---|---|---|
cpu-memory-bw-latency/mem_bandwidth_matrix_numa_[0-9]+_[0-9]+_bw | bandwidth (MB/s) | Former NUMA to latter NUMA memory bandwidth. |
cpu-memory-bw-latency/mem_latency_matrix_numa_[0-9]+_[0-9]+_lat | time (ns) | Former NUMA to latter NUMA memory latency. |
cpu-memory-bw-latency/mem_max_bandwidth_all_reads_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, full read. |
cpu-memory-bw-latency/mem_max_bandwidth_3_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 3 : 1. |
cpu-memory-bw-latency/mem_max_bandwidth_2_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 2 : 1. |
cpu-memory-bw-latency/mem_max_bandwidth_1_1_reads-writes_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, read : write = 1 : 1. |
cpu-memory-bw-latency/mem_max_bandwidth_stream-triad_like_bw | bandwidth (MB/s) | Whole-CPU maximum memory bandwidth, with stream-triad like pattern. |
### mem-bw
#### Introduction

Measure the memory copy bandwidth across PCIe and the memory copy bandwidth between GPUs, performed by the NVIDIA or AMD bandwidth test tool.
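A rough PyTorch sketch of how host-to-device copy bandwidth could be estimated is shown below; it is not the vendor bandwidth test tool, and the buffer size and iteration count are arbitrary assumptions.

```python
# Rough PyTorch sketch of host-to-device copy bandwidth (not the vendor bandwidth test tool).
import torch

size_mb = 256
host = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)
dev = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device="cuda")

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)
torch.cuda.synchronize()

iters = 20
start.record()
for _ in range(iters):
    dev.copy_(host, non_blocking=True)     # h2d copy from pinned host memory
stop.record()
torch.cuda.synchronize()

gib_copied = size_mb * iters / 1024
print(f"h2d ~{gib_copied / (start.elapsed_time(stop) / 1e3):.1f} GB/s")
```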
#### Metrics

Name | Unit | Description |
---|---|---|
mem-bw/h2d_bw | bandwidth (GB/s) | Host to device copy bandwidth. |
mem-bw/d2h_bw | bandwidth (GB/s) | Device to host copy bandwidth. |
mem-bw/d2d_bw | bandwidth (GB/s) | Device to device copy bandwidth. |
### gpu-copy-bw

#### Introduction
Measure the memory copy bandwidth performed by the GPU SM/DMA engine, including device-to-host, host-to-device, and device-to-device copies.
For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in `hipDeviceMallocUncached` (previously `hipDeviceMallocFinegrained`) mode to maximize performance.
#### Metrics

Name | Unit | Description |
---|---|---|
cpu_to_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading one NUMA node's host memory using DMA engine or GPU SM. |
gpu[0-9]+_to_cpu_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing one NUMA node's host memory using DMA engine or GPU SM. |
gpu[0-9]+_to_gpu[0-9]+_by_(sm|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing self's memory using DMA engine or GPU SM. |
gpu[0-9]+_to_gpu[0-9]+_(read|write)_by_(sm|dma)_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU reading or writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
cpu_and_gpu[0-9]+_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM. |
gpu[0-9]+_and_cpu_by_(sm|dma)_under_numa[0-9]+_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing one NUMA node's host memory using DMA engine or GPU SM (same as above, but generated with --dtoh --bidirectional). |
gpu[0-9]+_and_gpu[0-9]+_by_(sm|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing self's memory using DMA engine or GPU SM. |
gpu[0-9]+_and_gpu[0-9]+_(read|write)_by_(sm|dma)_bw | bandwidth (GB/s) | The bidirectional bandwidth of one GPU reading and writing peer GPU's memory using DMA engine or GPU SM with peer communication enabled. |
gpu[0-9]+_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of one GPU writing all peer GPUs' memory using GPU SM with peer communication enabled. |
gpu_all_to_gpu[0-9]+_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing one GPU's memory using GPU SM with peer communication enabled. |
gpu_all_to_gpu_all_write_by_sm_bw | bandwidth (GB/s) | The unidirectional bandwidth of all peer GPUs writing all peer GPUs' memory using GPU SM with peer communication enabled. |
### ib-loopback
#### Introduction

Measure the InfiniBand loopback verbs bandwidth, performed by OFED performance tests.
#### Metrics

Name | Unit | Description |
---|---|---|
ib-loopback/ib_write_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback write bandwidth with the given message size. |
ib-loopback/ib_read_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback read bandwidth with the given message size. |
ib-loopback/ib_send_bw_${msg_size} | bandwidth (GB/s) | InfiniBand loopback send bandwidth with the given message size. |
### nccl-bw / rccl-bw
#### Introduction

Measure the performance of NCCL/RCCL operations under multi-node traffic patterns, performed by nccl-tests or rccl-tests. The following operations are currently supported: allreduce, allgather, broadcast, reduce, reducescatter, alltoall. Both in-place and out-of-place measurements are supported.
The following traffic patterns are supported:
- all-nodes: validate the NCCL/RCCL performance across all VM nodes simultaneously.
- pair-wise: validate the NCCL/RCCL performance across VM pairs with all possible combinations in parallel.
- k-batch: validate the NCCL/RCCL performance across VM groups with a specified batch scale.
- topo-aware: validate the NCCL/RCCL performance across VM pairs with different distances/hops as a quick test.
#### Metrics

Name | Unit | Description |
---|---|---|
nccl-bw/${operation}_${msg_size}_time | time (us) | NCCL operation latency with given message size. |
nccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | NCCL operation algorithm bandwidth with given message size. |
nccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | NCCL operation bus bandwidth with given message size. |
rccl-bw/${operation}_${msg_size}_time | time (us) | RCCL operation latency with given message size. |
rccl-bw/${operation}_${msg_size}_algbw | bandwidth (GB/s) | RCCL operation algorithm bandwidth with given message size. |
rccl-bw/${operation}_${msg_size}_busbw | bandwidth (GB/s) | RCCL operation bus bandwidth with given message size. |
If MPI mode is enabled and a traffic pattern is specified, the metric pattern changes to `nccl-bw/${operation}_${serial_index}_${parallel_index}:${msg_size}_time`, where:
- serial_index: the serial index of the host group run in serial.
- parallel_index: the parallel index of the host list run in parallel.
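To make the algbw/busbw distinction in the metrics above concrete, the sketch below shows the nccl-tests-style conversion for allreduce (busbw = algbw × 2(n−1)/n); the message size, elapsed time, and rank count are made-up placeholders.

```python
# Illustrative nccl-tests-style conversion from measured time to algbw and busbw for allreduce.
# Message size, elapsed time, and rank count are made-up placeholders.
msg_bytes = 1 << 30                                  # 1 GiB message
elapsed_s = 0.05                                     # hypothetical measured time
n_ranks = 8

algbw = msg_bytes / elapsed_s / 1e9                  # GB/s seen by the application
busbw = algbw * 2 * (n_ranks - 1) / n_ranks          # GB/s on the wire for allreduce
print(f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s")
```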
### tcp-connectivity
#### Introduction

Test the TCP connectivity between the current node and the nodes in the hostfile, performed by tcping.
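Conceptually, each measurement is a timed TCP connect, roughly as in the sketch below; the hostnames and port are placeholders, and this is not the tcping implementation.

```python
# Simple sketch of timing TCP connects; hostnames and port are placeholders, not the tcping tool.
import socket
import time

def tcp_connect_ms(host: str, port: int = 22, timeout: float = 3.0) -> float:
    t0 = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - t0) * 1e3

for host in ["node-001", "node-002"]:      # hypothetical entries from the hostfile
    try:
        print(host, f"{tcp_connect_ms(host):.2f} ms")
    except OSError:
        print(host, "failed")
```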
#### Metrics

Metrics | Unit | Description |
---|---|---|
tcp-connectivity/${hostname/ip}_successed_count | count | Number of successful TCP connections between the current node and the other node. |
tcp-connectivity/${hostname/ip}_failed_count | count | Number of failed TCP connections between the current node and the other node. |
tcp-connectivity/${hostname/ip}_success_rate | | Success rate (succeeded/total) of TCP connections between the current node and the other node. |
tcp-connectivity/${hostname/ip}_time_min | time (ms) | Minimum latency of TCP connections between the current node and the other node. |
tcp-connectivity/${hostname/ip}_time_max | time (ms) | Maximum latency of TCP connections between the current node and the other node. |
tcp-connectivity/${hostname/ip}_time_avg | time (ms) | Average latency of TCP connections between the current node and the other node. |
### gpcnet-network-test / gpcnet-network-load-test
#### Introduction

Distributed test of global network performance and congestion, performed by GPCNET.
gpcnet-network-test: full-system network tests in random and natural ring, alltoall, and allreduce patterns; requires at least 2 nodes.
gpcnet-network-load-test: selected full-system network tests run with four congestors to measure network congestion or contention; requires at least 10 nodes.
- Supported network tests: RR Two-sided Lat (8 B), RR Get Lat (8 B), RR Two-sided BW (131072 B), RR Put BW (131072 B), RR Two-sided BW+Sync (131072 B), Nat Two-sided BW (131072 B), Multiple Allreduce (8 B), Multiple Alltoall (4096 B)
- Supported congestors: Alltoall (4096 B), Two-sided Incast (4096 B), Put Incast (4096 B), Get Bcast (4096 B)
#### Metrics

Metrics | Unit | Description |
---|---|---|
gpcnet-network-test/rr_two-sided_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided latency' network test. |
gpcnet-network-test/rr_two-sided+sync_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'random ring communication pattern two-sided bandwidth with barrier' network test. |
gpcnet-network-test/multiple_allreduce_time_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'multiple allreduce bandwidth' network test. |
gpcnet-network-test/rr_get_lat_${stat} | time (us) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Get Lat (8 B)' network test. |
gpcnet-network-test/rr_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'RR Two-sided BW (131072 B)' network test. |
gpcnet-network-test/nat_two-sided_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Nat Two-sided BW (131072 B)' network test. |
gpcnet-network-test/multiple_alltoall_bw_${stat} | bandwidth (MiB/s/rank) | Statistical values (min, max, avg, 99%, 99.9%) across all nodes for the 'Multiple Alltoall (4096 B)' network test. |
gpcnet-network-load-test/rr_two-sided_lat_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |
gpcnet-network-load-test/rr_two-sided+sync_bw_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |
gpcnet-network-load-test/multiple_allreduce_x_${stat} | factor (x) | Summary of the congestion impact factor for the corresponding network test algorithm. |
### ib-traffic
#### Introduction

Measure the InfiniBand performance under multi-node traffic patterns.
The direction between client and server can be 'cpu-to-cpu', 'gpu-to-gpu', 'gpu-to-cpu', or 'cpu-to-gpu'.
The traffic pattern is defined in a config file, which is pre-defined for one-to-many, many-to-one, and all-to-all patterns. Each row in the config is one round, and all pairs of nodes in a row run the ib command simultaneously.
Besides the above three patterns, ib-traffic also supports a topology-aware traffic pattern. To run ib-traffic with the topology-aware pattern, the user needs to specify 3 required (and 2 optional) parameters in the YAML config file:
- --pattern: topo-aware
- --ibstat: path to the ibstat output
- --ibnetdiscover: path to the ibnetdiscover output
- --min_dist: minimum distance of VM pairs (optional, default 2)
- --max_dist: maximum distance of VM pairs (optional, default 6)
Each row in the config file contains all VM pairs with a fixed distance (number of hops). By default, the 1st, 2nd, and 3rd rows contain all VM pairs with topology distances of 2, 4, and 6, respectively.
#### Metrics

Metrics | Unit | Description |
---|---|---|
ib-traffic/ib_write_bw_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | bandwidth (GB/s) | The max bandwidth of perftest (ib_write_bw, ib_send_bw, ib_read_bw) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and the client. |
ib-traffic/ib_write_lat_${msg_size}_${direction}_${line}_${pair}:${server}_${client} | time (us) | The max latency of perftest (ib_write_lat, ib_send_lat, ib_read_lat) using ${msg_size} with ${direction} ('cpu-to-cpu'/'gpu-to-gpu'/'gpu-to-cpu'/'cpu-to-gpu'), run between the ${pair}th node pair in the ${line}th line of the config; ${server} and ${client} are the hostnames of the server and the client. |
## Computation-communication Benchmarks

### computation-communication-overlap
#### Introduction

Test the performance of a single node when communication and computation overlap.
#### Metrics

Name | Unit | Description |
---|---|---|
pytorch-computation-communication-overlap/mul_time | time (ms) | Time of communication and mul kernel computation overlap. |
pytorch-computation-communication-overlap/matmul_time | time (ms) | Time of communication and matmul kernel computation overlap. |
### sharding-matmul
#### Introduction

Test the performance of large-scale matmul operations with multiple GPUs (a single-process sketch of the two strategies follows the list):
- allreduce: each GPU calculates part of the MM calculation, and AllReduce is used to merge all data into one tensor.
- allgather: each GPU calculates part of the MM calculation, and AllGather + Concat is used to merge all data into one tensor.
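The sketch below mimics the two strategies in a single process with arbitrary shapes; the real benchmark uses torch.distributed collectives across GPUs.

```python
# Single-process sketch of the two sharding strategies; the real benchmark uses
# torch.distributed collectives across GPUs, and the shapes here are arbitrary.
import torch

world_size = 4
full_a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

# allgather style: each "rank" owns a row block of A, partial results are concatenated.
a_rows = full_a.chunk(world_size, dim=0)
allgather_result = torch.cat([rows @ b for rows in a_rows], dim=0)

# allreduce style: A is sharded by columns and B by rows, partial products are summed.
a_cols = full_a.chunk(world_size, dim=1)
b_rows = b.chunk(world_size, dim=0)
allreduce_result = sum(ac @ br for ac, br in zip(a_cols, b_rows))

reference = full_a @ b
assert torch.allclose(allgather_result, reference, rtol=1e-4, atol=1e-3)
assert torch.allclose(allreduce_result, reference, rtol=1e-4, atol=1e-3)
```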
#### Metrics

Name | Unit | Description |
---|---|---|
pytorch-sharding-matmul/allreduce_time | time (ms) | Time of sharding matmul using allreduce. |
pytorch-sharding-matmul/allgather_time | time (ms) | Time of sharding matmul using allgather. |
### dist-inference
#### Introduction

Test the performance of distributed model inference. Both a PyTorch implementation and a C++ implementation are supported.
#### Metrics

Name | Unit | Description |
---|---|---|
pytorch-dist-inference/step_times | time (ms) | Average time of model inference runs. |
pytorch-dist-inference/step_times_${percentile} | time (ms) | Tail (50th, 90th, 95th, 99th, 99.9th percentile) time of model inference runs. |
## Storage Benchmarks

### disk-benchmark
#### Introduction

Measure the disk performance through FIO.
#### Metrics

Name | Unit | Description |
---|---|---|
disk-benchmark/${disk_name}_rand_read_write_bs | size (bytes) | Block size of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_read_iops | IOPS | Read IOPS of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_95.0 | time (ns) | 95.0th-percentile read latency of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.0 | time (ns) | 99.0th-percentile read latency of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_read_lat_ns_99.9 | time (ns) | 99.9th-percentile read latency of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_write_iops | IOPS | Write IOPS of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_95.0 | time (ns) | 95.0th-percentile write latency of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.0 | time (ns) | 99.0th-percentile write latency of the disk random read-write test. |
disk-benchmark/${disk_name}_rand_read_write_write_lat_ns_99.9 | time (ns) | 99.9th-percentile write latency of the disk random read-write test. |