We are very happy to announce that SuperBench 0.3.0 is officially released today!
You can install and try SuperBench by following the Getting Started Tutorial.
## SuperBench 0.3.0 Release Notes

### SuperBench Framework

#### Runner

#### Benchmarks
- Support Docker benchmark.
### Single-node Validation

#### Micro Benchmarks
Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)
| Metrics | Unit | Description |
|---|---|---|
| H2D_Mem_BW_GPU | GB/s | Host-to-GPU bandwidth for each GPU |
| D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |
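
To give a feel for what these metrics measure, below is a minimal PyTorch sketch of a host/device copy-bandwidth probe. It is only an illustration, not the NVIDIA/AMD bandwidth test tool that SuperBench runs; it assumes a CUDA or ROCm build of PyTorch with one visible GPU, and the function name and defaults are made up for the example.

```python
import time
import torch

def copy_bandwidth_gbps(size_mb=256, iters=20, to_device=True):
    """Time pinned-host <-> GPU copies and report GB/s.

    Illustration of the H2D/D2H metrics only; the release uses the
    NVIDIA/AMD bandwidth test tools, not this script.
    """
    n = size_mb * 1024 * 1024
    host = torch.empty(n, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(n, dtype=torch.uint8, device="cuda")

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        if to_device:
            dev.copy_(host, non_blocking=True)   # H2D direction
        else:
            host.copy_(dev, non_blocking=True)   # D2H direction
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters * n / elapsed / 1e9

if __name__ == "__main__":
    print(f"H2D ~ {copy_bandwidth_gbps(to_device=True):.1f} GB/s")
    print(f"D2H ~ {copy_bandwidth_gbps(to_device=False):.1f} GB/s")
```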
IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)
| Metrics | Unit | Description |
|---|---|---|
| IB_Write | MB/s | The IB write loopback throughput with different message sizes |
| IB_Read | MB/s | The IB read loopback throughput with different message sizes |
| IB_Send | MB/s | The IB send loopback throughput with different message sizes |
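
The sketch below shows the general idea of a loopback run with perftest's ib_write_bw: a server and a client are launched on the same host. It is a hypothetical wrapper for illustration only, assuming perftest is installed and an InfiniBand device is present; SuperBench's ib-loopback benchmark handles device selection, NUMA binding, and the message-size sweep itself.

```python
import subprocess
import time

def ib_write_loopback(msg_size=8388608):
    """Launch a local ib_write_bw server/client pair (perftest).

    Hypothetical wrapper for illustration only; assumes the perftest
    package is installed and an IB device is available.
    """
    # Server side waits for a connection on the default port.
    server = subprocess.Popen(
        ["ib_write_bw", "-s", str(msg_size)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    time.sleep(1)  # give the server a moment to start listening
    # Client side connects to the server on the same host (loopback).
    client = subprocess.run(
        ["ib_write_bw", "-s", str(msg_size), "127.0.0.1"],
        capture_output=True,
        text=True,
    )
    server.wait()
    print(client.stdout)   # throughput table reported by perftest

if __name__ == "__main__":
    ib_write_loopback()
```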
NCCL/RCCL (Tool: NCCL/RCCL Tests)
| Metrics | Unit | Description |
|---|---|---|
| NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
| NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes |
| NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes |
| NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes |
| NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |
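
As a rough illustration of the AllReduce metric, the sketch below measures ring-AllReduce bus bandwidth through PyTorch's NCCL/RCCL backend. It is not the nccl-tests/rccl-tests binary the benchmark uses; the script, its function name, and its defaults are assumptions for the example, and it expects to be launched with torchrun, one process per GPU.

```python
import os
import time
import torch
import torch.distributed as dist

def allreduce_busbw(size_mb=64, iters=20):
    """Rough AllReduce bus-bandwidth probe via PyTorch's NCCL/RCCL backend.

    Illustration only; the release measures with the official
    nccl-tests / rccl-tests binaries. Launch with torchrun.
    """
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    world = dist.get_world_size()

    n = size_mb * 1024 * 1024 // 4              # float32 elements
    x = torch.ones(n, device="cuda")

    for _ in range(5):                          # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    # Ring-AllReduce bus bandwidth: 2 * (N - 1) / N * bytes / time per op.
    busbw = 2 * (world - 1) / world * n * 4 * iters / elapsed / 1e9
    if dist.get_rank() == 0:
        print(f"NCCL_AllReduce ~ {busbw:.1f} GB/s bus bandwidth")
    dist.destroy_process_group()

if __name__ == "__main__":
    allreduce_busbw()
```

A typical single-node launch would be `torchrun --nproc_per_node=8 allreduce_probe.py` (script name hypothetical).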
Disk (Tool: FIO – Standard Disk Performance Tool)
| Metrics | Unit | Description |
|---|---|---|
| Seq_Read | MB/s | Sequential read performance |
| Seq_Write | MB/s | Sequential write performance |
| Rand_Read | MB/s | Random read performance |
| Rand_Write | MB/s | Random write performance |
| Seq_R/W_Read | MB/s | Read performance in mixed sequential read/write with a fixed read:write ratio of 4:1 |
| Seq_R/W_Write | MB/s | Write performance in mixed sequential read/write with a fixed read:write ratio of 4:1 |
| Rand_R/W_Read | MB/s | Read performance in mixed random read/write with a fixed read:write ratio of 4:1 |
| Rand_R/W_Write | MB/s | Write performance in mixed random read/write with a fixed read:write ratio of 4:1 |
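
To make the metric definitions concrete, here is a very rough sequential write/read probe in Python. It is only a sketch: fio issues direct I/O with configurable block sizes and queue depths that plain buffered Python I/O does not reproduce, so the numbers it prints are not comparable to the fio-based results.

```python
import os
import time

def seq_write_read_mbps(path="disk_probe.bin", size_mb=512, block_kb=1024):
    """Very rough sequential write/read throughput probe.

    Illustration of the Seq_Write / Seq_Read metrics only; not the
    fio-based benchmark the release ships.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                  # push data to the device
    write_mbps = size_mb / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    # Reads are likely served from the page cache, unlike fio's direct I/O.
    read_mbps = size_mb / (time.perf_counter() - start)

    os.remove(path)
    return write_mbps, read_mbps

if __name__ == "__main__":
    w, r = seq_write_read_mbps()
    print(f"Seq_Write ~ {w:.0f} MB/s, Seq_Read ~ {r:.0f} MB/s")
```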
H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)
| Metrics | Unit | Description |
|---|---|---|
| H2D_SM_BW_GPU | GB/s | Host-to-GPU bandwidth using a GPU copy kernel, for each GPU |
| D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using a GPU copy kernel, for each GPU |
### AMD GPU Support

#### Docker Image Support
- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0
#### Micro Benchmarks
Kernel Launch (Tool: MSR-A build)
| Metrics | Unit | Description |
|---|---|---|
| Kernel_Launch_Event_Time | ms | Dispatch latency measured in GPU time using hipEventRecord() |
| Kernel_Launch_Wall_Time | ms | Dispatch latency measured in CPU wall time |
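
Below is a minimal sketch of the two ways of timing launch cost, assuming a CUDA or ROCm build of PyTorch. The released benchmark is a dedicated MSR-A binary built on cudaEventRecord()/hipEventRecord(); this script only approximates the per-launch cost of a stream of tiny kernels, showing the event-based and wall-clock views side by side.

```python
import time
import torch

def kernel_launch_cost(iters=2000):
    """Approximate per-launch cost of tiny kernels, two ways.

    Illustration only; not the MSR-A kernel-launch benchmark binary.
    """
    x = torch.zeros(1, device="cuda")
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    for _ in range(100):                     # warm-up
        x.add_(1)
    torch.cuda.synchronize()

    # GPU (event) time spanning a stream of tiny kernel launches.
    start_evt.record()
    for _ in range(iters):
        x.add_(1)
    end_evt.record()
    torch.cuda.synchronize()
    event_us = start_evt.elapsed_time(end_evt) * 1000 / iters

    # CPU (wall) time to submit and drain the same stream of launches.
    t0 = time.perf_counter()
    for _ in range(iters):
        x.add_(1)
    torch.cuda.synchronize()
    wall_us = (time.perf_counter() - t0) * 1e6 / iters

    print(f"event ~ {event_us:.1f} us/launch, wall ~ {wall_us:.1f} us/launch")

if __name__ == "__main__":
    kernel_launch_cost()
```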
GEMM FLOPS (Tool: AMD rocblas-bench)
| Metrics | Unit | Description |
|---|---|---|
| FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
| INT8(MC) | GOPS | INT8 OPS with MatrixCore |
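
For intuition about how the GEMM FLOPS numbers are defined (2·M·N·K floating-point operations per GEMM divided by elapsed time), here is a hedged PyTorch sketch. It measures achieved matmul throughput rather than the peak numbers rocblas-bench reports, and its shapes and defaults are arbitrary choices for the example.

```python
import time
import torch

def gemm_tflops(m=8192, n=8192, k=8192, dtype=torch.float16, iters=20):
    """Achieved GEMM throughput via torch.matmul, in TFLOPS.

    Illustration of how the metric is defined; the release reports
    numbers from rocblas-bench (AMD), not from a PyTorch matmul.
    """
    a = torch.randn(m, k, dtype=dtype, device="cuda")
    b = torch.randn(k, n, dtype=dtype, device="cuda")

    for _ in range(5):                      # warm-up
        torch.matmul(a, b)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    flops = 2 * m * n * k * iters           # 2*M*N*K FLOPs per GEMM
    return flops / elapsed / 1e12

if __name__ == "__main__":
    print(f"FP16 GEMM ~ {gemm_tflops():.1f} TFLOPS")
```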
#### E2E Benchmarks
- CNN models -- use PyTorch torchvision models
  - ResNet: ResNet-50, ResNet-101, ResNet-152
  - DenseNet: DenseNet-169, DenseNet-201
  - VGG: VGG-11, VGG-13, VGG-16, VGG-19
- BERT -- use Hugging Face Transformers
- LSTM -- use PyTorch
- GPT-2 -- use Hugging Face Transformers
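
As an illustration of what the E2E model benchmarks report, the sketch below times a ResNet-50 training loop from torchvision and prints images per second. The batch size, precision, and synthetic data pipeline here are placeholder choices and do not match the released workload configurations.

```python
import time
import torch
import torchvision

def resnet50_train_throughput(batch_size=128, steps=20):
    """Rough images/sec probe for a torchvision ResNet-50 training loop.

    Illustration of the E2E CNN benchmarks only; configuration differs
    from the released workloads.
    """
    model = torchvision.models.resnet50().cuda().train()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data keeps the sketch self-contained.
    images = torch.randn(batch_size, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (batch_size,), device="cuda")

    for _ in range(5):                      # warm-up
        opt.zero_grad()
        loss_fn(model(images), labels).backward()
        opt.step()
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(images), labels).backward()
        opt.step()
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    print(f"ResNet-50 ~ {steps * batch_size / elapsed:.0f} images/s")

if __name__ == "__main__":
    resnet50_train_throughput()
```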
### Bug Fix
- Fixed: VGG models failed on the A100 GPU with batch_size=128.
### Other Improvements

- Contribution related
  - Contribution rules
  - System information collection
- Documentation
  - Add release process doc
  - Add design documents
  - Add developer guide doc for coding style
  - Add contribution rules
  - Add docker image list
  - Add initial validation results