Releasing SuperBench v0.3
We are very happy to announce that SuperBench 0.3.0 is officially released today!
You can install and try SuperBench by following the Getting Started Tutorial.
## SuperBench 0.3.0 Release Notes
### SuperBench Framework

#### Runner

- Implement MPI mode.
#### Benchmarks

- Support Docker benchmark.
### Single-node Validation
#### Micro Benchmarks

**Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)**

| Metrics | Unit | Description |
|---------|------|-------------|
| H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU |
| D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |

**IBLoopback (Tool: PerfTest - Standard RDMA Test Tool)**

| Metrics | Unit | Description |
|---------|------|-------------|
| IB_Write | MB/s | The IB write loopback throughput with different message sizes |
| IB_Read | MB/s | The IB read loopback throughput with different message sizes |
| IB_Send | MB/s | The IB send loopback throughput with different message sizes |

**NCCL/RCCL (Tool: NCCL/RCCL Tests)**

| Metrics | Unit | Description |
|---------|------|-------------|
| NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
| NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes |
| NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes |
| NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes |
| NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |

**Disk (Tool: FIO - Standard Disk Performance Tool)**

| Metrics | Unit | Description |
|---------|------|-------------|
| Seq_Read | MB/s | Sequential read performance |
| Seq_Write | MB/s | Sequential write performance |
| Rand_Read | MB/s | Random read performance |
| Rand_Write | MB/s | Random write performance |
| Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
| Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1) |
| Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1) |
| Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1) |

**H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)**

| Metrics | Unit | Description |
|---------|------|-------------|
| H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU |
| D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
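To make the bandwidth metrics above concrete, here is a minimal PyTorch sketch of what a host-to-device / device-to-host copy-bandwidth measurement looks like. It is illustrative only: the helper name `copy_bandwidth_gbps`, the buffer size, and the repeat count are arbitrary assumptions, and SuperBench's real numbers come from the NVIDIA/AMD bandwidth test tools, not this snippet.

```python
# Illustrative sketch only: approximates what H2D/D2H memory bandwidth metrics
# measure. Not the NVIDIA/AMD bandwidth test tool that SuperBench actually runs.
import torch

def copy_bandwidth_gbps(size_mb: int = 256, repeats: int = 20) -> tuple:
    """Time host-to-device and device-to-host copies and return (H2D, D2H) in GB/s."""
    assert torch.cuda.is_available(), "requires a CUDA/ROCm-enabled PyTorch build"
    nbytes = size_mb * 1024 * 1024
    # Pinned host memory, since pageable memory typically lowers transfer bandwidth.
    host = torch.empty(nbytes, dtype=torch.uint8, pin_memory=True)
    dev = torch.empty(nbytes, dtype=torch.uint8, device="cuda")

    def timed(fn):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()
        start.record()
        for _ in range(repeats):
            fn()
        end.record()
        torch.cuda.synchronize()
        seconds = start.elapsed_time(end) / 1000.0  # elapsed_time() returns milliseconds
        return (nbytes * repeats / seconds) / 1e9   # GB/s

    h2d = timed(lambda: dev.copy_(host, non_blocking=True))
    d2h = timed(lambda: host.copy_(dev, non_blocking=True))
    return h2d, d2h

if __name__ == "__main__":
    h2d, d2h = copy_bandwidth_gbps()
    print(f"H2D ~{h2d:.1f} GB/s, D2H ~{d2h:.1f} GB/s")
```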
### AMD GPU Support
#### Docker Image Support

- ROCm 4.2 PyTorch 1.7.0
- ROCm 4.0 PyTorch 1.7.0
#### Micro Benchmarks

**Kernel Launch (Tool: MSR-A build)**

| Metrics | Unit | Description |
|---------|------|-------------|
| Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
| Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time |

**GEMM FLOPS (Tool: AMD rocblas-bench Tool)**

| Metrics | Unit | Description |
|---------|------|-------------|
| FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
| FP32(MC) | GFLOPS | TF32 FLOPS with MatrixCore |
| FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
| BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
| INT8(MC) | GOPS | INT8 FLOPS with MatrixCore |
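As a rough illustration of the difference between the two Kernel Launch metrics (GPU event time vs. CPU wall time), the sketch below launches a tiny kernel in a loop and times it both with device events and with a host timer. The actual measurement ships as a dedicated MSR-A build using hipEventRecord()/cudaEventRecord(); this PyTorch approximation and its loop counts are assumptions for illustration only.

```python
# Illustrative sketch only: event-time vs. wall-time view of kernel launch latency.
# Not the MSR-A tool that produces the Kernel Launch metrics above.
import time
import torch

def kernel_launch_latency(repeats: int = 1000) -> tuple:
    assert torch.cuda.is_available(), "requires a CUDA/ROCm-enabled PyTorch build"
    x = torch.ones(1, device="cuda")
    torch.cuda.synchronize()

    # "Event time": measured on the GPU timeline with device events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(repeats):
        x.add_(1)  # tiny kernel, so launch overhead dominates
    end.record()
    torch.cuda.synchronize()
    event_ms = start.elapsed_time(end) / repeats

    # "Wall time": measured on the CPU, including host-side launch overhead.
    t0 = time.perf_counter()
    for _ in range(repeats):
        x.add_(1)
    torch.cuda.synchronize()
    wall_ms = (time.perf_counter() - t0) * 1000 / repeats

    return event_ms, wall_ms

if __name__ == "__main__":
    event_ms, wall_ms = kernel_launch_latency()
    print(f"event ~{event_ms:.4f} ms/launch, wall ~{wall_ms:.4f} ms/launch")
```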
#### E2E Benchmarks

- CNN models -- Use PyTorch torchvision models (see the sketch after this list)
  - ResNet: ResNet-50, ResNet-101, ResNet-152
  - DenseNet: DenseNet-169, DenseNet-201
  - VGG: VGG-11, VGG-13, VGG-16, VGG-19
- BERT -- Use huggingface Transformers
  - BERT
  - BERT Large
- LSTM -- Use PyTorch
- GPT-2 -- Use huggingface Transformers
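For a sense of what the CNN E2E benchmarks exercise, here is a minimal PyTorch sketch that times ResNet-50 training steps on synthetic data. The batch size, optimizer, warm-up scheme, and images/sec formula are assumptions for illustration; they are not SuperBench's exact harness or configuration.

```python
# Illustrative sketch only: times training steps of a torchvision CNN model
# on synthetic data, similar in spirit to the CNN E2E benchmarks.
import time
import torch
import torchvision

def train_step_throughput(batch_size: int = 32, steps: int = 10) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torchvision.models.resnet50().to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (batch_size,), device=device)

    model.train()
    # One warm-up step so lazy initialization is not included in the timing.
    criterion(model(images), labels).backward()
    optimizer.step()
    optimizer.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()

    t0 = time.perf_counter()
    for _ in range(steps):
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if device == "cuda":
        torch.cuda.synchronize()
    return batch_size * steps / (time.perf_counter() - t0)  # images/sec

if __name__ == "__main__":
    print(f"ResNet-50 training throughput: ~{train_step_throughput():.1f} images/sec")
```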
### Bug Fix

- VGG models failed on A100 GPU with batch_size=128
### Other Improvements

#### Contribution related

- Contribution rules
- System information collection
#### Documentation
- Add release process doc
- Add design documents
- Add developer guide doc for coding style
- Add contribution rules
- Add Docker image list
- Add initial validation results