Releasing SuperBench v0.3

· 4 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.3.0 is officially released today!

You can install and try SuperBench by following the Getting Started Tutorial.

SuperBench 0.3.0 Release Notes#

SuperBench Framework#

Runner#

  • Implement MPI mode.

Benchmarks#

  • Support Docker benchmark.

Single-node Validation#

Micro Benchmarks#

  1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | H2D_Mem_BW_GPU | GB/s | Host-to-GPU bandwidth for each GPU |
    | D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU |
  2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | IB_Write | MB/s | The IB write loopback throughput with different message sizes |
    | IB_Read | MB/s | The IB read loopback throughput with different message sizes |
    | IB_Send | MB/s | The IB send loopback throughput with different message sizes |
  3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes |
    | NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes |
    | NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes |
    | NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes |
    | NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes |
  4. Disk (Tool: FIO – Standard Disk Performance Tool)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | Seq_Read | MB/s | Sequential read performance |
    | Seq_Write | MB/s | Sequential write performance |
    | Rand_Read | MB/s | Random read performance |
    | Rand_Write | MB/s | Random write performance |
    | Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1) |
    | Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1) |
    | Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1) |
    | Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1) |
  5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | H2D_SM_BW_GPU | GB/s | Host-to-GPU bandwidth using GPU kernel for each GPU |
    | D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU |
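
All of the bandwidth metrics above reduce to bytes moved per unit time; the NCCL/RCCL numbers additionally apply the bus-bandwidth scaling defined by the NCCL tests (for AllReduce, a factor of 2·(n−1)/n over the algorithm bandwidth). A rough illustration in plain Python (not SuperBench code, and the example transfer sizes and times are hypothetical):

```python
def bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Convert a measured transfer into GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9

def allreduce_busbw_gbs(msg_bytes: int, seconds: float, ranks: int) -> float:
    """Bus bandwidth for AllReduce as defined by the NCCL tests:
    algorithm bandwidth scaled by 2*(n-1)/n to reflect the data that
    actually crosses the interconnect."""
    algbw = msg_bytes / seconds / 1e9
    return algbw * 2 * (ranks - 1) / ranks

# Example: 4 GiB copied host-to-GPU in 0.2 s
print(round(bandwidth_gbs(4 * 1024**3, 0.2), 2))  # 21.47 (GB/s)

# Example: 1 GB AllReduce across 8 ranks in 50 ms
print(allreduce_busbw_gbs(10**9, 0.05, 8))  # 35.0 (GB/s)
```

This is why the reported bus bandwidth can exceed the per-link algorithm bandwidth as the rank count grows.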

AMD GPU Support#

Docker Image Support#

  • ROCm 4.2 PyTorch 1.7.0
  • ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks#

  1. Kernel Launch (Tool: MSR-A build)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord() |
    | Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time |
  2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

    | Metrics | Unit | Description |
    | ------- | ---- | ----------- |
    | FP64 | GFLOPS | FP64 FLOPS without MatrixCore |
    | FP32(MC) | GFLOPS | FP32 FLOPS with MatrixCore |
    | FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore |
    | BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore |
    | INT8(MC) | GOPS | INT8 OPS with MatrixCore |
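
GEMM FLOPS figures like these follow from the matrix dimensions: an m×k by k×n multiply performs 2·m·n·k floating-point operations (one multiply and one add per inner-product term). A minimal sketch, with a hypothetical shape and elapsed time rather than a real rocblas-bench measurement:

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS for an (m x k) @ (k x n) GEMM: 2*m*n*k operations
    divided by elapsed time, in units of 1e9 ops/s."""
    return 2 * m * n * k / seconds / 1e9

# Example: a 4096^3 GEMM finishing in 1.5 ms
print(round(gemm_gflops(4096, 4096, 4096, 1.5e-3), 1))  # 91626.0
```

Dividing this by the device's peak GFLOPS for the relevant datatype gives the efficiency the benchmark is probing.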

E2E Benchmarks#

  1. CNN models -- Use PyTorch torchvision models

    • ResNet: ResNet-50, ResNet-101, ResNet-152
    • DenseNet: DenseNet-169, DenseNet-201
    • VGG: VGG-11, VGG-13, VGG-16, VGG-19
  2. BERT -- Use Hugging Face Transformers

    • BERT
    • BERT Large
  3. LSTM -- Use PyTorch

  4. GPT-2 -- Use Hugging Face Transformers
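
E2E results for models like these are typically reported as training throughput in samples/sec. A minimal, framework-agnostic sketch of such a measurement loop (`step` is a hypothetical stand-in for one forward/backward pass, not SuperBench's harness):

```python
import time

def measure_throughput(step, batch_size: int, steps: int = 10, warmup: int = 2) -> float:
    """Run step() repeatedly and return samples/sec, discarding warmup
    iterations so one-time setup cost does not skew the result."""
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(steps):
        step()
    elapsed = time.perf_counter() - start
    return steps * batch_size / elapsed

# Usage with a dummy step standing in for a training iteration:
throughput = measure_throughput(lambda: time.sleep(0.001), batch_size=32)
print(throughput > 0)
```

Warmup matters in practice because the first iterations absorb kernel compilation and allocator costs that would otherwise understate steady-state throughput.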

Bug Fix#

  • Fixed VGG models failing on A100 GPUs with batch_size=128

Other Improvements#

  1. Contribution related

    • Contribution rules
    • System information collection
  2. Documentation

    • Add release process doc
    • Add design documents
    • Add developer guide doc for coding style
    • Add contribution rules
    • Add docker image list
    • Add initial validation results