Releasing SuperBench v0.10

· 2 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.10.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.10.0 Release Notes#

SuperBench Improvements#

  • Support monitoring for AMD GPUs.
  • Support ROCm 5.7 and ROCm 6.0 dockerfile.
  • Add MSCCL support for Nvidia GPU.
  • Fix NUMA domains swap issue in NDv4 topology file.
  • Add NDv5 topo file.
  • Pin NCCL and NCCL-test to 2.18.3 to avoid a hang issue in CUDA 12.2.

Micro-benchmark Improvements#

  • Add HPL random generator to gemm-flops with ROCm.
  • Add DirectXGPURenderFPS benchmark to measure the FPS of rendering simple frames.
  • Add HWDecoderFPS benchmark to measure the FPS of hardware decoder performance.
  • Update Docker image for H100 support.
  • Update MLC version to 3.10 for CUDA/ROCm dockerfile.
  • Bug fix for GPU Burn test.
  • Support INT8 in cublaslt function.
  • Add hipBLASLt function benchmark.
  • Support cpu-gpu and gpu-cpu in ib-validation.
  • Support graph mode in NCCL/RCCL benchmarks for latency metrics.
  • Support cpp implementation in distributed inference benchmark.
  • Add O2 option for gpu copy ROCm build.
  • Support different hipblasLt data types in dist inference.
  • Support in-place in NCCL/RCCL benchmark.
  • Support data type option in NCCL/RCCL benchmark.
  • Improve P2P performance with fine-grained GPU memory in GPU-copy test for AMD GPUs.
  • Update hipblaslt GEMM metric unit to tflops.
  • Support FP8 for hipblaslt benchmark.

Model Benchmark Improvements#

  • Change torch.distributed.launch to torchrun.
  • Support Megatron-LM/Megatron-Deepspeed GPT pretrain benchmark.

Result Analysis#

  • Support baseline generation from multiple nodes.

Releasing SuperBench v0.9

· One min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.9.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.9.0 Release Notes#

SuperBench Improvement#

  • Support Ctrl+C and interrupt to stop all SuperBench testing.
  • Support Windows Docker for VDI/Gaming GPU.
  • Support DirectX platform for Nvidia and AMD GPU.
  • Add System Config Info feature in SB runner to support distributed collection.
  • Support DirectX test pipeline.

Micro-benchmark Improvement#

  • Add DirectXGPUCopyBw Benchmark to measure HtoD/DtoH bandwidth by DirectX.
  • Add DirectXGPUCoreFLops Benchmark to measure peak FLOPS by DirectX.
  • Add DirectXGPUMemBw Benchmark to measure GPU memory bandwidth by DirectX.
  • Add DirectXVCNEncodingLatency Benchmark to measure the VCN hardware encoding latency on AMD graphic GPUs.
  • Support best algorithm selection in cudnn-function microbenchmark.
  • Revise step time collection in distributed inference benchmark.

Model Benchmark Improvement#

  • Fix early stop logic due to num_steps in model benchmarks.
  • Support TensorRT models on Nvidia H100.

Documentation#

  • Improve documentation for System Config Info.
  • Update outdated references.

Releasing SuperBench v0.8

· 2 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.8.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.8.0 Release Notes#

SuperBench Improvements#

  • Support SuperBench Executor running on Windows.
  • Remove fixed rccl version in rocm5.1.x docker file.
  • Upgrade networkx version to fix installation compatibility issue.
  • Pin setuptools version to v65.7.0.
  • Limit ansible_runner version for Python 3.6.
  • Support cgroup V2 when reading system metrics in monitor.
  • Fix analyzer bug in Python 3.8 due to pandas api change.
  • Collect real-time GPU power in monitor.
  • Remove unreachable condition when writing the host list in mpi mode.
  • Upgrade Docker image with cuda12.1, nccl 2.17.1-1, hpcx v2.14, and mlc 3.10.
  • Fix wrong unit of cpu-memory-bw-latency in document.

Micro-benchmark Improvements#

  • Add STREAM benchmark for sustainable memory bandwidth and the corresponding computation rate.
  • Add HPL Benchmark for HPC Linpack Benchmark.
  • Support flexible warmup and non-random data initialization in cublas-benchmark.
  • Support error tolerance in micro-benchmark for CuDNN function.
  • Add distributed inference benchmark.
  • Support tensor core precisions (e.g., FP8) and batch/shape range in cublaslt gemm.

Model Benchmark Improvements#

  • Fix torch.dist init issue with multiple models.
  • Support TE FP8 in BERT/GPT2 model.
  • Add num_workers configurations in model benchmark.

Releasing SuperBench v0.7

· One min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.7.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.7.0 Release Notes#

SuperBench Improvement#

  • Support non-zero return code when "sb deploy" or "sb run" fails in Ansible.
  • Support log flushing to the result file during runtime.
  • Update version to include revision hash and date.
  • Support "pattern" in mpi mode to run tasks in parallel.
  • Support topo-aware, all-pair, and K-batch pattern in mpi mode.
  • Pin Transformers version to avoid TensorRT failure.
  • Add CUDA11.8 Docker image for NVIDIA arch90 GPUs.
  • Support "sb deploy" without pulling image.

Micro-benchmark Improvements#

  • Support list of custom config string in cudnn-functions and cublas-functions.
  • Support correctness check in cublas-functions.
  • Support GEMM-FLOPS for NVIDIA arch90 GPUs.
  • Support cuBLASLt FP16 and FP8 GEMM.
  • Add wait time option to resolve mem-bw unstable issue.
  • Fix bug for incorrect datatype judgement in cublas-function source code.

Model Benchmark Improvements#

  • Support FP8 in BERT model training.

Distributed Benchmark Improvements#

  • Support pair-wise pattern in IB validation benchmark.
  • Support topo-aware, pair-wise, and K-batch pattern in nccl-bw benchmark.
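
To illustrate what a pair-wise pattern can look like, the sketch below uses the standard round-robin (circle) method to split all node pairs into rounds in which each node appears at most once, so the pairs within a round can run in parallel. This is an assumption about how such a schedule can be generated, not SuperBench's actual implementation, and `pairwise_rounds` is a hypothetical helper name.

```python
from typing import List, Tuple


def pairwise_rounds(num_nodes: int) -> List[List[Tuple[int, int]]]:
    """Split all node pairs into rounds where each node is used at most once.

    Uses the round-robin "circle" method: the first node stays fixed while
    the remaining nodes rotate by one position per round.
    """
    nodes = list(range(num_nodes))
    if num_nodes % 2 == 1:
        nodes.append(-1)  # dummy node: whoever is paired with it sits out
    n = len(nodes)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            a, b = nodes[i], nodes[n - 1 - i]
            if a != -1 and b != -1:
                pairs.append((min(a, b), max(a, b)))
        rounds.append(pairs)
        # Keep nodes[0] fixed and rotate the rest one step to the right.
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds
```

Calling `pairwise_rounds(4)` yields three rounds of two pairs each, and every one of the six possible pairs appears exactly once.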

Releasing SuperBench v0.6

· 2 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.6.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.6.0 Release Notes#

SuperBench Improvement#

  • Support running on host directly without Docker.
  • Support running sb command inside docker image.
  • Support ROCm 5.1.1.
  • Support ROCm 5.1.3.
  • Fix bugs in data diagnosis.
  • Fix cmake and build issues.
  • Support automatic configuration yaml selection on Azure VM.
  • Refine error message when GPU is not detected.
  • Add return code for Timeout.
  • Update Dockerfile for NCCL/RCCL version, tag name, and verbose output.
  • Support node_num=1 in mpi mode.
  • Update Python setup for required packages.
  • Enhance parameter parsing to allow spaces in value.
  • Support NO_COLOR for SuperBench output.
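
For readers unfamiliar with NO_COLOR, the sketch below shows the general convention: when the `NO_COLOR` environment variable is set to a non-empty value, the program emits no ANSI color codes. It is a minimal illustration of that convention, not SuperBench's own code; `color_enabled` and `colorize` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def color_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when NO_COLOR is set to a non-empty value."""
    env = os.environ if env is None else env
    return not env.get("NO_COLOR")


def colorize(text: str, ansi_code: str = "32",
             env: Optional[Mapping[str, str]] = None) -> str:
    """Wrap text in an ANSI color escape unless color is disabled."""
    if not color_enabled(env):
        return text
    return f"\033[{ansi_code}m{text}\033[0m"
```

Passing `env` explicitly keeps the helpers easy to test; in normal use they fall back to `os.environ`.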

Micro-benchmark Improvements#

  • Fix issues in ib loopback benchmark.
  • Fix stability issue in ib loopback benchmark.

Distributed Benchmark Improvements#

  • Enhance pair-wise IB benchmark.
  • Bug Fix in IB benchmark.
  • Support topology-aware IB benchmark.

Data Diagnosis and Analysis#

  • Add failure check function in data_diagnosis.py.
  • Support JSON and JSONL in Diagnosis.
  • Add support to store values of metrics in data diagnosis.
  • Support exit code of sb result diagnosis.
  • Format int type and unify empty value to N/A in diagnosis output files.

Releasing SuperBench v0.5

· 2 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.5.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.5.0 Release Notes#

Micro-benchmark Improvements#

  • Support NIC only NCCL bandwidth benchmark on single node in NCCL/RCCL bandwidth test.
  • Support bi-directional bandwidth benchmark in GPU copy bandwidth test.
  • Support data checking in GPU copy bandwidth test.
  • Update rccl-tests submodule to fix divide by zero error.
  • Add GPU-Burn micro-benchmark.

Model-benchmark Improvements#

  • Sync results on root rank for e2e model benchmarks in distributed mode.
  • Support customized env in local and torch.distributed mode.
  • Add support for pytorch>=1.9.0.
  • Keep BatchNorm as fp32 for pytorch cnn models cast to fp16.
  • Remove FP16 sample type conversion time.
  • Support FAMBench.

Inference Benchmark Improvements#

  • Revise the default setting for inference benchmark.
  • Add percentile metrics for inference benchmarks.
  • Support T4 and A10 in GEMM benchmark.
  • Add configuration with inference benchmark.
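
As a rough idea of what percentile metrics for latency look like, the sketch below computes nearest-rank percentiles over a list of per-request latencies. The percentile method, the chosen percentiles, and the metric names (`p50`, `p99`, and so on) are assumptions for illustration; the release notes do not specify how SuperBench computes or names them.

```python
import math
from typing import Dict, Sequence


def nearest_rank_percentile(samples: Sequence[float], percent: int) -> float:
    """Return the smallest sample such that `percent`% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so exact ranks are not disturbed by rounding.
    rank = math.ceil(percent * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_ms: Sequence[float],
                    percents: Sequence[int] = (50, 90, 95, 99)) -> Dict[str, float]:
    """Summarize latencies as a mapping such as {"p50": ..., "p99": ...}."""
    return {f"p{p}": nearest_rank_percentile(latencies_ms, p) for p in percents}
```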

Other Improvements#

  • Add command to support listing all optional parameters for benchmarks.
  • Unify benchmark naming convention and support multiple tests with same benchmark and different parameters/options in one configuration file.
  • Support timeout to detect the benchmark failure and stop the process automatically.
  • Add rocm5.0 dockerfile.
  • Improve output interface.

Data Diagnosis and Analysis#

  • Support multi-benchmark check.
  • Support result summary in md, html and excel formats.
  • Support data diagnosis in md and html formats.
  • Support result output for all nodes in data diagnosis.

Releasing SuperBench v0.4

· One min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.4.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.4.0 Release Notes#

SuperBench Framework#

Monitor#

  • Add monitor framework for NVIDIA GPU, CPU, memory and disk.

Data Diagnosis and Analysis#

  • Support baseline-based data diagnosis.
  • Support basic analysis feature (boxplot figure, outlier detection, etc.).
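
The idea behind baseline-based diagnosis can be sketched as follows: compare each measured metric against a baseline value and flag those that are missing or deviate by more than a tolerance. This is a simplified assumption for illustration only; the function name, the symmetric relative tolerance, and the metric names are hypothetical and do not reflect SuperBench's actual rule format.

```python
from typing import Dict, List


def diagnose(metrics: Dict[str, float], baseline: Dict[str, float],
             tolerance: float = 0.05) -> List[str]:
    """Return the baseline metrics that are missing or out of tolerance."""
    failed = []
    for name, expected in baseline.items():
        value = metrics.get(name)
        if value is None:
            failed.append(name)  # a missing result counts as a failure
            continue
        if expected == 0:
            deviates = value != 0
        else:
            deviates = abs(value - expected) / abs(expected) > tolerance
        if deviates:
            failed.append(name)
    return failed
```

For example, with a baseline of `{"bw": 100.0, "lat": 1.0}`, a node reporting `{"bw": 98.0, "lat": 1.2}` would be flagged only for `lat`.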

Single-node Validation#

Micro Benchmarks#

  • CPU Memory Validation (tool: Intel Memory Latency Checker).
  • GPU Copy Bandwidth (tool: built by MSRA).
  • Add ORT Model on AMD GPU platform.
  • Add inference backend TensorRT.
  • Add inference backend ORT.

Multi-node Validation#

Micro Benchmarks#

  • IB Networking validation.
  • TCP validation (tool: TCPing).
  • GPCNet Validation (tool: GPCNet).

Other Improvement#

  1. Enhancement

    • Add pipeline for AMD docker.
    • Integrate system config info script with SuperBench.
    • Support FP32 mode without TF32.
    • Refine unit test for microbenchmark.
    • Unify metric names for all benchmarks.
  2. Document

    • Add benchmark list
    • Add monitor document
    • Add data diagnosis document

Releasing SuperBench v0.3

· 4 min read
Peng Cheng
SuperBench Team

We are very happy to announce that SuperBench 0.3.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.3.0 Release Notes#

SuperBench Framework#

Runner#

  • Implement MPI mode.

Benchmarks#

  • Support Docker benchmark.

Single-node Validation#

Micro Benchmarks#

  1. Memory (Tool: NVIDIA/AMD Bandwidth Test Tool)

    Metrics | Unit | Description
    H2D_Mem_BW_GPU | GB/s | host-to-GPU bandwidth for each GPU
    D2H_Mem_BW_GPU | GB/s | GPU-to-host bandwidth for each GPU
  2. IBLoopback (Tool: PerfTest – Standard RDMA Test Tool)

    Metrics | Unit | Description
    IB_Write | MB/s | The IB write loopback throughput with different message sizes
    IB_Read | MB/s | The IB read loopback throughput with different message sizes
    IB_Send | MB/s | The IB send loopback throughput with different message sizes
  3. NCCL/RCCL (Tool: NCCL/RCCL Tests)

    Metrics | Unit | Description
    NCCL_AllReduce | GB/s | The NCCL AllReduce performance with different message sizes
    NCCL_AllGather | GB/s | The NCCL AllGather performance with different message sizes
    NCCL_broadcast | GB/s | The NCCL Broadcast performance with different message sizes
    NCCL_reduce | GB/s | The NCCL Reduce performance with different message sizes
    NCCL_reduce_scatter | GB/s | The NCCL ReduceScatter performance with different message sizes
  4. Disk (Tool: FIO – Standard Disk Performance Tool)

    Metrics | Unit | Description
    Seq_Read | MB/s | Sequential read performance
    Seq_Write | MB/s | Sequential write performance
    Rand_Read | MB/s | Random read performance
    Rand_Write | MB/s | Random write performance
    Seq_R/W_Read | MB/s | Read performance in sequential read/write, fixed measurement (read:write = 4:1)
    Seq_R/W_Write | MB/s | Write performance in sequential read/write (read:write = 4:1)
    Rand_R/W_Read | MB/s | Read performance in random read/write (read:write = 4:1)
    Rand_R/W_Write | MB/s | Write performance in random read/write (read:write = 4:1)
  5. H2D/D2H SM Transmission Bandwidth (Tool: MSR-A build)

    Metrics | Unit | Description
    H2D_SM_BW_GPU | GB/s | host-to-GPU bandwidth using GPU kernel for each GPU
    D2H_SM_BW_GPU | GB/s | GPU-to-host bandwidth using GPU kernel for each GPU

AMD GPU Support#

Docker Image Support#

  • ROCm 4.2 PyTorch 1.7.0
  • ROCm 4.0 PyTorch 1.7.0

Micro Benchmarks#

  1. Kernel Launch (Tool: MSR-A build)

    Metrics | Unit | Description
    Kernel_Launch_Event_Time | Time (ms) | Dispatch latency measured in GPU time using hipEventRecord()
    Kernel_Launch_Wall_Time | Time (ms) | Dispatch latency measured in CPU time
  2. GEMM FLOPS (Tool: AMD rocblas-bench Tool)

    Metrics | Unit | Description
    FP64 | GFLOPS | FP64 FLOPS without MatrixCore
    FP32(MC) | GFLOPS | TF32 FLOPS with MatrixCore
    FP16(MC) | GFLOPS | FP16 FLOPS with MatrixCore
    BF16(MC) | GFLOPS | BF16 FLOPS with MatrixCore
    INT8(MC) | GOPS | INT8 FLOPS with MatrixCore
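
As a reminder of how a GEMM throughput figure is conventionally derived, a matrix multiply of an M×K matrix by a K×N matrix is counted as 2·M·N·K floating-point operations (one multiply and one add per inner-product term), divided by the elapsed time. The sketch below applies that convention; whether the tool used here counts operations in exactly this way is an assumption, and `gemm_gflops` is a hypothetical helper name.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS for C = A @ B with A of shape (m, k) and B of shape (k, n)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flop_count = 2 * m * n * k  # one multiply and one add per inner-product term
    return flop_count / seconds / 1e9
```

For instance, a 1000×1000×1000 GEMM that takes one second corresponds to 2 GFLOPS under this convention.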

E2E Benchmarks#

  1. CNN models -- Use PyTorch torchvision models

    • ResNet: ResNet-50, ResNet-101, ResNet-152
    • DenseNet: DenseNet-169, DenseNet-201
    • VGG: VGG-11, VGG-13, VGG-16, VGG-19
  2. BERT -- Use huggingface Transformers

    • BERT
    • BERT Large
  3. LSTM -- Use PyTorch

  4. GPT-2 -- Use huggingface Transformers

Bug Fix#

  • VGG models failed on A100 GPU with batch_size=128

Other Improvement#

  1. Contribution related

    • Contribution rule
    • System information collection
  2. Document

    • Add release process doc
    • Add design documents
    • Add developer guide doc for coding style
    • Add contribution rules
    • Add docker image list
    • Add initial validation results

Releasing SuperBench v0.2

· One min read
SuperBench Team

We are very happy to announce that SuperBench 0.2.0 version is officially released today!

You can install and try superbench by following Getting Started Tutorial.

SuperBench 0.2.0 Release Notes#

SuperBench Framework#

  • Implemented a CLI to provide a command line interface.
  • Implemented Runner for nodes control and management.
  • Implemented Executor.
  • Implemented Benchmark framework.

Supported Benchmarks#

  • Supported Micro-benchmarks
    • GEMM FLOPS (GFLOPS, TensorCore, cuBLAS, cuDNN)
    • Kernel Launch Time (Kernel_Launch_Event_Time, Kernel_Launch_Wall_Time)
    • Operator Performance (MatMul, Sharding_MatMul)
  • Supported Model-benchmarks
    • CNN models (Reference: torchvision models)
      • ResNet (ResNet-18, ResNet-34, ResNet-50, ResNet-101, ResNet-152)
      • DenseNet (DenseNet-161, DenseNet-169, DenseNet-201)
      • VGG (VGG-11, VGG-13, VGG-16, VGG-19, VGG11_bn, VGG13_bn, VGG16_bn, VGG19_bn)
      • MNASNet (mnasnet0_5, mnasnet0_75, mnasnet1_0, mnasnet1_3)
      • AlexNet
      • GoogLeNet
      • Inception_v3
      • mobilenet_v2
      • ResNeXt (resnext50_32x4d, resnext101_32x8d)
      • Wide ResNet (wide_resnet50_2, wide_resnet101_2)
      • ShuffleNet (shufflenet_v2_x0_5, shufflenet_v2_x1_0, shufflenet_v2_x1_5, shufflenet_v2_x2_0)
      • SqueezeNet (squeezenet1_0, squeezenet1_1)
    • LSTM model
    • BERT models (BERT-Base, BERT-Large)
    • GPT-2 model (specify which config)

Examples and Documents#

  • Added examples to run benchmarks respectively.
  • Tutorial Documents (introduction, getting-started, developer-guides, APIs, benchmarks).
  • Built SuperBench website.

Introduce SuperBench

· 3 min read
SuperBench Team

This blog post introduces SuperBench, a tool that helps you validate AI infrastructure.

The Advantages of SuperBench#

Easy-to-use CLI#

To provide a good user experience, SuperBench offers a command line interface that helps users deploy and run benchmarks. With the SuperBench CLI, users can deploy and run their benchmarks with only one command, which greatly shortens the learning curve and makes it easy to evaluate the performance of AI workloads.

Below is a simple example showing how to deploy and run benchmarks locally. For more information, please view the CLI Document.

  1. Deploy

    sb deploy -f local.ini
  2. Run Benchmark

    sb run -f local.ini -c config.yaml

Here, local.ini is the configuration file that manages the worker nodes that will actually run the benchmarks. In the case below, the worker node is localhost, the same as the control node.

local.ini
[all]
localhost ansible_connection=local

config.yaml is the configuration file that specifies the details of the benchmarks. You can customize your benchmarks by modifying this file.

For more information, please view the configuration document.
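
The deploy and run steps above can also be scripted. The sketch below writes the local.ini inventory shown earlier and builds the two `sb` commands from this post. The helper names and the `dry_run` flag are hypothetical, and the config.yaml file is assumed to exist already; with `dry_run=True` (the default) the commands are only printed, so the sketch runs even where SuperBench is not installed.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Inventory from this post: the worker node is the local machine.
INVENTORY = "[all]\nlocalhost ansible_connection=local\n"


def build_commands(inventory: str, config: str) -> List[List[str]]:
    """Return the deploy and run commands as argument lists."""
    return [
        ["sb", "deploy", "-f", inventory],
        ["sb", "run", "-f", inventory, "-c", config],
    ]


def deploy_and_run(workdir: Path, config: str = "config.yaml",
                   dry_run: bool = True) -> List[List[str]]:
    """Write local.ini into workdir, then deploy and run (or just print)."""
    inventory = workdir / "local.ini"
    inventory.write_text(INVENTORY)
    commands = build_commands(str(inventory), config)
    for command in commands:
        if dry_run or shutil.which("sb") is None:
            print("would run:", " ".join(command))
        else:
            # check=True stops at the first failing step.
            subprocess.run(command, check=True)
    return commands
```

Calling `deploy_and_run(Path("."), dry_run=False)` would execute both steps in order, provided the `sb` command is on the PATH.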

Modular and Extensible Framework#

  1. Executor Framework

    To facilitate benchmarking and validation on large-scale clusters, we designed and implemented a modular and extensible framework. The SuperBench framework includes a runner as the control node and multiple executors as worker nodes. The runner receives commands from the CLI, distributes them to all worker nodes in the cluster, collects data, and summarizes the results. Each worker runs an executor to execute the specified benchmark tasks.

    SuperBench Executor Workflow

  2. Benchmark Framework

    SuperBench supports micro-benchmarks for primitive computation and communication benchmarking, and model-benchmarks to measure domain-aware end-to-end deep learning workloads. SuperBench implements an abstract BenchmarkBase class to provide common functionality. All kinds of benchmarks are built on this abstract class, which also provides a unified interface and result format, so developers can easily add new benchmarks.

    SuperBench Benchmark Package
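
The two pieces described above can be pictured with a small sketch: an abstract base class that gives every benchmark the same entry point and result format, and a runner that fans the benchmark list out to each node and collects the results. Only the name BenchmarkBase comes from this post; the methods, the result fields, the `VectorSum` toy benchmark, and the thread-based "nodes" are assumptions for illustration, not the actual SuperBench classes.

```python
import time
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class BenchmarkBase(ABC):
    """Common entry point and result format shared by all benchmarks."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def _benchmark(self) -> Dict[str, float]:
        """Run the measurement and return metric name -> value."""

    def run(self) -> dict:
        try:
            return {"name": self.name, "return_code": 0, "metrics": self._benchmark()}
        except Exception as error:  # failures are reported in the same format
            return {"name": self.name, "return_code": 1,
                    "error": str(error), "metrics": {}}


class VectorSum(BenchmarkBase):
    """Toy benchmark: time the summation of a range of integers."""

    def __init__(self, size: int = 1000) -> None:
        super().__init__("vector-sum")
        self.size = size

    def _benchmark(self) -> Dict[str, float]:
        start = time.perf_counter()
        total = sum(range(self.size))
        return {"elapsed_s": time.perf_counter() - start, "checksum": float(total)}


def run_on_nodes(nodes: List[str],
                 benchmarks: List[BenchmarkBase]) -> Dict[str, List[dict]]:
    """Runner: send the benchmark list to every node and collect the results."""

    def executor(node: str) -> List[dict]:
        # A real executor would run on the remote node; here it is a thread.
        return [benchmark.run() for benchmark in benchmarks]

    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = list(pool.map(executor, nodes))
    return dict(zip(nodes, results))
```

Adding a new benchmark in this sketch only requires subclassing BenchmarkBase and implementing `_benchmark`; the runner and the result format stay unchanged.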

Comprehensive and Standardized Benchmarks#

SuperBench supports the set of benchmarks listed below.

  • Micro-Benchmarks

    • Computation benchmarks
      • GEMM Flops
      • Kernel Launch Time
      • Operator Performance
    • Communication benchmarks
      • Memory
      • Device P2P
      • RDMA
      • NCCL
    • Computation-Communication Benchmarks
    • Storage Benchmarks
  • Model-Benchmarks

    • CNN models
    • LSTM models
    • BERT models
    • GPT-2 models

For the details of each benchmark, please view micro-benchmarks and model-benchmarks.

What's next?#

We want to extend SuperBench's capability to distributed validation and auto-diagnosis, to build a benchmarking ecosystem. The following figure shows the whole picture.

SuperBench Capabilities and Extension

With SuperBench and its extensions, we can support:

  • Quick and trustworthy distributed validation
    • Distributed validation tools to validate hundreds or thousands of servers automatically
    • Provide minute-level fast validation and guarantee high repeatability for each benchmark
    • Provide baselines for different systems as Performance/Quality Gates for hardware and system release
  • Detailed auto diagnosis
    • Provide comprehensive diagnosis benchmarks to analyze the detailed issues on defective nodes
    • Provide detailed performance reports and advanced analysis tools

Call for Contributors#

This project welcomes contributions and suggestions.