Microsoft Collective Communication Library (MSCCL) is a platform for executing custom collective communication algorithms on multiple accelerators supported by Microsoft Azure. MSCCL enables hardware- and application-specific optimizations that can deliver significant speedups over unspecialized communication algorithms.
The table below shows the speedups obtained by switching from NVIDIA’s NCCL to MSCCL. To reproduce these speedups in your own Microsoft Azure workload, follow the instructions in the msccl-tools and msccl repositories.
| Configuration | Allreduce | Alltoall |
|---|---|---|
| 1xNDv4 | (speedup graph) | (speedup graph) |
| 64xNDv4 | (speedup graph) | (speedup graph) |
The graphs in the table above show the speedup on the Y axis for a range of user data sizes on the X axis. Each graph shows the speedup for a specific hardware configuration and collective operation. For example, the graph in the “1xNDv4” row and “Allreduce” column shows the speedups given by MSCCL for the Allreduce collective when running on a single Azure NDv4 VM containing 8 NVIDIA A100 GPUs.
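Each point on such a graph is just the ratio of the two libraries' measured latencies for the same collective and data size. A minimal sketch of that computation (the timing values below are illustrative placeholders, not measured results):

```python
def speedup(baseline_us: float, msccl_us: float) -> float:
    """Speedup of MSCCL over a baseline: ratio of baseline latency
    to MSCCL latency for the same collective and data size."""
    return baseline_us / msccl_us

# Placeholder per-size timings in microseconds -- NOT measured data.
nccl_us = {1 << 20: 120.0, 1 << 26: 2400.0}
msccl_us = {1 << 20: 80.0, 1 << 26: 1600.0}

for size in sorted(nccl_us):
    print(f"{size} bytes: {speedup(nccl_us[size], msccl_us[size]):.2f}x")
```

A value above 1.0 means MSCCL completed the collective faster than the baseline at that data size.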
## Methods
These speedups were produced by running the relevant benchmarks from nccl-tests on each target hardware configuration, comparing MSCCL (using the algorithms available in msccl-tools) against NCCL 2.8.4.
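For reference, a run of this kind might look like the sketch below. The paths are assumptions for illustration, and the exact build and setup steps live in the msccl and msccl-tools repositories; the nccl-tests flags shown (`-b` min bytes, `-e` max bytes, `-f` size step factor, `-g` GPUs per thread) are the standard ones documented in the nccl-tests README.

```shell
# Build nccl-tests against the MSCCL build of NCCL.
# NCCL_HOME path is an assumption; point it at your msccl build directory.
make -C nccl-tests NCCL_HOME=/path/to/msccl/build

# Load the MSCCL library instead of stock NCCL at run time.
export LD_LIBRARY_PATH=/path/to/msccl/build/lib:$LD_LIBRARY_PATH

# Allreduce benchmark on 8 GPUs, sweeping data sizes from 8 B to 256 MB.
./nccl-tests/build/all_reduce_perf -b 8 -e 256M -f 2 -g 8

# Alltoall benchmark over the same size range.
./nccl-tests/build/alltoall_perf -b 8 -e 256M -f 2 -g 8
```

These commands require NVIDIA GPU hardware (e.g. an Azure NDv4 VM) and are shown as a sketch rather than a verified recipe.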