Quick Start

Prerequisites

  • GPUs

    • GPUs with NVIDIA CUDA architecture 7.0 (Volta) or later, or AMD CDNA 2 architecture (GFX90a) or later, are required. Features are tested more thoroughly on CUDA architecture 8.0 (Ampere) or later and on AMD CDNA 3 architecture (GFX942) or later.

    • Some features require GPUs to be connected peer-to-peer (through NVLink/xGMI or under the same PCIe switch).

      • On NVIDIA platforms, check the connectivity via nvidia-smi topo -m. If the output shows NV# or PIX, it means the GPUs are connected peer-to-peer.

      • On AMD platforms, check the connectivity via rocm-smi --showtopohops. If the output shows 1, it means the GPUs are connected peer-to-peer.

    • Example systems that meet the requirements include several Azure GPU SKUs and the following non-Azure systems:

      • NVIDIA A100 GPUs + CUDA >= 11.8

      • NVIDIA H100 GPUs + CUDA >= 12.0

      • AMD MI250X GPUs + ROCm >= 5.7

      • AMD MI300X GPUs + ROCm >= 6.0

  • OS

    • Tested on Ubuntu 18.04 and later

  • Libraries

    • libnuma

      sudo apt-get install libnuma-dev
      
    • (Optional, for building the Python module) Python >= 3.8 and Python Development Package

      sudo apt-get satisfy "python3 (>=3.8), python3-dev (>=3.8)"
      

      If you don’t want to build the Python module, set -DMSCCLPP_BUILD_PYTHON_BINDINGS=OFF in your cmake command (see details in Install from Source).

    • (Optional, for benchmarks) MPI

  • Others

    • For NVIDIA platforms, nvidia_peermem driver should be loaded on all nodes. Check it via:

      lsmod | grep nvidia_peermem
      
    • For NVLink SHARP (NVLS) support on NVIDIA platforms, the Linux kernel version should be 5.6 or above.
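For example, on NVIDIA platforms you can load the peer-memory module (if the lsmod check above shows it is missing) and check the kernel version as follows (this assumes modprobe and sudo are available on your system):

$ sudo modprobe nvidia_peermem
$ uname -r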

Docker Images

We provide Docker images that package all prerequisites for MSCCL++. You can set up your dev environment with one of the following commands. Note that our Docker images don’t contain MSCCL++ by default, so you need to build it from source inside the container (see Install from Source below).

# For NVIDIA platforms
$ docker run -it --privileged --net=host --ipc=host --gpus all --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.8 bash
# For AMD platforms
$ docker run -it --privileged --net=host --ipc=host --security-opt=seccomp=unconfined --group-add=video --name mscclpp-dev ghcr.io/microsoft/mscclpp/mscclpp:base-dev-rocm6.2 bash
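Once inside the container, you can quickly check that the GPUs are visible, for example:

# Inside the container
$ nvidia-smi   # NVIDIA platforms
$ rocm-smi     # AMD platforms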

See all available images here.

Install from Source

If you want to install only the Python module, you can skip this section and go to Install from Source (Python Module).

CMake 3.25 or later is required.

$ git clone https://github.com/microsoft/mscclpp.git
$ mkdir -p mscclpp/build && cd mscclpp/build

For NVIDIA platforms, build MSCCL++ as follows. Replace /usr with your desired installation path.

# For NVIDIA platforms
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ..
$ make -j$(nproc)

For AMD platforms, use HIPCC instead of the default C++ compiler. The HIPCC path is usually /opt/rocm/bin/hipcc in official ROCm installations. If the path is different in your environment, please change it accordingly.

# For AMD platforms
$ CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr ..
$ make -j$(nproc)

After the build succeeds, install the headers and binaries.

$ sudo make install
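To sanity-check the installation, you can list the installed headers and library under the install prefix (the paths below assume the default /usr prefix used above; the library directory and exact file names may differ on your system):

$ ls /usr/include/mscclpp
$ ls /usr/lib/libmscclpp*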

Tip

There are a few optional CMake options you can set:

  • -DMSCCLPP_GPU_ARCHS=<arch-list>: Specify the GPU architectures to build for. For example, -DMSCCLPP_GPU_ARCHS="80,90" for NVIDIA A100 and H100 GPUs, or -DMSCCLPP_GPU_ARCHS=gfx942 for AMD MI300X GPUs.

  • -DMSCCLPP_BYPASS_GPU_CHECK=ON -DMSCCLPP_USE_CUDA=ON: If the build environment doesn’t have GPUs and only has CUDA installed, you can set these options to bypass GPU checks and use CUDA APIs. This is useful for building on CI systems or environments without GPUs.

  • -DMSCCLPP_BYPASS_GPU_CHECK=ON -DMSCCLPP_USE_ROCM=ON: If the build environment doesn’t have GPUs and only has ROCm installed, you can set these options to bypass GPU checks and use ROCm APIs.

  • -DMSCCLPP_BUILD_PYTHON_BINDINGS=OFF: Don’t build the Python module.

  • -DMSCCLPP_BUILD_TESTS=OFF: Don’t build the tests.

  • -DMSCCLPP_BUILD_APPS_NCCL=OFF: Don’t build the NCCL API.
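For instance, a build targeting A100/H100 on a machine without GPUs (e.g., a CI runner that only has CUDA installed) and skipping the tests could combine the options above roughly as follows:

$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr \
    -DMSCCLPP_GPU_ARCHS="80,90" \
    -DMSCCLPP_BYPASS_GPU_CHECK=ON -DMSCCLPP_USE_CUDA=ON \
    -DMSCCLPP_BUILD_TESTS=OFF ..
$ make -j$(nproc)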

Install from Source (Python Module)

Python 3.8 or later is required.

# For NVIDIA platforms
$ python -m pip install .
# For AMD platforms, set the C++ compiler to HIPCC
$ CXX=/opt/rocm/bin/hipcc python -m pip install .
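A quick import check can confirm that the module was installed (this assumes the package is importable as mscclpp):

$ python -c "import mscclpp"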

VSCode Dev Container

If you are using VSCode, you can use our VSCode Dev Container that automatically launches a development environment and installs MSCCL++ in it. Steps to use our VSCode Dev Container:

  1. Open the MSCCL++ repository in VSCode.

  2. Make sure your Docker is running.

  3. Make sure you have the Dev Containers extension installed in VSCode.

  4. Open the command palette with Ctrl+Shift+P and select Dev Containers: Rebuild and Reopen in Container.

  5. Wait for the container to build and open (may take a few minutes).

Note

  • Our Dev Container is set up for NVIDIA GPUs by default. If you are using AMD GPUs, you need to copy devcontainer_amd.json to devcontainer.json.

  • Our Dev Container runs an SSH server over the host network and the port number is 22345 by default. You can change the port number by modifying the SSH_PORT argument in the devcontainer.json file.

  • Our Dev Container uses a non-root user devuser by default, but note that you may need root privileges to enable all hardware features of the GPUs inside the container. devuser is already configured to have sudo privileges without a password.
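For example, with the defaults above you could SSH into the running Dev Container from another machine as follows (replace <host-ip> with the address of the host running Docker):

$ ssh -p 22345 devuser@<host-ip>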

For more details on how to use the Dev Container, see the Dev Containers tutorial.

Unit Tests

unit_tests requires one GPU on the system. It tests only the operation of basic components.

$ make -j unit_tests
$ ./test/unit_tests

For thorough testing of MSCCL++ features, use mp_unit_tests, which requires at least two GPUs and MPI installed on the system. For example, the following commands compile and run mp_unit_tests with two processes (two GPUs). To use a different number of GPUs, change the number of processes.

$ make -j mp_unit_tests
$ mpirun -np 2 ./test/mp_unit_tests

To run mp_unit_tests across multiple nodes, you need to specify the -ip_port argument with an IP address and port that are accessible from all nodes. For example:

$ mpirun -np 16 -npernode 8 -hostfile hostfile ./test/mp_unit_tests -ip_port 10.0.0.5:50000

Performance Benchmark

Python Benchmark

Install the MSCCL++ Python package and run our Python AllReduce benchmark as follows. It requires MPI on the system.

# Choose `requirements_*.txt` according to your CUDA/ROCm version.
$ python3 -m pip install -r ./python/requirements_cuda12.txt
$ mpirun -tag-output -np 8 python3 ./python/mscclpp_benchmark/allreduce_bench.py

NCCL/RCCL Benchmark over MSCCL++

We implement the NCCL APIs using MSCCL++. To use them:

  1. Build MSCCL++ from source.

  2. Replace your libnccl.so library with libmscclpp_nccl.so, which is compiled under the ./build/apps/nccl/ directory.

For example, you can run nccl-tests using libmscclpp_nccl.so as follows, where MSCCLPP_BUILD is your MSCCL++ build directory.

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

If MSCCL++ is built on AMD platforms, libmscclpp_nccl.so replaces the RCCL library (i.e., librccl.so) instead.

MSCCL++ also supports falling back to NCCL/RCCL collectives by adding the following environment variables.

-x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE
-x MSCCLPP_NCCL_LIB_PATH=/path_to_nccl_lib/libnccl.so (or /path_to_rccl_lib/librccl.so for AMD platforms)
-x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="list of collective name[s]"

The value "list of collective name[s]" can be a combination of collectives, such as "allgather", "allreduce", "broadcast", and "reducescatter". Alternatively, it can simply be set to "all" to enable fallback for all these collectives. By default, if the parameter MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION is not specified, "all" will be applied.

Example 1: AllReduce falls back to NCCL's ncclAllReduce since allreduce is in the fallback list.

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce,allgather" ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

Example 2: ReduceScatter still uses the MSCCL++ implementation since reducescatter is not in the fallback list.

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$NCCL_BUILD/lib/libnccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="broadcast" -x MSCCLPP_EXECUTION_PLAN_DIR=/$PATH_TO_EXECUTION_PLANS/execution-files ./build/reduce_scatter_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50

On AMD platforms, you also need to set RCCL_MSCCL_ENABLE=0 (e.g., add -x RCCL_MSCCL_ENABLE=0 to the mpirun command) to avoid conflicts with the fallback features.
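For example, a fallback run on AMD platforms could look like the following, where $MSCCLPP_BUILD and $RCCL_BUILD are placeholders for your MSCCL++ and RCCL build directories:

mpirun -np 8 --bind-to numa --allow-run-as-root -x LD_PRELOAD=$MSCCLPP_BUILD/apps/nccl/libmscclpp_nccl.so -x MSCCLPP_ENABLE_NCCL_FALLBACK=TRUE -x MSCCLPP_NCCL_LIB_PATH=$RCCL_BUILD/lib/librccl.so -x MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION="allreduce" -x RCCL_MSCCL_ENABLE=0 ./build/all_reduce_perf -b 1K -e 256M -f 2 -d half -G 20 -w 10 -n 50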