Customized Collective Algorithm with NCCL API

Note

This tutorial demonstrates how to plug a custom collective algorithm (an AllGather variant) into MSCCL++'s NCCL interposition and algorithm-registration path, and to invoke it transparently via the standard NCCL API (ncclAllGather).

Overview

The example shows how to:

  1. Define a device kernel (allgather) that uses PortChannel device handles to exchange data (a device-kernel sketch follows this list).

  2. Wrap that kernel in an algorithm builder class (AllgatherAlgoBuilder) responsible for:

    • Connection discovery / proxy setup.

    • Context key generation, so contexts can be reused / cached (see the caching sketch after this list).

    • Launch function binding (the kernel wrapper executed when ncclAllGather is called).

  3. Register the algorithm builder with the global AlgorithmCollectionBuilder and install a selector that decides which implementation to use for a given collective request (a toy registry/selector illustration follows this list).

  4. Run a multi-process (multi-rank) test using standard NCCL calls. The user program remains unchanged apart from initialization / registration code.

  5. (Optionally) Capture the sequence of ncclAllGather calls into a CUDA Graph for efficient replay (a timing and capture sketch for steps 4–5 appears at the end of this page).
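
The following is a minimal sketch of what such a device kernel can look like: each rank pushes its own chunk to every peer through a PortChannel device handle and waits for the matching signal. The header name, launch configuration, buffer layout, and channel ordering are illustrative assumptions; the actual kernel is in customized_allgather.cu.

#include <mscclpp/port_channel.hpp>  // header name assumed; provides PortChannel and its DeviceHandle

// Sketch only, not the shipped kernel: one channel per remote peer.
__global__ void allgatherSketch(mscclpp::DeviceHandle<mscclpp::PortChannel>* channels,
                                int rank, int worldSize, size_t bytesPerRank) {
  int peer = threadIdx.x;  // assumed launch: one thread per peer, <<<1, worldSize - 1>>>
  if (peer >= worldSize - 1) return;

  // Write this rank's chunk into the peer's buffer at offset rank * bytesPerRank
  // and raise a signal so the peer knows the chunk has landed.
  channels[peer].putWithSignal(rank * bytesPerRank, rank * bytesPerRank, bytesPerRank);

  // Ensure our proxy requests are submitted, then wait for the chunk from that peer.
  channels[peer].flush();
  channels[peer].wait();
}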
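
One of the builder's jobs is context-key generation, so that the expensive setup (connection discovery, proxy setup, memory registration) runs once and later calls reuse the cached context. Below is a minimal sketch of that caching idea in plain C++; the key fields chosen here (buffer pointers, byte count, world size) are assumptions, and the real key logic lives in AllgatherAlgoBuilder.

#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <unordered_map>

struct AllgatherContext;  // would hold connections, proxy service, channel handles, ...

class ContextCache {
 public:
  std::shared_ptr<AllgatherContext> getOrCreate(const void* sendBuff, void* recvBuff,
                                                std::size_t bytes, int worldSize) {
    std::ostringstream key;
    key << sendBuff << ':' << recvBuff << ':' << bytes << ':' << worldSize;
    auto [it, inserted] = cache_.try_emplace(key.str());
    if (inserted) it->second = buildContext(sendBuff, recvBuff, bytes, worldSize);  // expensive, done once
    return it->second;  // later calls with the same key reuse the cached context
  }

 private:
  // Connection discovery, proxy setup, and memory registration would happen here.
  std::shared_ptr<AllgatherContext> buildContext(const void*, void*, std::size_t, int);
  std::unordered_map<std::string, std::shared_ptr<AllgatherContext>> cache_;
};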
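
Registration wires the builder into a process-wide collection and installs the selector. The exact AlgorithmCollectionBuilder interface is best read from customized_allgather.cu, so the snippet below is only a self-contained toy illustration of the same registry-plus-selector pattern; none of these types or methods are MSCCL++ APIs.

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Toy illustration of the registry + selector pattern from step 3.
// These names are invented for the illustration and are NOT the MSCCL++ interface.
struct ToyAlgorithm {
  std::string name;
  std::function<void()> launch;  // stands in for the bound kernel-launch wrapper
};

class ToyRegistry {
 public:
  static ToyRegistry& instance() {
    static ToyRegistry r;
    return r;
  }
  void addAlgorithm(ToyAlgorithm algo) {
    std::string key = algo.name;
    algos_.emplace(std::move(key), std::move(algo));
  }
  void setSelector(std::function<std::string(const std::string&, std::size_t)> selector) {
    selector_ = std::move(selector);
  }
  // An interposed collective entry point would consult the selector like this.
  void launch(const std::string& collective, std::size_t bytes) {
    algos_.at(selector_(collective, bytes)).launch();
  }

 private:
  std::map<std::string, ToyAlgorithm> algos_;
  std::function<std::string(const std::string&, std::size_t)> selector_;
};

int main() {
  ToyRegistry::instance().addAlgorithm({"customized_allgather", [] { std::cout << "custom kernel\n"; }});
  ToyRegistry::instance().addAlgorithm({"default_allgather", [] { std::cout << "default path\n"; }});
  // Selector: route every allgather request to the custom implementation.
  ToyRegistry::instance().setSelector([](const std::string& collective, std::size_t) {
    return collective == "allgather" ? std::string("customized_allgather") : std::string("default_allgather");
  });
  ToyRegistry::instance().launch("allgather", 1 << 20);  // prints: custom kernel
}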

Location

Example source directory:

examples/customized-collective-algorithm/

Key file: customized_allgather.cu.

Build and Run

From the repository root:

cd examples/customized-collective-algorithm
make

Run (inside a container, root privileges may be required depending on GPU access):

LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather

Expected (abbreviated) output on success:

GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
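
For reference, the per-iteration timing above comes from replaying the all-gather many times. The sketch below shows the pattern from steps 4 and 5: issue ncclAllGather through the standard API, capture the calls into a CUDA graph, and time graph replay with CUDA events. The helper name, data type, and iteration counts are illustrative assumptions, communicator/buffer setup (ncclCommInitRank, cudaMalloc) is assumed to have happened already, and error checking is omitted for brevity; the exact harness is in customized_allgather.cu.

#include <cuda_runtime.h>
#include <nccl.h>
#include <cstdio>

// Sketch: capture N ncclAllGather calls into a CUDA graph and time graph replay.
void timedAllGather(ncclComm_t comm, const void* sendBuff, void* recvBuff,
                    size_t countPerRank, cudaStream_t stream,
                    int itersPerGraph, int replays) {
  // Record the sequence of collectives once into a graph...
  cudaGraph_t graph;
  cudaGraphExec_t graphExec;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int i = 0; i < itersPerGraph; ++i) {
    ncclAllGather(sendBuff, recvBuff, countPerRank, ncclFloat, comm, stream);  // datatype assumed
  }
  cudaStreamEndCapture(stream, &graph);
  cudaGraphInstantiateWithFlags(&graphExec, graph, 0);

  // ...then replay the whole graph and time it with CUDA events.
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start, stream);
  for (int r = 0; r < replays; ++r) cudaGraphLaunch(graphExec, stream);
  cudaEventRecord(stop, stream);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  std::printf("elapsed %f ms/iter\n", ms / (itersPerGraph * replays));

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaGraphExecDestroy(graphExec);
  cudaGraphDestroy(graph);
}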