Customized Collective Algorithm with NCCL API
Note
This tutorial demonstrates how to plug a custom collective algorithm (an AllGather variant) into the MSCCL++ NCCL interposition and algorithm registration path, then invoke it transparently through the standard NCCL API (ncclAllGather).
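For orientation, the result any AllGather variant must produce can be written as a small host-side reference. This is a sketch for checking results only, not part of the example source; the function name referenceAllGather is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for AllGather semantics (hypothetical helper, not MSCCL++ code).
// Each rank contributes `count` elements; every rank ends up with all
// contributions concatenated in rank order.
std::vector<int> referenceAllGather(const std::vector<std::vector<int>>& sendBufs) {
  const std::size_t nRanks = sendBufs.size();
  const std::size_t count = nRanks ? sendBufs[0].size() : 0;
  std::vector<int> recv(nRanks * count);
  for (std::size_t r = 0; r < nRanks; ++r) {
    assert(sendBufs[r].size() == count);  // all ranks contribute the same amount
    for (std::size_t i = 0; i < count; ++i) {
      recv[r * count + i] = sendBufs[r][i];  // rank r's chunk lands at offset r * count
    }
  }
  return recv;  // the same buffer contents are expected on every rank
}
```

Comparing a rank's receive buffer (copied back to the host) against this reference is one simple way to validate a custom kernel.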
Overview
The example shows how to:
- Define a device kernel (allgather) that uses PortChannel device handles to exchange data.
- Wrap that kernel inside an algorithm class (AllgatherAlgoBuilder) responsible for:
  - Connection discovery / proxy setup.
  - Context key generation (so contexts can be reused / cached).
  - Launch function binding (kernel wrapper executed when NCCL all-gather is called).
- Register the algorithm builder with the global AlgorithmCollectionBuilder and install a selector deciding which implementation to return for a given collective request.
- Run a multi-process (multi-rank) test using standard NCCL calls. The user program remains unchanged apart from initialization / registration code.
- (Optionally) Capture the sequence of ncclAllGather calls into a CUDA Graph for efficient replay.
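The registration, selector, and context-key ideas in the list above can be sketched in plain C++. Every type and function here (Request, Context, Algo, Registry, firstMatch) is hypothetical and exists only to illustrate the pattern; none of it is the MSCCL++ API, and the real AllgatherAlgoBuilder in customized_allgather.cu should be treated as the authority.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; these are not MSCCL++ classes.
struct Request {           // what one collective call asks for
  std::string collective;  // e.g. "allgather"
  const void* input;
  void* output;
  std::size_t bytes;
};

struct Context { int id; };  // stands in for connections, channels, device handles

struct Algo {
  // Context key: requests with the same buffers and size share one context.
  using Key = std::tuple<std::uintptr_t, std::uintptr_t, std::size_t>;

  std::string name;
  std::string collective;
  int contextsBuilt;
  std::map<Key, Context> cache;

  Algo(std::string n, std::string c)
      : name(std::move(n)), collective(std::move(c)), contextsBuilt(0) {}

  Context& contextFor(const Request& req) {
    Key key{reinterpret_cast<std::uintptr_t>(req.input),
            reinterpret_cast<std::uintptr_t>(req.output), req.bytes};
    auto it = cache.find(key);
    if (it == cache.end()) {
      // Expensive setup would run here, once per distinct key.
      it = cache.emplace(key, Context{contextsBuilt++}).first;
    }
    return it->second;
  }
};

struct Registry {
  std::vector<Algo> algos;
  std::function<Algo*(std::vector<Algo>&, const Request&)> selector;

  void add(Algo a) { algos.push_back(std::move(a)); }
  // Returns nullptr when no selector is installed or nothing matches.
  Algo* select(const Request& req) { return selector ? selector(algos, req) : nullptr; }
};

// Example selector: the first algorithm registered for the requested collective.
Algo* firstMatch(std::vector<Algo>& algos, const Request& req) {
  for (auto& a : algos) {
    if (a.collective == req.collective) return &a;
  }
  return nullptr;
}
```

The design point is that setup cost is paid once per distinct key, so repeated calls with the same buffers and size (the common case in a training loop or a replayed graph) reuse the cached context.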
Location
Example source directory:
examples/customized-collective-algorithm/
Key file: customized_allgather.cu.
Build and Run
From the repository root:
cd examples/customized-collective-algorithm
make
Run (inside a container, you may need root privileges depending on GPU access):
LD_PRELOAD=<MSCCLPP_INSTALL_DIR>/lib/libmscclpp_nccl.so ./customized_allgather
Expected (abbreviated) output on success:
GPU 0: bytes 268435456, elapsed 7.35012 ms/iter, BW 109.564 GB/s
Succeed!
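The sample line can be sanity-checked by hand. 268435456 bytes is 256 MiB; dividing by 7.35012 ms gives about 36.5 GB/s, and the printed 109.564 GB/s equals three times that value. The factor of 3 below is inferred from the printed numbers only. One plausible reading, which is an assumption and not confirmed by this page, is a 4-rank run where bandwidth is counted over the data received from the other 3 ranks. The helper name bandwidthGBps is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Recomputes the BW figure from the sample output line.
// `factor` is an inferred multiplier (3 matches the printed numbers);
// it is not taken from the example's source code.
double bandwidthGBps(std::uint64_t bytes, double msPerIter, int factor) {
  const double seconds = msPerIter * 1e-3;
  return static_cast<double>(bytes) * factor / seconds / 1e9;  // 1 GB = 1e9 bytes
}
```

Calling bandwidthGBps(268435456, 7.35012, 3) reproduces the reported value to three decimal places; with a factor of 1 the same inputs give roughly 36.5 GB/s.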