C++ API Reference

This reference organizes the MSCCL++ C++ API into two main categories: Host-Side Interfaces for CPU code and Device-Side Interfaces for GPU kernels. Components that are used in both host and device code are documented in the Device-Side Interfaces section.

Host-Side Interfaces

These are the interfaces used in CPU code to set up connections, manage memory, and coordinate operations.

Bootstrap and Process Coordination

class Bootstrap

Base class for bootstraps.

Subclassed by mscclpp::TcpBootstrap

Public Functions

inline Bootstrap(): Constructor.

virtual ~Bootstrap() = default: Destructor.

virtual int getRank() const = 0

Return the rank of the process.

Returns:: The rank of the process.

virtual int getNranks() const = 0

Return the total number of ranks.

Returns:: The total number of ranks.

virtual int getNranksPerNode() const = 0

Return the total number of ranks per node.

Returns:: The total number of ranks per node.

virtual void send(void *data, int size, int peer, int tag) = 0

Send arbitrary data to another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag). Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:

data – The data to send.
size – The size of the data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.

virtual void recv(void *data, int size, int peer, int tag) = 0

Receive data sent from another process by send().

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag). Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:

data – The buffer to write the received data to.
size – The size of the data to receive.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.

virtual void allGather(void *allData, int size) = 0

Gather data from all processes.

When called by rank r, this sends data from allData[r * size] to allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received into allData[r * size] of other ranks.

Parameters:

allData – The buffer to write the received data to.
size – The size of the data each rank sends.

virtual void barrier() = 0: Synchronize all processes.

void groupBarrier(const std::vector<int> &ranks)

A partial barrier that synchronizes a group of ranks.

Parameters:: ranks – The ranks to synchronize.

void send(const std::vector<char> &data, int peer, int tag)

Wrapper of send() that sends a vector of characters.

Parameters:

data – The data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.

void recv(std::vector<char> &data, int peer, int tag)

Wrapper of recv() that receives a vector of characters.

Note

The data vector will be resized to the size of the received data.

Parameters:

data – The buffer to write the received data to.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.

class TcpBootstrap : public mscclpp::Bootstrap 

A native implementation of the bootstrap using TCP sockets.

Public Functions

TcpBootstrap(int rank, int nRanks)

Constructor.

Parameters:

rank – The rank of the process.
nRanks – The total number of ranks.

~TcpBootstrap(): Destructor.

UniqueId getUniqueId() const

Return the unique ID stored in the TcpBootstrap.

Returns:: The unique ID stored in the TcpBootstrap.

void initialize(UniqueId uniqueId, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a given unique ID. The unique ID can be generated by any method; it can be created by createUniqueId() or can be any arbitrary bit array provided by the user.

Parameters:

uniqueId – The unique ID to initialize the TcpBootstrap with.
timeoutSec – The connection timeout in seconds.

void initialize(const std::string &ifIpPortTrio, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a string formatted as “ip:port” or “interface:ip:port”.

Parameters:

ifIpPortTrio – The string formatted as “ip:port” or “interface:ip:port”.
timeoutSec – The connection timeout in seconds.

virtual int getRank() const override: Return the rank of the process.

virtual int getNranks() const override: Return the total number of ranks.

virtual int getNranksPerNode() const override: Return the total number of ranks per node.

virtual void send(void *data, int size, int peer, int tag) override

Send arbitrary data to another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag). Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:

data – The data to send.
size – The size of the data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.

virtual void recv(void *data, int size, int peer, int tag) override

Receive data sent from another process by send().

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag). Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:

data – The buffer to write the received data to.
size – The size of the data to receive.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.

virtual void allGather(void *allData, int size) override

Gather data from all processes.

When called by rank r, this sends data from allData[r * size] to allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received into allData[r * size] of other ranks.

Parameters:

allData – The buffer to write the received data to.
size – The size of the data each rank sends.

void broadcast(void *data, int size, int root)

Broadcast data from the root process to all processes using a ring-based algorithm.

When called by the root rank, this sends the size bytes starting at memory location data to all other ranks. Non-root ranks receive these bytes into their own data buffer, overwriting its previous contents. The data propagates sequentially through a logical ring of processes until all ranks have received it.

Parameters:

data – Pointer to the send buffer (root) or receive buffer (non-root)
size – Number of bytes to broadcast
root – Rank initiating the broadcast

virtual void barrier() override: Synchronize all processes.

Public Static Functions

static UniqueId createUniqueId()

Create a random unique ID.

Returns:: The created unique ID.

using mscclpp::UniqueId = std::array<uint8_t, UniqueIdBytes>: Unique ID for initializing the TcpBootstrap.

constexpr unsigned int mscclpp::UniqueIdBytes = 128

Connection Setup and Memory Management

class Connection

Connection between two processes.

Public Functions

Connection() = default: Constructor.

void write(RegisteredMemory dst, uint64_t dstOffset, RegisteredMemory src, uint64_t srcOffset, uint64_t size)

Write data from a source RegisteredMemory to a destination RegisteredMemory.

Parameters:

dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – The source RegisteredMemory.
srcOffset – The offset in bytes from the start of the source RegisteredMemory.
size – The number of bytes to write.

void updateAndSync(RegisteredMemory dst, uint64_t dstOffset, uint64_t *src, uint64_t newValue)

Update an 8-byte value in a destination RegisteredMemory and synchronize the change with the remote process.

Parameters:

dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – A pointer to the value to update.
newValue – The new value to write.

void flush(int64_t timeoutUsec = -1)

Flush any pending writes to the remote process.

Parameters:: timeoutUsec – Timeout in microseconds. Default: -1 (no timeout)

Transport transport() const

Get the transport used by the local process.

Returns:: The transport used by the local process.

Transport remoteTransport() const

Get the transport used by the remote process.

Returns:: The transport used by the remote process.

std::shared_ptr<Context> context() const

Get the context associated with this connection.

Returns:: A shared pointer to the context associated with this connection.

const Device &localDevice() const

Get the device used by the local endpoint.

Returns:: The device used by the local endpoint.

int getMaxWriteQueueSize() const

Get the maximum write queue size.

Returns:: The maximum number of write requests that can be queued.

class Context : public std::enable_shared_from_this<Context>

Context for communication. This provides a low-level interface for forming connections in use-cases where the process group abstraction offered by Communicator is not suitable, e.g., ephemeral client-server connections. Correct use of this class requires external synchronization when finalizing connections with the connect() method.

As an example, a client-server scenario where the server will write to the client might proceed as follows:

The client creates an endpoint with createEndpoint() and sends it to the server.
The server receives the client endpoint, creates its own endpoint with createEndpoint(), sends it to the client, and creates a connection with connect().
The client receives the server endpoint, creates a connection with connect() and sends a RegisteredMemory to the server.
The server receives the RegisteredMemory and writes to it using the previously created connection. The client waiting to create a connection before sending the RegisteredMemory ensures that the server cannot write to the RegisteredMemory before the connection is established.

While some transports may have more relaxed implementation behavior, this should not be relied upon.

Public Functions

~Context(): Destructor.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this context.

Parameters:

ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.

Returns:

A RegisteredMemory object representing the registered memory region.

Endpoint createEndpoint(EndpointConfig config)

Create an endpoint for establishing connections.

Parameters:: config – The configuration for the endpoint.
Returns:: The newly created endpoint.

Connection connect(const Endpoint &localEndpoint, const Endpoint &remoteEndpoint)

Establish a connection between two endpoints. While this method immediately returns a connection object, the connection is only safe to use after the corresponding connection on the remote endpoint has been established. This method must be called on both endpoints to establish a connection.

Parameters:

localEndpoint – The local endpoint.
remoteEndpoint – The remote endpoint.

Returns:

A connection object.

Public Static Functions

static inline std::shared_ptr<Context> create(): Create a new Context instance.

class Communicator

A class that sets up all registered memories and connections between processes.

A typical way to use this class:

Call connect() to declare connections between the calling process and other processes.
Call registerMemory() to register memory regions that will be used for communication.
Call sendMemory() or recvMemory() to send/receive registered memory regions to/from other processes.
Call get() on futures returned by connect(). Use the returned connections to create flags.
Call buildSemaphore() to create a Semaphore out of the flags.
Call get() on all futures returned by buildSemaphore() and recvMemory().
All done; use semaphores and registered memories to build channels.

CAUTION: The order of method calls matters when the same remote rank and tags are used. That is, the i-th “sending” method call (sendMemory(), connect(), and buildSemaphore()) on the local rank must be matched by the i-th “receiving” method call (recvMemory(), connect(), and buildSemaphore()) on the remote rank.

Correct Example 1:

// Rank 0
communicator.sendMemory(memory1, 1, tag);
communicator.sendMemory(memory2, 1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
connection.get(); // This will return the connection.
// Rank 1
auto mem1 = communicator.recvMemory(0, tag);
auto mem2 = communicator.recvMemory(0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag);
mem2.get();       // This will return memory2.
connection.get(); // This will return the connection.
mem1.get();       // This will return memory1.

Correct Example 2:

// Rank 0
communicator.sendMemory(memory0, 1, tag);
auto mem1 = communicator.recvMemory(1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
connection.get(); // This will return the connection.
mem1.get();       // This will return memory1.
// Rank 1
auto mem0 = communicator.recvMemory(0, tag);
communicator.sendMemory(memory1, 0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag);
mem0.get();       // This will return memory0.
connection.get(); // This will return the connection.

Wrong Example:

// Rank 0
communicator.sendMemory(memory0, 1, tag);
auto mem1 = communicator.recvMemory(1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
// Rank 1
auto mem0 = communicator.recvMemory(0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag); // undefined behavior
communicator.sendMemory(memory1, 0, tag);

In the wrong example, the connection information from rank 1 will be sent to the mem1 object on rank 0, where the object type is RegisteredMemory, not Connection.

Public Functions

Communicator(std::shared_ptr<Bootstrap> bootstrap, std::shared_ptr<Context> context = nullptr)

Initializes the communicator with a given bootstrap implementation.

Parameters:

bootstrap – An implementation of the Bootstrap that the communicator will use.
context – An optional context to use for the communicator. If not provided, a new context will be created.

~Communicator(): Destroy the communicator.

std::shared_ptr<Bootstrap> bootstrap()

Returns the bootstrap held by this communicator.

Returns:: The bootstrap held by this communicator.

std::shared_ptr<Context> context()

Returns the context held by this communicator.

Returns:: The context held by this communicator.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this communicator’s context.

Parameters:

ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.

Returns:

A RegisteredMemory object representing the registered memory region.

void sendMemory(RegisteredMemory memory, int remoteRank, int tag = 0)

Send information of a registered memory to the remote side.

The send will be started upon calling this function, but this function returns immediately without waiting for the completion of the send. RegisteredMemory sent via sendMemory(memory, remoteRank, tag) can be received via recvMemory(remoteRank, tag).

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:

memory – The registered memory buffer to send information about.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send.

std::shared_future<RegisteredMemory> recvMemory(int remoteRank, int tag = 0)

Receive memory information from a corresponding sendMemory call on the remote side.

This function returns a future immediately. The actual receive will be performed upon calling the first get() on the future. RegisteredMemory sent via sendMemory(memory, remoteRank, tag) can be received via recvMemory(remoteRank, tag).

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Note

To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same remoteRank and tag. For example, if we call recvMemory() or connect() five times with the same remoteRank and tag and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.

Parameters:

remoteRank – The rank of the remote process.
tag – The tag to use for identifying the receive.

Returns:

A future of registered memory.

std::shared_future<Connection> connect(const Endpoint &localEndpoint, int remoteRank, int tag = 0)

Connect to a remote rank.

This function will start (but not wait for) sending metadata about the local endpoint to the remote rank, and return a future connection without waiting for the remote rank to respond. The connection will be established when the remote rank responds with its own endpoint and the local rank calls the first get() on the future. Note that this function is two-way and a connection from rank i to remote rank j needs to have a counterpart from rank j to rank i. Note that with IB, buffers are registered at a page level and if a buffer is spread through multiple pages and does not fully utilize all of them, IB’s QP has to register for all involved pages. This potentially has security risks if the connection’s accesses are given to a malicious process.

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Note

To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same remoteRank and tag. For example, if we call recvMemory() or connect() five times with the same remoteRank and tag and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.

Parameters:

localEndpoint – The local endpoint.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send and receive.

Returns:

A future of the connection.

std::shared_future<Connection> connect(const EndpointConfig &localConfig, int remoteRank, int tag = 0)

Connect to a remote rank. Wrapper of connect(localEndpoint, remoteRank, tag).

Parameters:

localConfig – The configuration for the local endpoint.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send and receive.

Returns:

A future of the connection.

std::shared_future<Semaphore> buildSemaphore(const Connection &connection, int remoteRank, int tag = 0)

Build a semaphore for cross-process synchronization.

Parameters:

connection – The connection associated with this semaphore.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the operation.

Returns:

A future of the built semaphore.

int remoteRankOf(const Connection &connection)

Get the remote rank a connection is connected to.

Parameters:: connection – The connection to get the remote rank for.
Returns:: The remote rank the connection is connected to.

int tagOf(const Connection &connection)

Get the tag a connection was made with.

Parameters:: connection – The connection to get the tag for.
Returns:: The tag the connection was made with.

struct Device

Declaration of a device.

Public Functions

Device() = default: Constructor.

inline Device(DeviceType type, int id = -1)

Constructor.

Parameters:

type – Device type.
id – Device ID. Default is -1 (no specific ID).

Public Members

DeviceType type: Device Type.

int id: Device ID.

class Endpoint

One end of a connection.

Public Functions

Endpoint() = default: Constructor.

const EndpointConfig &config() const

Get the configuration used to create the endpoint.

Returns:: The configuration used to create the endpoint.

Transport transport() const

Get the transport used.

Returns:: The transport used.

const Device &device() const

Get the device used.

Returns:: The device used.

uint64_t hostHash() const

Get the host hash.

Returns:: The host hash.

uint64_t pidHash() const

Get the process ID hash.

Returns:: The process ID hash.

int maxWriteQueueSize() const

Get the maximum write queue size.

Returns:: The maximum number of write requests that can be queued.

std::vector<char> serialize() const

Serialize the Endpoint object to a vector of characters.

Returns:: A vector of characters representing the serialized Endpoint object.

Public Static Functions

static Endpoint deserialize(const std::vector<char> &data)

Deserialize an Endpoint object from a vector of characters.

Parameters:: data – A vector of characters representing a serialized Endpoint object.
Returns:: A deserialized Endpoint object.

struct EndpointConfig

Configuration for creating communication endpoints.

Public Functions

inline EndpointConfig(Transport transport = Transport::Unknown, Device device = DeviceType::GPU, int maxWriteQueueSize = -1, Ib ib = {})

Constructs endpoint configuration with specified transport, device, and optional settings.

Parameters:

transport – Communication transport to use.
device – Target device for the endpoint.
maxWriteQueueSize – Maximum write queue size (-1 for system default).
ib – IB-specific configuration.

Public Members

Transport transport: Communication transport type (e.g., CudaIpc, IB0-IB7, Ethernet).

Device device: Target device for the endpoint (GPU or CPU with optional device ID).

int maxWriteQueueSize: Maximum number of write requests that can be queued (-1 for default).

Ib ib: InfiniBand-specific options (used only for Transport::IBx).

struct Ib

InfiniBand-specific configuration options that control queue pair behavior and performance characteristics. These settings are only used when the transport is an InfiniBand type (IB0-IB7); they are ignored for other transports.

Public Functions

inline Ib(int maxCqSize = DefaultMaxCqSize, int maxCqPollNum = DefaultMaxCqPollNum, int maxSendWr = DefaultMaxSendWr, int maxWrPerSend = DefaultMaxWrPerSend)

Constructor.

Parameters:

maxCqSize – Maximum completion queue size.
maxCqPollNum – Maximum completion queue poll count.
maxSendWr – Maximum outstanding send work requests.
maxWrPerSend – Maximum work requests per send operation.

Public Members

int maxCqSize: Maximum size of the completion queue.

int maxCqPollNum: Maximum number of completion queue polls per operation.

int maxSendWr: Maximum number of outstanding send work requests.

int maxWrPerSend: Maximum number of work requests per send operation.

class NvlsConnection

Public Functions

SwitchChannel bindAllocatedMemory(CUdeviceptr devicePtr, size_t size)

Bind the memory allocated via mscclpp::GpuBuffer to the multicast handle. The behavior is undefined if the devicePtr is not allocated by mscclpp::GpuBuffer.

Parameters:

devicePtr – The device pointer returned by mscclpp::GpuBuffer::data().
size – The bytes of the memory to bind to the multicast handle.

Returns:

SwitchChannel with devicePtr, mcPtr and bufferSize

class RegisteredMemory

Block of memory that has been registered to a Context. RegisteredMemory does not own the memory it points to, but it provides a way to transfer metadata about the memory to other processes, hence allowing their access to the memory block.

Public Functions

RegisteredMemory() = default: Constructor.

~RegisteredMemory(): Destructor.

void *data() const

Get a pointer to the memory block.

Returns:: A pointer to the memory block.

void *originalDataPtr() const

Get a pointer to the original memory block.

Returns:: A pointer to the original memory block.

size_t size() const

Get the size of the memory block.

Returns:: The size of the memory block.

TransportFlags transports() const

Get the transport flags associated with the memory block.

Returns:: The transport flags associated with the memory block.

std::vector<char> serialize() const

Serialize the RegisteredMemory object to a vector of characters.

Returns:: A vector of characters representing the serialized RegisteredMemory object.

Public Static Functions

static RegisteredMemory deserialize(const std::vector<char> &data)

Deserialize a RegisteredMemory object from a vector of characters.

Parameters:: data – A vector of characters representing a serialized RegisteredMemory object.
Returns:: A deserialized RegisteredMemory object.

class TransportFlags : private detail::TransportFlagsBase

Stores transport flags.

Public Functions

TransportFlags() = default: Constructor.

TransportFlags(Transport transport)

Constructor.

Parameters:: transport – The transport to set the flag for.

bool has(Transport transport) const

Check if a specific transport flag is set.

Parameters:: transport – The transport to check the flag for.
Returns:: True if the flag is set, false otherwise.

bool none() const

Check if no transport flags are set.

Returns:: True if no flags are set, false otherwise.

bool any() const

Check if any transport flags are set.

Returns:: True if any flags are set, false otherwise.

bool all() const

Check if all transport flags are set.

Returns:: True if all flags are set, false otherwise.

size_t count() const

Get the number of transport flags that are set.

Returns:: The number of flags that are set.

TransportFlags &operator|=(TransportFlags other)

Bitwise OR assignment operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the OR operation with.
Returns:: A reference to the modified TransportFlags.

TransportFlags operator|(TransportFlags other) const

Bitwise OR operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the OR operation with.
Returns:: A new TransportFlags object with the result of the OR operation.

TransportFlags operator|(Transport transport) const

Bitwise OR operator for TransportFlags and Transport.

Parameters:: transport – The Transport to perform the OR operation with.
Returns:: A new TransportFlags object with the result of the OR operation.

TransportFlags &operator&=(TransportFlags other)

Bitwise AND assignment operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the AND operation with.
Returns:: A reference to the modified TransportFlags.

TransportFlags operator&(TransportFlags other) const

Bitwise AND operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the AND operation with.
Returns:: A new TransportFlags object with the result of the AND operation.

TransportFlags operator&(Transport transport) const

Bitwise AND operator for TransportFlags and Transport.

Parameters:: transport – The Transport to perform the AND operation with.
Returns:: A new TransportFlags object with the result of the AND operation.

TransportFlags &operator^=(TransportFlags other)

Bitwise XOR assignment operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the XOR operation with.
Returns:: A reference to the modified TransportFlags.

TransportFlags operator^(TransportFlags other) const

Bitwise XOR operator for TransportFlags.

Parameters:: other – The other TransportFlags to perform the XOR operation with.
Returns:: A new TransportFlags object with the result of the XOR operation.

TransportFlags operator^(Transport transport) const

Bitwise XOR operator for TransportFlags and Transport.

Parameters:: transport – The Transport to perform the XOR operation with.
Returns:: A new TransportFlags object with the result of the XOR operation.

TransportFlags operator~() const

Bitwise NOT operator for TransportFlags.

Returns:: A new TransportFlags object with the result of the NOT operation.

bool operator==(TransportFlags other) const

Equality comparison operator for TransportFlags.

Parameters:: other – The other TransportFlags to compare with.
Returns:: True if the two TransportFlags objects are equal, false otherwise.

bool operator!=(TransportFlags other) const

Inequality comparison operator for TransportFlags.

Parameters:: other – The other TransportFlags to compare with.
Returns:: True if the two TransportFlags objects are not equal, false otherwise.

detail::TransportFlagsBase toBitset() const

Convert the TransportFlags object to a bitset representation.

Returns:: A detail::TransportFlagsBase object representing the TransportFlags object.

enum class mscclpp::DeviceType

Available device types.

Values:

enumerator Unknown

enumerator CPU

enumerator GPU

enum class mscclpp::Transport

Enumerates the available transport types.

Values:

enumerator Unknown

enumerator CudaIpc

enumerator IB0

enumerator IB1

enumerator IB2

enumerator IB3

enumerator IB4

enumerator IB5

enumerator IB6

enumerator IB7

enumerator Ethernet

enumerator NumTransports

std::shared_ptr<NvlsConnection> mscclpp::connectNvlsCollective(std::shared_ptr<Communicator> comm, std::vector<int> allRanks, size_t bufferSize)

Connect to NVLS on setup.

This function used to connect to NVLS on setup. NVLS collective using multicast operations to send/recv data. Here we need to put all involved ranks into the collective group.

Parameters:

comm – The communicator.
allRanks – The ranks of all processes involved in the collective.
config – The configuration for the local endpoint.

Returns:

std::shared_ptr<NvlsConnection> A shared pointer to the NVLS connection.

Semaphores

class Host2DeviceSemaphore

A semaphore for sending signals from the host to the device.

Public Types

using DeviceHandle = Host2DeviceSemaphoreDeviceHandle : Device-side handle for Host2DeviceSemaphore.

Public Functions

Host2DeviceSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:: semaphore –

Host2DeviceSemaphore(Communicator &communicator, const Connection &connection)

Constructor.

Parameters:

communicator – The communicator.
connection – The connection associated with this semaphore.

Connection &connection()

Returns the connection.

Returns:: The connection associated with this semaphore.

void signal(): Signal the device.

DeviceHandle deviceHandle() const: Returns the device-side handle.

class Host2HostSemaphore

A semaphore for sending signals from the local host to a remote host.

Public Functions

Host2HostSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:: semaphore –

Host2HostSemaphore(Communicator &communicator, const Connection &connection)

Constructor.

Parameters:

communicator – The communicator.
connection – The connection associated with this semaphore. Transport::CudaIpc is not allowed for Host2HostSemaphore.

Connection &connection()

Returns the connection.

Returns:: The connection associated with this semaphore.

void signal(): Signal the remote host.

bool poll()

Check if the remote host has signaled.

Returns:: true if the remote host has signaled.

void wait(int64_t maxSpinCount = 10000000)

Wait for the remote host to signal.

Parameters:: maxSpinCount – The maximum number of spin counts before throwing an exception. Never throws if negative.

class MemoryDevice2DeviceSemaphore

A semaphore for sending signals from the local device to a peer device via a GPU thread.

Public Types

using DeviceHandle = MemoryDevice2DeviceSemaphoreDeviceHandle : Device-side handle for MemoryDevice2DeviceSemaphore.

Public Functions

MemoryDevice2DeviceSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:: semaphore –

MemoryDevice2DeviceSemaphore(Communicator &communicator, const Connection &connection)

Constructor.

Parameters:

communicator – The communicator.
connection – The connection associated with this semaphore.

Connection &connection()

Returns the connection.

Returns:: The connection associated with this semaphore.

DeviceHandle deviceHandle() const: Returns the device-side handle.

class Semaphore

Semaphore used by channels for synchronization.

Public Functions

Semaphore() = default: Constructor.

Semaphore(const SemaphoreStub &localStub, const SemaphoreStub &remoteStub)

Constructor.

Parameters:

localStub – SemaphoreStub allocated on the local process.
remoteStub – SemaphoreStub allocated on the remote process.

Connection &connection()

Get the connection associated with this semaphore.

Returns:: The connection.

const RegisteredMemory &localMemory() const

Get the local memory associated with this semaphore.

Returns:: A reference to the local registered memory.

const RegisteredMemory &remoteMemory() const

Get the remote memory associated with this semaphore.

Returns:: A reference to the remote registered memory.

class SemaphoreStub

SemaphoreStub object only used for constructing Semaphore, not for direct use by the user.

Public Functions

SemaphoreStub(const Connection &connection)

Constructor.

Parameters:: connection – The connection associated with this semaphore.

const RegisteredMemory &memory() const

Get the memory associated with this semaphore.

Returns:: A reference to the registered memory for this semaphore.

std::vector<char> serialize() const

Serialize into a vector of characters.

Returns:: A vector of characters representing the serialized SemaphoreStub object.

Public Static Functions

static SemaphoreStub deserialize(const std::vector<char> &data)

Deserialize a SemaphoreStub object from a vector of characters.

Parameters:: data – A vector of characters representing a serialized SemaphoreStub object.
Returns:: A deserialized SemaphoreStub object.

Channels

struct BaseMemoryChannel

Memory channel without specifying source/destination memory regions.

Subclassed by mscclpp::MemoryChannel

Public Types

using DeviceHandle = BaseMemoryChannelDeviceHandle : Device-side handle for BaseMemoryChannel.

Public Functions

BaseMemoryChannel() = default: Constructor.

BaseMemoryChannel(std::shared_ptr<MemoryDevice2DeviceSemaphore> semaphore)

Constructor.

Parameters:: semaphore – Semaphore used to synchronize the communication.

BaseMemoryChannel(const Semaphore &semaphore)

Constructor.

Parameters:: semaphore – Semaphore used to synchronize the communication.

BaseMemoryChannel(const BaseMemoryChannel &other) = default

Constructor.

Parameters:: other – Other BaseMemoryChannel to copy from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the BaseMemoryChannel is not released when using the returned handle.

Returns:: The device-side handle.

struct BasePortChannel

Port channel without specifying source/destination memory regions.

Subclassed by mscclpp::PortChannel

Public Types

using DeviceHandle = BasePortChannelDeviceHandle : Device-side handle for BasePortChannel.

Public Functions

BasePortChannel() = default: Constructor.

BasePortChannel(SemaphoreId semaphoreId, std::shared_ptr<Host2DeviceSemaphore> semaphore, std::shared_ptr<Proxy> proxy)

Constructor.

Parameters:

semaphoreId – The ID of the semaphore.
semaphore – The semaphore used to synchronize the communication.
proxy – The proxy used for communication.

BasePortChannel(SemaphoreId semaphoreId, const Semaphore &semaphore, std::shared_ptr<Proxy> proxy)

Constructor.

Parameters:

semaphoreId – The ID of the semaphore.
semaphore – The semaphore used to synchronize the communication.
proxy – The proxy used for communication.

BasePortChannel(const BasePortChannel &other) = default

Copy constructor.

Parameters:: other – The other BasePortChannel to copy from.

BasePortChannel &operator=(BasePortChannel &other) = default

Assignment operator.

Parameters:: other – The other BasePortChannel to assign from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the BasePortChannel is not released when using the returned handle.

Returns:: The device-side handle.

struct MemoryChannel : public mscclpp::BaseMemoryChannel 

Channel for accessing peer memory directly from GPU threads.

Public Types

using DeviceHandle = MemoryChannelDeviceHandle : Device-side handle for MemoryChannel.

Public Functions

MemoryChannel() = default: Constructor.

MemoryChannel(std::shared_ptr<MemoryDevice2DeviceSemaphore> semaphore, RegisteredMemory dst, RegisteredMemory src, void *packetBuffer = nullptr)

Constructor.

Parameters:

semaphore – The semaphore used to synchronize the communication.
dst – Registered memory of the destination.
src – Registered memory of the source.
packetBuffer – A buffer used to store packets. packetBuffer is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.

MemoryChannel(const Semaphore &semaphore, RegisteredMemory dst, RegisteredMemory src, void *packetBuffer = nullptr)

Constructor.

Parameters:

semaphore – The semaphore used to synchronize the communication.
dst – Registered memory of the destination.
src – Registered memory of the source.
packetBuffer – A buffer used to store packets. packetBuffer is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the MemoryChannel is not released when using the returned handle.

Returns:: The device-side handle.

struct PortChannel : public mscclpp::BasePortChannel 

Port channel.

Public Types

using DeviceHandle = PortChannelDeviceHandle : Device-side handle for PortChannel.

Public Functions

PortChannel() = default: Constructor.

PortChannel(SemaphoreId semaphoreId, std::shared_ptr<Host2DeviceSemaphore> semaphore, std::shared_ptr<Proxy> proxy, MemoryId dst, MemoryId src)

Constructor.

Parameters:

semaphoreId – The ID of the semaphore.
semaphore – The semaphore.
proxy – The proxy.
dst – The destination memory region.
src – The source memory region.

PortChannel(SemaphoreId semaphoreId, const Semaphore &semaphore, std::shared_ptr<Proxy> proxy, MemoryId dst, MemoryId src)

Constructor.

Parameters:

semaphoreId – The ID of the semaphore.
semaphore – The semaphore.
proxy – The proxy.
dst – The destination memory region.
src – The source memory region.

PortChannel(const PortChannel &other) = default

Copy constructor.

Parameters:: other – The other PortChannel to copy from.

PortChannel &operator=(PortChannel &other) = default

Assignment operator.

Parameters:: other – The other PortChannel to assign from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the PortChannel is not released when using the returned handle.

Returns:: The device-side handle.

struct SwitchChannel

Proxy Service and FIFO Management

class BaseProxyService

Base class for proxy services. Proxy services are used to proxy data between devices.

Subclassed by mscclpp::ProxyService

class Fifo

Host-side proxy FIFO for device-produced work elements.

Public Functions

Fifo(int size = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:: size – Number of entries (default: DEFAULT_FIFO_SIZE).

~Fifo(): Destructor.

ProxyTrigger poll()

Poll and get the trigger at the head.

Returns:: ProxyTrigger at the head of the FIFO.

void pop(): Remove the head trigger.

int size() const

Get FIFO size.

Returns:: Number of entries in the FIFO.

FifoDeviceHandle deviceHandle() const

Get device-side FIFO handle.

Returns:: FifoDeviceHandle for device access.

class Proxy

Host-side proxy for PortChannels.

Public Functions

Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:

handler – Handler for each FIFO trigger.
threadInit – Optional function run in proxy thread before FIFO consumption.
fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).

Proxy(ProxyHandler handler, int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:

handler – Handler for each FIFO trigger.
fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).

~Proxy(): Destructor. Stops proxy if running.

void start(bool blocking = false): Start proxy.

void stop(): Stop proxy.

std::shared_ptr<Fifo> fifo()

Get reference to FIFO used by proxy.

Returns:: Shared pointer to FIFO.

class ProxyService : public mscclpp::BaseProxyService 

Proxy service implementation.

Public Functions

ProxyService(int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:: fifoSize – Size of the FIFO used by the proxy service (default: DEFAULT_FIFO_SIZE).

SemaphoreId buildAndAddSemaphore(Communicator &communicator, const Connection &connection)

Build and add a semaphore to the proxy service.

Parameters:: connection – The connection associated with the semaphore.
Returns:: The ID of the semaphore.

SemaphoreId addSemaphore(const Semaphore &semaphore)

Add a semaphore to the proxy service.

Parameters:: semaphore – The semaphore to be added
Returns:: The ID of the semaphore.

SemaphoreId addSemaphore(std::shared_ptr<Host2DeviceSemaphore> semaphore)

Add a semaphore to the proxy service.

Parameters:: semaphore – The semaphore to be added
Returns:: The ID of the semaphore.

MemoryId addMemory(RegisteredMemory memory)

Register a memory region with the proxy service.

Parameters:: memory – The memory region to register.
Returns:: The ID of the memory region.

MemoryId nextMemoryId(uint32_t count = 1) const

Get the next available memory ID.

Parameters:: count – The number of consecutive IDs required (default: 1).
Returns:: The first ID of an available range [first, first + count).

std::shared_ptr<Host2DeviceSemaphore> semaphore(SemaphoreId id) const

Get a semaphore by ID.

Parameters:: id – The ID of the semaphore.
Returns:: The semaphore.

BasePortChannel basePortChannel(SemaphoreId id)

Get a base port channel by semaphore ID.

Parameters:: id – The ID of the semaphore.
Returns:: The base port channel.

PortChannel portChannel(SemaphoreId id, MemoryId dst, MemoryId src)

Get a port channel by semaphore ID and memory regions.

Parameters:

id – The ID of the semaphore.
dst – The destination memory region.
src – The source memory region.

Returns:

The port channel.

virtual void startProxy(bool blocking = false)

Start the proxy service.

Parameters:: blocking – Whether to block until the proxy thread has started (default: false).

virtual void stopProxy(): Stop the proxy service.

using mscclpp::ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>: Handler function type for proxy.

enum class mscclpp::ProxyHandlerResult

Return values for ProxyHandler.

Values:

enumerator Continue: Move to next trigger in FIFO.

enumerator Stop: Stop and exit proxy.

constexpr size_t mscclpp::DEFAULT_FIFO_SIZE = 512

Utilities

struct AvoidCudaGraphCaptureGuard: A RAII guard that will cudaThreadExchangeStreamCaptureMode to cudaStreamCaptureModeRelaxed on construction and restore the previous mode on destruction. This is helpful when we want to avoid CUDA/HIP graph capture.

struct CudaStreamWithFlags

A RAII wrapper around cudaStream_t that will call cudaStreamDestroy on destruction.

Public Functions

CudaStreamWithFlags(): Constructor without flags. This will not create any stream. set() can be called later to create a stream with specified flags.

CudaStreamWithFlags(unsigned int flags)

Constructor with flags. This will create a stream with the specified flags on the current device.

Parameters:: flags – The flags to create the stream with.

~CudaStreamWithFlags(): Destructor. This will destroy the stream if it was created.

void set(unsigned int flags)

Set the stream with the specified flags. The current device at the time of the construction will be used. If the stream was already created, it will raise an error with ErrorCode::InvalidUsage.

Parameters:: flags – The flags to create the stream with.
Throws:: Error – if the stream was already created.

bool empty() const

Check if the stream is empty (not created).

Returns:: true if the stream is empty, false otherwise.

template<class T = char> class GpuBuffer

Allocates a GPU memory space specialized for communication. The memory is zeroed out. Get the device pointer by GpuBuffer::data().

Use this function for communication buffers, i.e., only when other devices (CPU, GPU, NIC, etc.) may access this memory space at the same time with the local device (GPU). Running heavy computation over this memory space may perform bad and is not recommended in general.

The allocated memory space is managed by the memory_ object, not by the class instance. Which means, the class destructor will NOT free the allocated memory if memory_ is shared with and alive in other contexts.

Template Parameters:: T – Type of each element in the allocated memory. Default is char.

Public Functions

inline GpuBuffer(size_t nelems)

Constructs a GpuBuffer with the specified number of elements.

Parameters:: nelems – Number of elements to allocate. If it is zero, data() will return a null pointer.

inline size_t nelems() const

Returns the number of elements in the allocated memory.

Returns:: The number of elements.

inline size_t bytes() const

Returns the number of bytes that is actually allocated. This may be larger than nelems() * sizeof(T).

Returns:: The number of bytes.

inline std::shared_ptr<T> memory()

Returns the shared pointer to the allocated memory. If nelems() is zero, this function will return an empty shared pointer.

Returns:: A std::shared_ptr to the allocated memory.

inline T *data()

Returns the device pointer to the allocated memory. Equivalent to memory().get(). If nelems() is zero, this function will return a null pointer.

Returns:: A device pointer to the allocated memory.

inline int deviceId() const

Returns the device id of the allocated memory.

Returns:: The device id.

class GpuStream

A managed non-blocking GPU stream object created by GpuStreamPool. This object does not own the stream. When this object is destroyed, it will return the underlying CudaStreamWithFlags to the pool.

Public Functions

~GpuStream(): Destructor. This will return the underlying CudaStreamWithFlags to the pool, not destroy it.

class GpuStreamPool

A pool of managed GPU streams. Only provides non-blocking streams. This is intended to be used for reusing temporal streams.

Public Functions

GpuStream getStream()

Get a non-blocking GPU stream from the pool. If no streams are available, a new one will be created.

Returns:: A GpuStream object.

void clear(): Clear the pool, which will remove all streams from the pool.

int mscclpp::getDeviceNumaNode(int deviceId)

Return the NUMA node ID of the given GPU device ID.

Parameters:: deviceId – The GPU device ID.
Throws:: Error – if the device ID is invalid or if the NUMA node cannot be determined.
Returns:: The NUMA node ID of the device.

std::string mscclpp::getHostName(int maxlen, const char delim)

Get the host name of the system.

Parameters:

maxlen – The maximum length of the returned string.
delim – The delimiter to use for the host name; if the delimiter is found before maxlen characters, the host name will be truncated at that point.

Throws:

Error – if it fails to retrieve the host name (error code: SystemError).

Returns:

The host name of the system, truncated to maxlen characters if necessary, and split by the specified delimiter.

int mscclpp::getIBDeviceCount()

Get the number of available InfiniBand devices.

Returns:: The number of available InfiniBand devices.

std::string mscclpp::getIBDeviceName(Transport ibTransport)

Get the name of the InfiniBand device associated with the specified transport.

Parameters:: ibTransport – The InfiniBand transport to get the device name for.
Returns:: The name of the InfiniBand device associated with the specified transport.

Transport mscclpp::getIBTransportByDeviceName(const std::string &ibDeviceName)

Get the InfiniBand transport associated with the specified device name.

Parameters:: ibDeviceName – The name of the InfiniBand device to get the transport for.
Returns:: The InfiniBand transport associated with the specified device name.

template<class T = char> void mscclpp::gpuMemcpy(T *dst, const T *src, size_t nelems, cudaMemcpyKind kind = cudaMemcpyDefault)

Copies memory from src to dst synchronously.

Template Parameters:

T – Type of each element in the memory.

Parameters:

dst – Destination address.
src – Source address.
nelems – Number of elements to copy.
kind – The kind of copy operation. Default is cudaMemcpyDefault.

template<class T = char> void mscclpp::gpuMemcpyAsync(T *dst, const T *src, size_t nelems, cudaStream_t stream, cudaMemcpyKind kind = cudaMemcpyDefault)

Copies memory from src to dst asynchronously.

Template Parameters:

T – Type of each element in the memory.

Parameters:

dst – Destination address.
src – Source address.
nelems – Number of elements to copy.
stream – The stream to use for the copy operation.
kind – The kind of copy operation. Default is cudaMemcpyDefault.

std::shared_ptr<GpuStreamPool> mscclpp::gpuStreamPool()

Get the singleton instance of GpuStreamPool.

Returns:: A shared pointer to the GpuStreamPool instance.

bool mscclpp::isCuMemMapAllocated(void *ptr)

Check if ptr is allocaed by cuMemMap.

Parameters:: ptr – The pointer to check.
Returns:: True if the pointer is allocated by cuMemMap, false otherwise.

bool mscclpp::isNvlsSupported()

Check if NVLink SHARP (NVLS) is supported.

Returns:: True if NVLink SHARP (NVLS) is supported, false otherwise.

void mscclpp::numaBind(int node)

NUMA bind the current thread to the specified NUMA node.

Parameters:: node – The NUMA node ID to bind to.
Throws:: Error – if the given NUMA node ID is invalid.

Executor Interface

class ExecutionPlan

Represents a compiled execution plan loaded from disk.

An ExecutionPlan encapsulates metadata about a collective algorithm such as its name, the collective it implements, and the supported message-size range. The concrete implementation is hidden behind the PIMPL pointer.

Public Functions

ExecutionPlan(const std::string &planPath, int rank)

Construct an ExecutionPlan by loading the plan file at planPath.

Parameters:

planPath – Filesystem path to the serialized plan.
rank – The rank of the current process.

~ExecutionPlan() = default: Destructor.

std::string name() const: Return the human-readable name of the plan.

std::string collective() const: Return the collective implemented by this plan (e.g., “allreduce”, “allgather”).

size_t minMessageSize() const: Minimum message size (in bytes) for which this plan is valid.

size_t maxMessageSize() const: Maximum message size (in bytes) for which this plan is valid.

bool isInPlace() const: Whether this plan performs the operation in-place.

class Executor

High-level executor responsible for invoking execution plans on a communicator.

Public Functions

Executor(std::shared_ptr<Communicator> comm, std::shared_ptr<char> defaultScratchBuffer = nullptr)

Construct an Executor using the provided communicator.

Parameters:

comm – Communicator instance used for underlying communication.
defaultScratchBuffer – Optional scratch buffer used by some plans (may be nullptr).

Executor(const Executor&) = delete: Copy construction is disabled for Executor.

Executor &operator=(const Executor&) = delete: Copy assignment is disabled for Executor.

~Executor(): Destructor. Cleans up internal resources held by the Executor.

void execute(int rank, void *sendbuff, void *recvBuff, size_t sendBuffSize, size_t recvBuffSize, DataType dataType, const ExecutionPlan &plan, cudaStream_t stream, PacketType packetType = PacketType::LL16)

Execute a plan.

This method dispatches the given plan on the provided CUDA stream.

Parameters:

rank – Rank of the calling process.
sendbuff – Pointer to the send buffer.
recvBuff – Pointer to the receive buffer.
sendBuffSize – Size of the send buffer in bytes.
recvBuffSize – Size of the receive buffer in bytes.
dataType – Data type of elements in the buffers.
plan – The execution plan to run.
stream – CUDA stream to execute kernels/operations on.
packetType – Packet type used for low-latency transports (default: LL16).

enum class mscclpp::DataType

Data types supported by mscclpp operations.

Values:

enumerator INT32

enumerator UINT32

enumerator FLOAT16

enumerator FLOAT32

enumerator BFLOAT16

enumerator FP8_E4M3

enumerator FP8_E5M2

enum class mscclpp::PacketType

Packet formats used by low-latency transport.

Values:

enumerator LL8

enumerator LL16

Environment and Configuration

class Env

The MSCCL++ environment. The constructor reads environment variables and sets the corresponding fields. Use the env() function to get the environment object.

Public Members

const std::string debug: Env name: MSCCLPP_DEBUG. The debug flag, one of VERSION, WARN, INFO, ABORT, or TRACE. Unset by default.

const std::string debugSubsys: Env name: MSCCLPP_DEBUG_SUBSYS. The debug subsystem, a comma-separated list of subsystems to enable debug logging for. If the first character is ‘^’, it inverts the mask, i.e., enables all subsystems except those specified. Possible values are INIT, COLL, P2P, SHM, NET, GRAPH, TUNING, ENV, ALLOC, CALL, MSCCLPP_EXECUTOR, MSCCLPP_NCCL, ALL. Unset by default.

const std::string debugFile: Env name: MSCCLPP_DEBUG_FILE. A file path to write debug logs to. Unset by default.

const std::string logLevel: Env name: MSCCLPP_LOG_LEVEL. One of DEBUG, INFO, WARN, or ERROR, in the order of severity (lower to higher level). A lower level is a superset of a higher level. Default is ERROR.

const std::string logSubsys: Env name: MSCCLPP_LOG_SUBSYS. The log subsystem, a comma-separated list of subsystems to enable logging for. Possible values are ENV, NET, CONN, EXEC, NCCL, ALL (default). If the first character is ‘^’, it inverts the mask, i.e., enables all subsystems except those specified. For example, “^NET,CONN” enables all subsystems except NET and CONN.

const std::string logFile: Env name: MSCCLPP_LOG_FILE. A file path to write log messages to. Unset by default.

const std::string hcaDevices: Env name: MSCCLPP_HCA_DEVICES. A comma-separated list of HCA devices to use for IB transport. i-th device in the list will be used for the i-th GPU in the system. If unset, it will use ibverbs APIs to find the devices automatically.

const std::string hostid: Env name: MSCCLPP_HOSTID. A string that uniquely identifies the host. If unset, it will use the hostname. This is used to determine whether the host is the same across different processes.

const std::string socketFamily: Env name: MSCCLPP_SOCKET_FAMILY. The socket family to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). Possible values are AF_INET (IPv4) and AF_INET6 (IPv6). If unset, it will not force any family and will use the first one found.

const std::string socketIfname: Env name: MSCCLPP_SOCKET_IFNAME. The interface name to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). If unset, it will use the first interface found that matches the socket family.

const std::string commId: Env name: MSCCLPP_COMM_ID. To be deprecated; don’t use this.

const std::string executionPlanDir: Env name: MSCCLPP_EXECUTION_PLAN_DIR. The directory to find execution plans from. This should be set to use execution plans for the NCCL API. Unset by default.

const std::string npkitDumpDir: Env name: MSCCLPP_NPKIT_DUMP_DIR. The directory to dump NPKIT traces to. If this is set, NPKIT will be enabled and will dump traces to this directory. Unset by default.

const bool cudaIpcUseDefaultStream: Env name: MSCCLPP_CUDAIPC_USE_DEFAULT_STREAM. If set to true, the CUDA IPC transport will use the default stream for all operations. If set to false, it will use a separate stream for each operation. This is an experimental feature and should be false in most cases. Default is false.

const std::string ncclSharedLibPath: Env name: MSCCLPP_NCCL_LIB_PATH. The path to the original NCCL/RCCL shared library. If set, it will be used as a fallback for NCCL operations in cases where the MSCCL++ NCCL cannot work.

const std::string forceNcclFallbackOperation: Env name: MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION. A comma-separated list of NCCL operations that should always use the fallback implementation, even if the MSCCL++ NCCL can handle them. This is useful for debugging purposes. Currently supports all, broadcast, allreduce, reducescatter, and allgather.

const bool disableChannelCache: Env name: MSCCLPP_DISABLE_CHANNEL_CACHE. If set to true, it will disable the channel cache for NCCL APIs. Currently, this should be set to true if the application may call NCCL APIs on the same local buffer with different remote buffers, e.g., in the case of a dynamic communicator. If CUDA/HIP graphs are used, disabling the channel cache won’t affect the performance, but otherwise it may lead to performance degradation. Default is false.

const bool forceDisableNvls: Env name: MSCCLPP_FORCE_DISABLE_NVLS. If set to true, it will disable the NVLS support in MSCCL++. Default is false.

std::shared_ptr<Env> mscclpp::env()

Get the MSCCL++ environment.

Returns:: A reference to the global environment object.

Error Handling

class BaseError : public std::runtime_error

Base class for all errors thrown by MSCCL++.

Subclassed by mscclpp::CuError, mscclpp::CudaError, mscclpp::Error, mscclpp::IbError, mscclpp::SysError

Public Functions

BaseError(const std::string &message, int errorCode)

Constructor of BaseError.

Parameters:

message – The error message.
errorCode – The error code.

explicit BaseError(int errorCode)

Constructor of BaseError.

Parameters:: errorCode – The error code.

virtual ~BaseError() = default: Virtual destructor for BaseError.

int getErrorCode() const

Get the error code.

Returns:: The error code.

const char *what() const noexcept override

Get the error message.

Returns:: The error message.

class CudaError : public mscclpp::BaseError : An error from a CUDA runtime library call.

class CuError : public mscclpp::BaseError : An error from a CUDA driver library call.

class Error : public mscclpp::BaseError : A generic error.

class IbError : public mscclpp::BaseError : An error from an ibverbs library call.

class SysError : public mscclpp::BaseError : An error from a system call that sets errno.

enum class mscclpp::ErrorCode

Enumeration of error codes used by MSCCL++.

Values:

enumerator SystemError

enumerator InternalError

enumerator RemoteError

enumerator InvalidUsage

enumerator Timeout

enumerator Aborted

enumerator ExecutorError

std::string mscclpp::errorToString(enum ErrorCode error)

Convert an error code to a string.

Parameters:: error – The error code to convert.
Returns:: The string representation of the error code.

Version

std::string mscclpp::version()

Return a version string.

Returns:: The MSCCL++ version string in “major.minor.patch” format.

Macro Functions

MSCCLPP_CUDATHROW(cmd)

Throw mscclpp::CudaError if cmd does not return cudaSuccess.

Parameters:

cmd – The command to execute.

MSCCLPP_CUTHROW(cmd)

Throw mscclpp::CuError if cmd does not return CUDA_SUCCESS.

Parameters:

cmd – The command to execute.

Device-Side Interfaces

These device-side handle structures provide GPU kernel interfaces for MSCCL++ communication primitives. They are designed to be used directly in CUDA/HIP device code.

Channel Device Interfaces

struct BaseMemoryChannelDeviceHandle

Device-side handle of a MemoryChannel without specific source and destination.

Subclassed by mscclpp::MemoryChannelDeviceHandle

Public Functions

inline void signal()

Signal the remote semaphore.

This function guarantees that all the memory operation before this function is completed before the remote semaphore is signaled.

inline void relaxedSignal()

Signal the remote semaphore.

This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.

inline bool poll()

Check if the remote semaphore has signaled.

Returns:: true if the remote semaphore has signaled.

inline void wait(int64_t maxSpinCount = 10000000)

Wait for the remote semaphore to send a signal.

Parameters:: maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

inline void relaxedWait(int64_t maxSpinCount = 10000000)

Wait for the remote semaphore to send a signal.

This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.

Parameters:: maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

struct BasePortChannelDeviceHandle

Subclassed by mscclpp::PortChannelDeviceHandle

Public Functions

inline void put(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:

dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.

inline void put(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:

dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.

inline void signal(): Push a TriggerFlag to the FIFO.

inline void putWithSignal(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:

dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.

inline void putWithSignal(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:

dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.

inline void putWithSignalAndFlush(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:

dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void putWithSignalAndFlush(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:

dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void flush(int64_t maxSpinCount = 1000000)

Push a TriggerSync to the FIFO.

Parameters:: maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline bool poll()

Check if the port channel has been signaled.

Returns:: true if the port channel has been signaled.

inline void wait(int64_t maxSpinCount = 10000000)

Wait for the port channel to be signaled.

Parameters:: maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

union LL16Packet

#include <packet_device.hpp>

LL (low latency) protocol packet with 8 bytes of data and 8 bytes of flags.

Public Types

using Payload = uint2

Public Functions

LL16Packet() = default

inline LL16Packet(uint2 val, uint32_t flag)

inline void write(uint32_t val1, uint32_t val2, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:

val1 – The first 4-byte data to write.
val2 – The second 4-byte data to write.
flag – The flag to write.

inline void write(uint64_t val, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:

val – The 8-byte data to write.
flag – The flag to write.

inline void write(uint2 val, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:

val – The 8-byte data to write.
flag – The flag to write.

inline bool readOnce(uint32_t flag, uint2 &data) const

Helper of read().

Parameters:

flag – The flag to read.
data – The 8-byte data read.

Returns:

True if the flag is not equal to the given flag.

inline uint2 read(uint32_t flag, [[maybe_unused]] int64_t maxSpinCount = 100000000) const

Read 8 bytes of data from the packet. It will spin until the flag is equal to the given flag.

Parameters:

flag – The flag to read.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Returns:

The 8-byte data read.

inline void clear(): Clear the packet.

Public Members

uint32_t data1

uint32_t flag1

uint32_t data2

uint32_t flag2

struct mscclpp::LL16Packet::[anonymous] [anonymous]

ulonglong2 raw_

union LL8Packet

#include <packet_device.hpp>

LL (low latency) protocol packet with 4 bytes of data and 4 bytes of flags.

Public Types

using Payload = uint32_t

Public Functions

LL8Packet() = default

inline LL8Packet(uint32_t val, uint32_t flag)

inline void write(uint32_t val, uint32_t flag)

Write 4 bytes of data to the packet.

Parameters:

val – The 4-byte data to write.
flag – The flag to write.

inline bool readOnce(uint32_t flag, uint32_t &data) const

Helper of read().

Parameters:

flag – The flag to read.
data – The 4-byte data read.

Returns:

True if the flag is not equal to the given flag.

inline uint32_t read(uint32_t flag, [[maybe_unused]] int64_t maxSpinCount = 1000000) const

Read 4 bytes of data from the packet. It will spin until the flag is equal to the given flag.

Parameters:

flag – The flag to read.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Returns:

The 4-byte data read.

inline void clear(): Clear the packet.

Public Members

uint32_t data

uint32_t flag

struct mscclpp::LL8Packet::[anonymous] [anonymous]

uint64_t raw_

struct MemoryChannelDeviceHandle : public mscclpp::BaseMemoryChannelDeviceHandle 

Device-side handle of a MemoryChannel.

Public Functions

template<typename T> inline T read(uint64_t index)

Load a value from the remote memory.

Template Parameters:: T – The type of the value to be loaded.
Parameters:: index – The index of the value to be loaded. The offset in bytes is calculated as index * sizeof(T).
Returns:: The value loaded.

template<typename T> inline void write(uint64_t index, const T &v)

Write a value to the remote memory.

Template Parameters:

T – The type of the value to be written.

Parameters:

index – The index of the value to be written. The offset in bytes is calculated as index * sizeof(T).
v – The value to be written.

template<int Alignment = 16, bool CopyRemainder = true> inline void put(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Copy data from the local memory (origin) to the remote memory (target).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.
CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of Alignment.

Parameters:

targetOffset – The offset in bytes of the remote address. Should be a multiple of Alignment.
originOffset – The offset in bytes of the local address. Should be a multiple of Alignment.
originBytes – Bytes of the origin to be copied. Should be a multiple of Alignment.
threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.
numThreads – The total number of threads that run this function.

template<int Alignment = 16, bool CopyRemainder = true> inline void put(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads): Wrapper of put() with the same offset for target and origin.

template<int Alignment = 16, bool CopyRemainder = true> inline void get(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Copy data from the remote memory (origin) to the local memory (target).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.
CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of Alignment.

Parameters:

targetOffset – The offset in bytes of the local address. Should be a multiple of Alignment.
originOffset – The offset in bytes of the remote address. Should be a multiple of Alignment.
originBytes – Bytes of the origin to be copied. Should be a multiple of Alignment.
threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.
numThreads – The total number of threads that run this function.

template<int Alignment = 16, bool CopyRemainder = true> inline void get(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads): Wrapper of get() with the same offset for target and origin.

template<typename PacketType = LL16Packet> inline void putPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag)

Copy data from the local memory (origin) to the remote memory (target) using packets.

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:

targetOffset – The offset in bytes of the remote address.
originOffset – The offset in bytes of the local address.
originBytes – Bytes of the origin to be copied.
threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.
numThreads – The total number of threads that run this function.
flag – The flag to write.

template<typename PacketType = LL16Packet> inline void putPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag): Wrapper of putPackets() with the same offset for target and origin.

template<typename PacketType = LL16Packet> inline auto unpackPacket(uint64_t index, uint32_t flag, int64_t maxSpinCount = -1)

Retrieve data from a packet in the local packet buffer.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:

index – The index of the packet to be read. The offset in bytes is calculated as index * sizeof(PacketType).
flag – The flag to read.
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

Returns:

The value read from the packet. The type of the value depends on the packet type.

template<typename PacketType = LL16Packet> inline void unpackPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1)

Retrieve data from packets in the local packet buffer (target) and write to the local memory (origin).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:

targetOffset – The offset in bytes of the local packet buffer.
originOffset – The offset in bytes of the local address.
originBytes – Bytes of the origin to be copied.
threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.
numThreads – The total number of threads that run this function.
flag – The flag to write.
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

template<typename PacketType = LL16Packet> inline void unpackPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1): Wrapper of unpackPackets() with the same offset for target and origin.

struct PortChannelDeviceHandle : public mscclpp::BasePortChannelDeviceHandle 

Public Functions

inline void put(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:

dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.

inline void put(uint64_t offset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:

offset – The common offset into the destination and source memory regions.
size – The size of the transfer.

inline void putWithSignal(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:

dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.

inline void putWithSignal(uint64_t offset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:

offset – The common offset into the destination and source memory regions.
size – The size of the transfer.

inline void putWithSignalAndFlush(uint64_t dstOffset, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:

dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void putWithSignalAndFlush(uint64_t offset, uint64_t size)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:

offset – The common offset into the destination and source memory regions.
size – The size of the transfer.

struct SwitchChannelDeviceHandle: Device-side handle for SwitchChannel.

using mscclpp::LLPacket = LL16Packet 

using mscclpp::MemoryId = uint32_t: Numeric ID of RegisteredMemory. ProxyService has an internal array indexed by these handles mapping to the actual.

using mscclpp::SemaphoreId = uint32_t: Numeric ID of Semaphore. ProxyService has an internal array indexed by these handles mapping to the actual semaphores.

Semaphore Device Interfaces

struct Host2DeviceSemaphoreDeviceHandle

Device-side handle for Host2DeviceSemaphore.

Public Functions

inline bool poll()

Poll if the host has signaled.

Returns:: true if the host has signaled.

inline void wait([[maybe_unused]] int64_t maxSpinCount = 100000000): Wait for the host to signal.

inline uint64_t loadExpectedInbound()

Thread-safe read of expected inbound value.

Returns:: The expected inbound value.

inline uint64_t incExpectedInbound()

Thread-safe increment of expected inbound value.

Returns:: The incremented expected inbound value.

inline uint64_t loadInbound()

Thread-safe read of inbound value.

Returns:: The inbound value.

Public Members

uint64_t *inboundToken: A local memory space where a host thread (on behalf of the remote device) will write its semaphore value and the local device will read it.

uint64_t *expectedInboundToken: A local memory space where the local device stores the expected value of the inboundToken to wait for.

struct MemoryDevice2DeviceSemaphoreDeviceHandle

Device-side handle for MemoryDevice2DeviceSemaphore.

Public Functions

inline bool poll()

Poll if remote device has signaled.

Returns:: true if remote device has signaled.

inline void wait([[maybe_unused]] int64_t maxSpinCount = 100000000): Wait for remote device to signal.

inline void relaxedWait([[maybe_unused]] int64_t maxSpinCount = 100000000): Relaxed wait; no memory completion guarantee. Use it only for synchronizing execution, not data.

inline void signal(): Signal remote device, ensures prior memory ops complete.

inline void relaxedSignal(): Relaxed signal; no memory completion guarantee. Use it only for synchronizing execution, not data.

inline uint64_t loadExpectedInbound()

Thread-safe read of expected inbound value.

Returns:: The expected inbound value.

inline uint64_t incExpectedInbound()

Thread-safe increment of expected inbound value.

Returns:: The incremented expected inbound value.

inline uint64_t loadInbound()

Thread-safe read of inbound value.

Returns:: The inbound value.

inline uint64_t loadInboundRelaxed()

Thread-safe read of inbound value without memory completion guarantee.

Returns:: The inbound value.

inline uint64_t loadOutbound()

Thread-safe read of outbound value.

Returns:: The outbound value.

inline uint64_t incOutbound()

Thread-safe increment of outbound value.

Returns:: The incremented outbound value.

Public Members

uint64_t *inboundToken: A local memory space where the remote device will write its semaphore value and the local device will read it.

uint64_t *outboundToken: A local memory space where the local device stores the semaphore value to be written to the remote device.

uint64_t *remoteInboundToken: A remote memory space where the local device writes its outboundToken on. This is inboundToken of the remote device.

uint64_t *expectedInboundToken: A local memory space where the local device stores the expected value of the inboundToken to wait for.

FIFO Device Interfaces

struct FifoDeviceHandle

Concurrent FIFO where multiple device threads (the number of threads should not exceed the FIFO size) to push Head pointer is on device, tail pointer is on host (readable by device). The FIFO’s capacity is limited only by MAX_UINT64—effectively infinite for practical use. Exceeding this limit will overflow the counter and lead to undefined behavior.

Public Functions

inline uint64_t push(ProxyTrigger trigger, int64_t maxSpinCount = 1000000)

Push a trigger to the FIFO.

Parameters:

trigger – Trigger to push.
maxSpinCount – Max spin count before assert. Never assert if negative.

Returns:

Previous head of the FIFO where the trigger was pushed.

inline bool poll(uint64_t fifoHead)

Poll whether a specific trigger is popped from the FIFO.

Parameters:: fifoHead – FIFO head where the trigger was pushed.
Returns:: True if the trigger is popped; false otherwise.

inline void sync(uint64_t fifoHead, [[maybe_unused]] int64_t maxSpinCount = 1000000)

Wait until a specific trigger is popped from the FIFO.

Parameters:

fifoHead – FIFO head where the trigger was pushed.
maxSpinCount – Max spin count before assert. Never assert if negative.

Public Members

ProxyTrigger *triggers: FIFO buffer on host.

uint64_t *head: FIFO head on device.

uint64_t *tail: FIFO tail on host.

uint64_t *tailCache: Cached tail value.

int size: FIFO size.

union ProxyTrigger

#include <fifo_device.hpp>

Pair of 64-bit unsigned integers used as a trigger for the proxy. Used as a work element in the concurrent FIFO. Most significant bit of snd is reserved.

Public Functions

ProxyTrigger() = default: Default constructor.

inline ProxyTrigger(TriggerType type, uint32_t dstId, uint64_t dstOffset, uint32_t srcId, uint64_t srcOffset, uint64_t bytes, uint32_t semaphoreId)

Constructor.

Parameters:

type – The type of the trigger.
dstId – The destination ID of memory region.
dstOffset – The offset into the destination memory region.
srcId – The source ID of memory region.
srcOffset – The offset into the source memory region.
bytes – The bytes of the transfer.
semaphoreId – The ID of the semaphore.

Public Members

uint64_t fst

uint64_t snd

struct mscclpp::ProxyTrigger::[anonymous] [anonymous]

uint64_t size

uint64_t srcOffset

uint64_t __pad0__

uint64_t dstOffset

uint64_t srcMemoryId

uint64_t dstMemoryId

uint64_t type

uint64_t semaphoreId

uint64_t __pad1__

uint64_t reserved

struct mscclpp::ProxyTrigger::[anonymous] fields

constexpr unsigned int mscclpp::TriggerBitsFifoReserved = 1

constexpr unsigned int mscclpp::TriggerBitsMemoryId = 9

constexpr unsigned int mscclpp::TriggerBitsOffset = 32

constexpr unsigned int mscclpp::TriggerBitsSemaphoreId = 10

constexpr unsigned int mscclpp::TriggerBitsSize = 32

constexpr unsigned int mscclpp::TriggerBitsType = 3

using mscclpp::TriggerType = uint64_t

constexpr TriggerType mscclpp::TriggerData = 0x1

constexpr TriggerType mscclpp::TriggerFlag = 0x2

constexpr TriggerType mscclpp::TriggerSync = 0x4

Device Utilities

struct DeviceSemaphore

A device-wide semaphore. This semaphore can be used to control access to a resource across multiple threads or blocks. It uses atomic operations to ensure that the semaphore value is updated correctly across threads. The semaphore value is an integer that can be set, acquired, and released.

Example usage of DeviceSemaphore:

__global__ void myKernel(mscclpp::DeviceSemaphore* semaphore) {
  // Initialize the semaphore (allow up to 3 threads access the resource simultaneously)
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    semaphore->set(3);
  }
  // Each block acquires the semaphore before accessing the shared resource
  if (threadIdx.x == 0) {
    semaphore->acquire();
  }
  __syncthreads();
  // Access the shared resource
  // ...
  __syncthreads();
  // Release the semaphore after accessing the shared resource
  if (threadIdx.x == 0) {
    semaphore->release();
  }
}

Public Functions

DeviceSemaphore() = default: Construct a new DeviceSemaphore object.

inline DeviceSemaphore(int initialValue)

Parameters:: initialValue – The initial value of the semaphore.

~DeviceSemaphore() = default: Destroy the DeviceSemaphore object.

inline void set(int value)

Set the semaphore value. This function is used to initialize or reset the semaphore value. The initial value is typically set to a positive integer to allow acquiring the semaphore.

Parameters:: value – The value to set.

inline void acquire([[maybe_unused]] int maxSpinCount = -1)

Acquire the semaphore (decrease the semaphore value by 1). This function will spin until the semaphore is acquired or the maximum spin count is reached.

Parameters:: maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void release(): Release the semaphore (increase the semaphore value by 1).

struct DeviceSyncer

A device-wide barrier. This barrier can be used to synchronize multiple thread blocks within a kernel. It uses atomic operations to ensure that all threads in the same kernel reach the barrier before proceeding and they can safely read data written by other threads in different blocks.

Example usage of DeviceSyncer:

__global__ void myKernel(mscclpp::DeviceSyncer* syncer, int numBlocks) {
  // Do some work here
  // ...
  // Synchronize all blocks
  syncer->sync(numBlocks);
  // All blocks have reached this point
  // ...
}

Public Functions

DeviceSyncer() = default: Construct a new DeviceSyncer object.

~DeviceSyncer() = default: Destroy the DeviceSyncer object.

inline void sync(int blockNum, [[maybe_unused]] int64_t maxSpinCount = 100000000)

Synchronize all threads inside a kernel. Guarantee that all previous work of all threads in cooperating blocks is finished.

Parameters:

blockNum – The number of blocks that will synchronize.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Public Static Attributes

static const unsigned int NumCounters = 3U: The number of sync counters.

union VectorType

#include <gpu_data_types.hpp>

Vector type.

Public Types

using ElementType = T

Public Functions

inline VectorType()

inline operator T*()

inline operator const T*() const

inline T &operator[](int i)

inline const T &operator[](int i) const

Public Members

T data[N]

Words<sizeof(T) * N> words

Public Static Attributes

static constexpr int Size = N

template<int Bytes> struct Words: Word array.

template<int Alignment = 16, bool CopyRemainder = true> void mscclpp::copy(void *dst, void *src, uint64_t bytes, uint32_t threadId, uint32_t numThreads)

Copy data from the source memory to the destination memory.

This function is intended to be collectively called by multiple threads. Each thread copies a part of elements.

Note

The source and destination addresses do not have to be aligned to the Alignment value, but the misalignment to Alignment should be multiple of 4 bytes and should be the same for both source and destination addresses. The behavior of this function is undefined otherwise.

Note

The number of bytes to be copied should be a multiple of 4 bytes. If the number of bytes is not a multiple of 4 bytes, the remainder bytes will not be copied.

Template Parameters:

Alignment – The alignment of the data to be copied. A larger alignment value is more likely to achieve higher copying throughput. Should be one of 4, 8, or 16.
CopyRemainder – If false, the function will not copy data that is unaligned to the Alignment value. If true, the function will try to copy the unaligned data with conditions (see the notes).

Parameters:

dst – The destination address.
src – The source address.
bytes – Bytes of the data to be copied. Should be a multiple of 4 bytes.
threadId – The index of the current thread among all threads running this function. Should be less than numThreads.
numThreads – The total number of threads that run this function.

template<typename PacketType = LL16Packet> void mscclpp::copyFromPackets(void *originPtr, const void *targetPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1)

Read packets from the target buffer and write retrieved data to the origin.

This function is intended to be collectively called by multiple threads. Each thread reads a part of packets.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:

originPtr – The origin buffer.
targetPtr – The target buffer.
originBytes – The number of bytes to read from the origin buffer.
threadId – The index of the current thread among all threads running this function. Should be less than numThreads.
numThreads – The total number of threads that run this function.
flag – The flag to write in the packets.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

template<typename PacketType = LL16Packet> void mscclpp::copyToPackets(void *targetPtr, const void *originPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag)

Read data from the origin and write packets to the target buffer.

This function is intended to be collectively called by multiple threads. Each thread copies a part of packets.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:

targetPtr – The target buffer.
originPtr – The origin buffer.
originBytes – The number of bytes to write to the target buffer.
threadId – The index of the current thread among all threads running this function. Should be less than numThreads.
numThreads – The total number of threads that run this function.
flag – The flag to write in the packets.

Atomics

constexpr cuda::memory_order mscclpp::memoryOrderAcqRel = cuda::memory_order_acq_rel

constexpr cuda::memory_order mscclpp::memoryOrderAcquire = cuda::memory_order_acquire

constexpr cuda::memory_order mscclpp::memoryOrderRelaxed = cuda::memory_order_relaxed

constexpr cuda::memory_order mscclpp::memoryOrderRelease = cuda::memory_order_release

constexpr cuda::memory_order mscclpp::memoryOrderSeqCst = cuda::memory_order_seq_cst

constexpr cuda::thread_scope mscclpp::scopeDevice = cuda::thread_scope_device

constexpr cuda::thread_scope mscclpp::scopeSystem = cuda::thread_scope_system

template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system> T mscclpp::atomicFetchAdd(T *ptr, const T &val, cuda::memory_order memoryOrder)

template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system> T mscclpp::atomicLoad(T *ptr, cuda::memory_order memoryOrder)

template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system> void mscclpp::atomicStore(T *ptr, const T &val, cuda::memory_order memoryOrder)

Vector Data Types

using mscclpp::bf16x2 = VectorType<__bfloat16, 2>

using mscclpp::bf16x4 = VectorType<__bfloat16, 4>

using mscclpp::bf16x8 = VectorType<__bfloat16, 8>

using mscclpp::f16x2 = VectorType<__half, 2>

using mscclpp::f16x4 = VectorType<__half, 4>

using mscclpp::f16x8 = VectorType<__half, 8>

using mscclpp::f32x1 = VectorType<float, 1>

using mscclpp::f32x2 = VectorType<float, 2>

using mscclpp::f32x4 = VectorType<float, 4>

using mscclpp::f64x1 = VectorType<double, 1>

using mscclpp::i32x1 = VectorType<int32_t, 1>

using mscclpp::i32x2 = VectorType<int32_t, 2>

using mscclpp::i32x4 = VectorType<int32_t, 4>

using mscclpp::u32x1 = VectorType<uint32_t, 1>

using mscclpp::u32x2 = VectorType<uint32_t, 2>

using mscclpp::u32x4 = VectorType<uint32_t, 4>

Macro Functions

MSCCLPP_ASSERT_DEVICE(__cond, __msg): Assert a condition on the device and print a message if the condition is false. This macro does nothing in a release mode build (when DEBUG_BUILD is undefined).

OR_POLL_MAYBE_JAILBREAK(__cond1, __cond2, __max_spin_cnt)

POLL_MAYBE_JAILBREAK(__cond, __max_spin_cnt)