C++ API Reference

This reference organizes the MSCCL++ C++ API into two main categories: Host-Side Interfaces for CPU code and Device-Side Interfaces for GPU kernels. Components that are used in both host and device code are documented in the Device-Side Interfaces section.

Host-Side Interfaces

These are the interfaces used in CPU code to set up connections, manage memory, and coordinate operations.

Bootstrap and Process Coordination

class Bootstrap

Base class for bootstraps.

Subclassed by mscclpp::TcpBootstrap

Public Functions

inline Bootstrap()

Constructor.

virtual ~Bootstrap() = default

Destructor.

virtual int getRank() const = 0

Return the rank of the process.

Returns:

The rank of the process.

virtual int getNranks() const = 0

Return the total number of ranks.

Returns:

The total number of ranks.

virtual int getNranksPerNode() const = 0

Return the total number of ranks per node.

Returns:

The total number of ranks per node.

virtual void send(void *data, int size, int peer, int tag) = 0

Send arbitrary data to another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via

recv(receiverBuff, size,

senderRank, tag)

. Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:
  • data – The data to send.

  • size – The size of the data to send.

  • peer – The rank of the process to send the data to.

  • tag – The tag to send the data with.

virtual void recv(void *data, int size, int peer, int tag) = 0

Receive data sent from another process by send().

Data sent via send(senderBuff, size, receiverRank, tag) can be received via

recv(receiverBuff, size,

senderRank, tag)

. Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:
  • data – The buffer to write the received data to.

  • size – The size of the data to receive.

  • peer – The rank of the process to receive the data from.

  • tag – The tag to receive the data with.

virtual void allGather(void *allData, int size) = 0

Gather data from all processes.

When called by rank r, this sends data from allData[r * size] to allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received into allData[r * size] of other ranks.

Parameters:
  • allData – The buffer to write the received data to.

  • size – The size of the data each rank sends.

virtual void barrier() = 0

Synchronize all processes.

void groupBarrier(const std::vector<int> &ranks)

A partial barrier that synchronizes a group of ranks.

Parameters:

ranks – The ranks to synchronize.

void send(const std::vector<char> &data, int peer, int tag)

Wrapper of send() that sends a vector of characters.

Parameters:
  • data – The data to send.

  • peer – The rank of the process to send the data to.

  • tag – The tag to send the data with.

void recv(std::vector<char> &data, int peer, int tag)

Wrapper of recv() that receives a vector of characters.

Note

The data vector will be resized to the size of the received data.

Parameters:
  • data – The buffer to write the received data to.

  • peer – The rank of the process to receive the data from.

  • tag – The tag to receive the data with.

class TcpBootstrap : public mscclpp::Bootstrap

A native implementation of the bootstrap using TCP sockets.

Public Functions

TcpBootstrap(int rank, int nRanks)

Constructor.

Parameters:
  • rank – The rank of the process.

  • nRanks – The total number of ranks.

~TcpBootstrap()

Destructor.

UniqueId getUniqueId() const

Return the unique ID stored in the TcpBootstrap.

Returns:

The unique ID stored in the TcpBootstrap.

void initialize(UniqueId uniqueId, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a given unique ID. The unique ID can be generated by any method; it can be created by createUniqueId() or can be any arbitrary bit array provided by the user.

Parameters:
  • uniqueId – The unique ID to initialize the TcpBootstrap with.

  • timeoutSec – The connection timeout in seconds.

void initialize(const std::string &ifIpPortTrio, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a string formatted as “ip:port” or “interface:ip:port”.

Parameters:
  • ifIpPortTrio – The string formatted as “ip:port” or “interface:ip:port”.

  • timeoutSec – The connection timeout in seconds.

virtual int getRank() const override

Return the rank of the process.

virtual int getNranks() const override

Return the total number of ranks.

virtual int getNranksPerNode() const override

Return the total number of ranks per node.

virtual void send(void *data, int size, int peer, int tag) override

Send arbitrary data to another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via

recv(receiverBuff, size,

senderRank, tag)

. Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:
  • data – The data to send.

  • size – The size of the data to send.

  • peer – The rank of the process to send the data to.

  • tag – The tag to send the data with.

virtual void recv(void *data, int size, int peer, int tag) override

Receive data sent from another process by send().

Data sent via send(senderBuff, size, receiverRank, tag) can be received via

recv(receiverBuff, size,

senderRank, tag)

. Multiple calls to send() with the same peer and tag will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:
  • data – The buffer to write the received data to.

  • size – The size of the data to receive.

  • peer – The rank of the process to receive the data from.

  • tag – The tag to receive the data with.

virtual void allGather(void *allData, int size) override

Gather data from all processes.

When called by rank r, this sends data from allData[r * size] to allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received into allData[r * size] of other ranks.

Parameters:
  • allData – The buffer to write the received data to.

  • size – The size of the data each rank sends.

void broadcast(void *data, int size, int root)

Broadcast data from the root process to all processes using a ring-based algorithm.

When called by the root rank, this sends the size bytes starting at memory location data to all other ranks. Non-root ranks receive these bytes into their own data buffer, overwriting its previous contents. The data propagates sequentially through a logical ring of processes until all ranks have received it.

Parameters:
  • data – Pointer to the send buffer (root) or receive buffer (non-root)

  • size – Number of bytes to broadcast

  • root – Rank initiating the broadcast

virtual void barrier() override

Synchronize all processes.

Public Static Functions

static UniqueId createUniqueId()

Create a random unique ID.

Returns:

The created unique ID.

using mscclpp::UniqueId = std::array<uint8_t, UniqueIdBytes>

Unique ID for initializing the TcpBootstrap.

constexpr unsigned int mscclpp::UniqueIdBytes = 128

Connection Setup and Memory Management

class Connection

Connection between two processes.

Public Functions

Connection(std::shared_ptr<Context> context, const Endpoint &localEndpoint)

Constructor.

Parameters:
  • context – The context associated with the connection.

  • localEndpoint – The local endpoint of the connection.

virtual ~Connection() = default

Destructor.

virtual void write(RegisteredMemory dst, uint64_t dstOffset, RegisteredMemory src, uint64_t srcOffset, uint64_t size) = 0

Write data from a source RegisteredMemory to a destination RegisteredMemory.

Parameters:
virtual void updateAndSync(RegisteredMemory dst, uint64_t dstOffset, uint64_t *src, uint64_t newValue) = 0

Update an 8-byte value in a destination RegisteredMemory and synchronize the change with the remote process.

Parameters:
  • dst – The destination RegisteredMemory.

  • dstOffset – The offset in bytes from the start of the destination RegisteredMemory.

  • src – A pointer to the value to update.

  • newValue – The new value to write.

virtual void flush(int64_t timeoutUsec = -1) = 0

Flush any pending writes to the remote process.

Parameters:

timeoutUsec – Timeout in microseconds. Default: -1 (no timeout)

virtual Transport transport() const = 0

Get the transport used by the local process.

Returns:

The transport used by the local process.

virtual Transport remoteTransport() const = 0

Get the transport used by the remote process.

Returns:

The transport used by the remote process.

std::shared_ptr<Context> context() const

Get the context associated with this connection.

Returns:

A shared pointer to the context associated with this connection.

const Device &localDevice() const

Get the device used by the local endpoint.

Returns:

The device used by the local endpoint.

int getMaxWriteQueueSize() const

Get the maximum write queue size.

Returns:

The maximum number of write requests that can be queued.

class Context : public std::enable_shared_from_this<Context>

Context for communication. This provides a low-level interface for forming connections in use-cases where the process group abstraction offered by Communicator is not suitable, e.g., ephemeral client-server connections. Correct use of this class requires external synchronization when finalizing connections with the connect() method.

As an example, a client-server scenario where the server will write to the client might proceed as follows:

  1. The client creates an endpoint with createEndpoint() and sends it to the server.

  2. The server receives the client endpoint, creates its own endpoint with createEndpoint(), sends it to the client, and creates a connection with connect().

  3. The client receives the server endpoint, creates a connection with connect() and sends a RegisteredMemory to the server.

  4. The server receives the RegisteredMemory and writes to it using the previously created connection. The client waiting to create a connection before sending the RegisteredMemory ensures that the server cannot write to the RegisteredMemory before the connection is established.

While some transports may have more relaxed implementation behavior, this should not be relied upon.

Public Functions

~Context()

Destructor.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this context.

Parameters:
  • ptr – Base pointer to the memory.

  • size – Size of the memory region in bytes.

  • transports – Transport flags.

Returns:

A RegisteredMemory object representing the registered memory region.

Endpoint createEndpoint(EndpointConfig config)

Create an endpoint for establishing connections.

Parameters:

config – The configuration for the endpoint.

Returns:

The newly created endpoint.

std::shared_ptr<Connection> connect(const Endpoint &localEndpoint, const Endpoint &remoteEndpoint)

Establish a connection between two endpoints. While this method immediately returns a connection object, the connection is only safe to use after the corresponding connection on the remote endpoint has been established. This method must be called on both endpoints to establish a connection.

Parameters:
  • localEndpoint – The local endpoint.

  • remoteEndpoint – The remote endpoint.

Returns:

A shared pointer to the connection.

Public Static Functions

static inline std::shared_ptr<Context> create()

Create a new Context instance.

class Communicator

A class that sets up all registered memories and connections between processes.

A typical way to use this class:

  1. Call connect() to declare connections between the calling process and other processes.

  2. Call registerMemory() to register memory regions that will be used for communication.

  3. Call sendMemory() or recvMemory() to send/receive registered memory regions to/from other processes.

  4. Call get() on futures returned by connect(). Use the returned connections to create flags.

  5. Call buildSemaphore() to create a Semaphore out of the flags.

  6. Call get() on all futures returned by buildSemaphore() and recvMemory().

  7. All done; use semaphores and registered memories to build channels.

CAUTION: The order of method calls matters when the same remote rank and tags are used. That is, the i-th “sending” method call (sendMemory(), connect(), and buildSemaphore()) on the local rank must be matched by the i-th “receiving” method call (recvMemory(), connect(), and buildSemaphore()) on the remote rank.

Correct Example 1:

// Rank 0
communicator.sendMemory(memory1, 1, tag);
communicator.sendMemory(memory2, 1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
connection.get(); // This will return the connection.
// Rank 1
auto mem1 = communicator.recvMemory(0, tag);
auto mem2 = communicator.recvMemory(0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag);
mem2.get();       // This will return memory2.
connection.get(); // This will return the connection.
mem1.get();       // This will return memory1.

Correct Example 2:

// Rank 0
communicator.sendMemory(memory0, 1, tag);
auto mem1 = communicator.recvMemory(1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
connection.get(); // This will return the connection.
mem1.get();       // This will return memory1.
// Rank 1
auto mem0 = communicator.recvMemory(0, tag);
communicator.sendMemory(memory1, 0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag);
mem0.get();       // This will return memory0.
connection.get(); // This will return the connection.

Wrong Example:

// Rank 0
communicator.sendMemory(memory0, 1, tag);
auto mem1 = communicator.recvMemory(1, tag);
auto connection = communicator.connect(Transport::CudaIpc, 1, tag);
// Rank 1
auto mem0 = communicator.recvMemory(0, tag);
auto connection = communicator.connect(Transport::CudaIpc, 0, tag); // undefined behavior
communicator.sendMemory(memory1, 0, tag);
In the wrong example, the connection information from rank 1 will be sent to the mem1 object on rank 0, where the object type is RegisteredMemory, not Connection.

Public Functions

Communicator(std::shared_ptr<Bootstrap> bootstrap, std::shared_ptr<Context> context = nullptr)

Initializes the communicator with a given bootstrap implementation.

Parameters:
  • bootstrap – An implementation of the Bootstrap that the communicator will use.

  • context – An optional context to use for the communicator. If not provided, a new context will be created.

~Communicator()

Destroy the communicator.

std::shared_ptr<Bootstrap> bootstrap()

Returns the bootstrap held by this communicator.

Returns:

The bootstrap held by this communicator.

std::shared_ptr<Context> context()

Returns the context held by this communicator.

Returns:

The context held by this communicator.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this communicator’s context.

Parameters:
  • ptr – Base pointer to the memory.

  • size – Size of the memory region in bytes.

  • transports – Transport flags.

Returns:

A RegisteredMemory object representing the registered memory region.

void sendMemory(RegisteredMemory memory, int remoteRank, int tag = 0)

Send information of a registered memory to the remote side.

The send will be started upon calling this function, but this function returns immediately without waiting for the completion of the send. RegisteredMemory sent via sendMemory(memory, remoteRank, tag) can be received via recvMemory(remoteRank, tag).

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Parameters:
  • memory – The registered memory buffer to send information about.

  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the send.

std::shared_future<RegisteredMemory> recvMemory(int remoteRank, int tag = 0)

Receive memory information from a corresponding sendMemory call on the remote side.

This function returns a future immediately. The actual receive will be performed upon calling the first get() on the future. RegisteredMemory sent via sendMemory(memory, remoteRank, tag) can be received via recvMemory(remoteRank, tag).

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Note

To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same remoteRank and tag. For example, if we call recvMemory() or connect() five times with the same remoteRank and tag and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.

Parameters:
  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the receive.

Returns:

A future of registered memory.

std::shared_future<std::shared_ptr<Connection>> connect(EndpointConfig localConfig, int remoteRank, int tag = 0)

Connect to a remote rank.

This function will start (but not wait for) sending metadata about the local endpoint to the remote rank, and return a future connection without waiting for the remote rank to respond. The connection will be established when the remote rank responds with its own endpoint and the local rank calls the first get() on the future. Note that this function is two-way and a connection from rank i to remote rank j needs to have a counterpart from rank j to rank i. Note that with IB, buffers are registered at a page level and if a buffer is spread through multiple pages and does not fully utilize all of them, IB’s QP has to register for all involved pages. This potentially has security risks if the connection’s accesses are given to a malicious process.

Multiple calls to either sendMemory() or connect() with the same remoteRank and tag will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using different tag values to prevent unexpected behavior.

Note

To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same remoteRank and tag. For example, if we call recvMemory() or connect() five times with the same remoteRank and tag and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.

Parameters:
  • localConfig – The configuration for the local endpoint.

  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the send and receive.

Returns:

A future of shared pointer to the connection.

std::shared_future<Semaphore> buildSemaphore(std::shared_ptr<Connection> connection, int remoteRank, int tag = 0)

Build a semaphore for cross-process synchronization.

Parameters:
  • connection – The connection associated with this semaphore.

  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the operation.

Returns:

A future of the built semaphore.

int remoteRankOf(const Connection &connection)

Get the remote rank a connection is connected to.

Parameters:

connection – The connection to get the remote rank for.

Returns:

The remote rank the connection is connected to.

int tagOf(const Connection &connection)

Get the tag a connection was made with.

Parameters:

connection – The connection to get the tag for.

Returns:

The tag the connection was made with.

struct Device

Declaration of a device.

Public Functions

Device() = default

Constructor.

inline Device(DeviceType type, int id = -1)

Constructor.

Parameters:
  • typeDevice type.

  • idDevice ID. Default is -1 (no specific ID).

Public Members

DeviceType type

Device Type.

int id

Device ID.

class Endpoint

One end of a connection.

Public Functions

Endpoint() = default

Constructor.

Transport transport() const

Get the transport used.

Returns:

The transport used.

const Device &device() const

Get the device used.

Returns:

The device used.

uint64_t hostHash() const

Get the host hash.

Returns:

The host hash.

uint64_t pidHash() const

Get the process ID hash.

Returns:

The process ID hash.

int maxWriteQueueSize() const

Get the maximum write queue size.

Returns:

The maximum number of write requests that can be queued.

std::vector<char> serialize() const

Serialize the Endpoint object to a vector of characters.

Returns:

A vector of characters representing the serialized Endpoint object.

Public Static Functions

static Endpoint deserialize(const std::vector<char> &data)

Deserialize an Endpoint object from a vector of characters.

Parameters:

data – A vector of characters representing a serialized Endpoint object.

Returns:

A deserialized Endpoint object.

struct EndpointConfig

Used to configure an endpoint.

Public Functions

inline EndpointConfig(Transport transport = Transport::Unknown, Device device = DeviceType::GPU, int ibMaxCqSize = DefaultMaxCqSize, int ibMaxCqPollNum = DefaultMaxCqPollNum, int ibMaxSendWr = DefaultMaxSendWr, int ibMaxWrPerSend = DefaultMaxWrPerSend, int maxWriteQueueSize = -1)

Constructor that takes a transport and sets the other fields to their default values.

Parameters:
  • transport – The transport to use.

  • device – The device to use.

  • ibMaxCqSize – The maximum completion queue size.

  • ibMaxCqPollNum – The maximum completion queue poll number.

  • ibMaxSendWr – The maximum send work requests.

  • ibMaxWrPerSend – The maximum work requests per send.

  • maxWriteQueueSize – The maximum write queue size.

class NvlsConnection

Public Functions

SwitchChannel bindAllocatedMemory(CUdeviceptr devicePtr, size_t size)

Bind the memory allocated via mscclpp::GpuBuffer to the multicast handle. The behavior is undefined if the devicePtr is not allocated by mscclpp::GpuBuffer.

Parameters:
  • devicePtr – The device pointer returned by mscclpp::GpuBuffer::data().

  • size – The bytes of the memory to bind to the multicast handle.

Returns:

SwitchChannel with devicePtr, mcPtr and bufferSize

class RegisteredMemory

Block of memory that has been registered to a Context. RegisteredMemory does not own the memory it points to, but it provides a way to transfer metadata about the memory to other processes, hence allowing their access to the memory block.

Public Functions

RegisteredMemory() = default

Constructor.

~RegisteredMemory()

Destructor.

void *data() const

Get a pointer to the memory block.

Returns:

A pointer to the memory block.

void *originalDataPtr() const

Get a pointer to the original memory block.

Returns:

A pointer to the original memory block.

size_t size() const

Get the size of the memory block.

Returns:

The size of the memory block.

TransportFlags transports() const

Get the transport flags associated with the memory block.

Returns:

The transport flags associated with the memory block.

std::vector<char> serialize() const

Serialize the RegisteredMemory object to a vector of characters.

Returns:

A vector of characters representing the serialized RegisteredMemory object.

Public Static Functions

static RegisteredMemory deserialize(const std::vector<char> &data)

Deserialize a RegisteredMemory object from a vector of characters.

Parameters:

data – A vector of characters representing a serialized RegisteredMemory object.

Returns:

A deserialized RegisteredMemory object.

class TransportFlags : private detail::TransportFlagsBase

Stores transport flags.

Public Functions

TransportFlags() = default

Constructor.

TransportFlags(Transport transport)

Constructor.

Parameters:

transport – The transport to set the flag for.

bool has(Transport transport) const

Check if a specific transport flag is set.

Parameters:

transport – The transport to check the flag for.

Returns:

True if the flag is set, false otherwise.

bool none() const

Check if no transport flags are set.

Returns:

True if no flags are set, false otherwise.

bool any() const

Check if any transport flags are set.

Returns:

True if any flags are set, false otherwise.

bool all() const

Check if all transport flags are set.

Returns:

True if all flags are set, false otherwise.

size_t count() const

Get the number of transport flags that are set.

Returns:

The number of flags that are set.

TransportFlags &operator|=(TransportFlags other)

Bitwise OR assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the OR operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator|(TransportFlags other) const

Bitwise OR operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the OR operation with.

Returns:

A new TransportFlags object with the result of the OR operation.

TransportFlags operator|(Transport transport) const

Bitwise OR operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the OR operation with.

Returns:

A new TransportFlags object with the result of the OR operation.

TransportFlags &operator&=(TransportFlags other)

Bitwise AND assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the AND operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator&(TransportFlags other) const

Bitwise AND operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the AND operation with.

Returns:

A new TransportFlags object with the result of the AND operation.

TransportFlags operator&(Transport transport) const

Bitwise AND operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the AND operation with.

Returns:

A new TransportFlags object with the result of the AND operation.

TransportFlags &operator^=(TransportFlags other)

Bitwise XOR assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the XOR operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator^(TransportFlags other) const

Bitwise XOR operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the XOR operation with.

Returns:

A new TransportFlags object with the result of the XOR operation.

TransportFlags operator^(Transport transport) const

Bitwise XOR operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the XOR operation with.

Returns:

A new TransportFlags object with the result of the XOR operation.

TransportFlags operator~() const

Bitwise NOT operator for TransportFlags.

Returns:

A new TransportFlags object with the result of the NOT operation.

bool operator==(TransportFlags other) const

Equality comparison operator for TransportFlags.

Parameters:

other – The other TransportFlags to compare with.

Returns:

True if the two TransportFlags objects are equal, false otherwise.

bool operator!=(TransportFlags other) const

Inequality comparison operator for TransportFlags.

Parameters:

other – The other TransportFlags to compare with.

Returns:

True if the two TransportFlags objects are not equal, false otherwise.

detail::TransportFlagsBase toBitset() const

Convert the TransportFlags object to a bitset representation.

Returns:

A detail::TransportFlagsBase object representing the TransportFlags object.

enum class mscclpp::DeviceType

Available device types.

Values:

enumerator Unknown
enumerator CPU
enumerator GPU
enum class mscclpp::Transport

Enumerates the available transport types.

Values:

enumerator Unknown
enumerator CudaIpc
enumerator IB0
enumerator IB1
enumerator IB2
enumerator IB3
enumerator IB4
enumerator IB5
enumerator IB6
enumerator IB7
enumerator Ethernet
enumerator NumTransports
std::shared_ptr<NvlsConnection> mscclpp::connectNvlsCollective(std::shared_ptr<Communicator> comm, std::vector<int> allRanks, size_t bufferSize)

Connect to NVLS on setup.

This function used to connect to NVLS on setup. NVLS collective using multicast operations to send/recv data. Here we need to put all involved ranks into the collective group.

Parameters:
  • comm – The communicator.

  • allRanks – The ranks of all processes involved in the collective.

  • config – The configuration for the local endpoint.

Returns:

std::shared_ptr<NvlsConnection> A shared pointer to the NVLS connection.

Semaphores

class Host2DeviceSemaphore

A semaphore for sending signals from the host to the device.

Public Types

using DeviceHandle = Host2DeviceSemaphoreDeviceHandle

Device-side handle for Host2DeviceSemaphore.

Public Functions

Host2DeviceSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:

semaphore

Host2DeviceSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with this semaphore.

std::shared_ptr<Connection> connection() const

Returns the connection.

Returns:

The connection associated with this semaphore.

void signal()

Signal the device.

DeviceHandle deviceHandle() const

Returns the device-side handle.

class Host2HostSemaphore

A semaphore for sending signals from the local host to a remote host.

Public Functions

Host2HostSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:

semaphore

Host2HostSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
std::shared_ptr<Connection> connection() const

Returns the connection.

Returns:

The connection associated with this semaphore.

void signal()

Signal the remote host.

bool poll()

Check if the remote host has signaled.

Returns:

true if the remote host has signaled.

void wait(int64_t maxSpinCount = 10000000)

Wait for the remote host to signal.

Parameters:

maxSpinCount – The maximum number of spin counts before throwing an exception. Never throws if negative.

class MemoryDevice2DeviceSemaphore

A semaphore for sending signals from the local device to a peer device via a GPU thread.

Public Types

using DeviceHandle = MemoryDevice2DeviceSemaphoreDeviceHandle

Device-side handle for MemoryDevice2DeviceSemaphore.

Public Functions

MemoryDevice2DeviceSemaphore(const Semaphore &semaphore)

Constructor.

Parameters:

semaphore

MemoryDevice2DeviceSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with this semaphore.

std::shared_ptr<Connection> connection() const

Returns the connection.

Returns:

The connection associated with this semaphore.

DeviceHandle deviceHandle() const

Returns the device-side handle.

class Semaphore

Semaphore used by channels for synchronization.

Public Functions

Semaphore() = default

Constructor.

Semaphore(const SemaphoreStub &localStub, const SemaphoreStub &remoteStub)

Constructor.

Parameters:
std::shared_ptr<Connection> connection() const

Get the connection associated with this semaphore.

Returns:

A shared pointer to the connection.

const RegisteredMemory &localMemory() const

Get the local memory associated with this semaphore.

Returns:

A reference to the local registered memory.

const RegisteredMemory &remoteMemory() const

Get the remote memory associated with this semaphore.

Returns:

A reference to the remote registered memory.

class SemaphoreStub

SemaphoreStub object only used for constructing Semaphore, not for direct use by the user.

Public Functions

SemaphoreStub(std::shared_ptr<Connection> connection)

Constructor.

Parameters:

connection – A shared pointer to the connection associated with this semaphore.

const RegisteredMemory &memory() const

Get the memory associated with this semaphore.

Returns:

A reference to the registered memory for this semaphore.

std::vector<char> serialize() const

Serialize into a vector of characters.

Returns:

A vector of characters representing the serialized SemaphoreStub object.

Public Static Functions

static SemaphoreStub deserialize(const std::vector<char> &data)

Deserialize a SemaphoreStub object from a vector of characters.

Parameters:

data – A vector of characters representing a serialized SemaphoreStub object.

Returns:

A deserialized SemaphoreStub object.

Channels

struct BaseMemoryChannel

Memory channel without specifying source/destination memory regions.

Subclassed by mscclpp::MemoryChannel

Public Types

using DeviceHandle = BaseMemoryChannelDeviceHandle

Device-side handle for BaseMemoryChannel.

Public Functions

BaseMemoryChannel() = default

Constructor.

BaseMemoryChannel(std::shared_ptr<MemoryDevice2DeviceSemaphore> semaphore)

Constructor.

Parameters:

semaphoreSemaphore used to synchronize the communication.

BaseMemoryChannel(const Semaphore &semaphore)

Constructor.

Parameters:

semaphoreSemaphore used to synchronize the communication.

BaseMemoryChannel(const BaseMemoryChannel &other) = default

Constructor.

Parameters:

other – Other BaseMemoryChannel to copy from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the BaseMemoryChannel is not released when using the returned handle.

Returns:

The device-side handle.

struct BasePortChannel

Port channel without specifying source/destination memory regions.

Subclassed by mscclpp::PortChannel

Public Types

using DeviceHandle = BasePortChannelDeviceHandle

Device-side handle for BasePortChannel.

Public Functions

BasePortChannel() = default

Constructor.

BasePortChannel(SemaphoreId semaphoreId, std::shared_ptr<Host2DeviceSemaphore> semaphore, std::shared_ptr<Proxy> proxy)

Constructor.

Parameters:
  • semaphoreId – The ID of the semaphore.

  • semaphore – The semaphore used to synchronize the communication.

  • proxy – The proxy used for communication.

BasePortChannel(SemaphoreId semaphoreId, const Semaphore &semaphore, std::shared_ptr<Proxy> proxy)

Constructor.

Parameters:
  • semaphoreId – The ID of the semaphore.

  • semaphore – The semaphore used to synchronize the communication.

  • proxy – The proxy used for communication.

BasePortChannel(const BasePortChannel &other) = default

Copy constructor.

Parameters:

other – The other BasePortChannel to copy from.

BasePortChannel &operator=(BasePortChannel &other) = default

Assignment operator.

Parameters:

other – The other BasePortChannel to assign from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the BasePortChannel is not released when using the returned handle.

Returns:

The device-side handle.

struct MemoryChannel : public mscclpp::BaseMemoryChannel

Channel for accessing peer memory directly from GPU threads.

Public Types

using DeviceHandle = MemoryChannelDeviceHandle

Device-side handle for MemoryChannel.

Public Functions

MemoryChannel() = default

Constructor.

MemoryChannel(std::shared_ptr<MemoryDevice2DeviceSemaphore> semaphore, RegisteredMemory dst, RegisteredMemory src, void *packetBuffer = nullptr)

Constructor.

Parameters:
  • semaphore – The semaphore used to synchronize the communication.

  • dst – Registered memory of the destination.

  • src – Registered memory of the source.

  • packetBuffer – A buffer used to store packets. packetBuffer is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.

MemoryChannel(const Semaphore &semaphore, RegisteredMemory dst, RegisteredMemory src, void *packetBuffer = nullptr)

Constructor.

Parameters:
  • semaphore – The semaphore used to synchronize the communication.

  • dst – Registered memory of the destination.

  • src – Registered memory of the source.

  • packetBuffer – A buffer used to store packets. packetBuffer is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the MemoryChannel is not released when using the returned handle.

Returns:

The device-side handle.

struct PortChannel : public mscclpp::BasePortChannel

Port channel.

Public Types

using DeviceHandle = PortChannelDeviceHandle

Device-side handle for PortChannel.

Public Functions

PortChannel() = default

Constructor.

PortChannel(SemaphoreId semaphoreId, std::shared_ptr<Host2DeviceSemaphore> semaphore, std::shared_ptr<Proxy> proxy, MemoryId dst, MemoryId src)

Constructor.

Parameters:
  • semaphoreId – The ID of the semaphore.

  • semaphore – The semaphore.

  • proxy – The proxy.

  • dst – The destination memory region.

  • src – The source memory region.

PortChannel(SemaphoreId semaphoreId, const Semaphore &semaphore, std::shared_ptr<Proxy> proxy, MemoryId dst, MemoryId src)

Constructor.

Parameters:
  • semaphoreId – The ID of the semaphore.

  • semaphore – The semaphore.

  • proxy – The proxy.

  • dst – The destination memory region.

  • src – The source memory region.

PortChannel(const PortChannel &other) = default

Copy constructor.

Parameters:

other – The other PortChannel to copy from.

PortChannel &operator=(PortChannel &other) = default

Assignment operator.

Parameters:

other – The other PortChannel to assign from.

DeviceHandle deviceHandle() const

Returns the device-side handle. User should make sure the PortChannel is not released when using the returned handle.

Returns:

The device-side handle.

struct SwitchChannel

Proxy Service and FIFO Management

class BaseProxyService

Base class for proxy services. Proxy services are used to proxy data between devices.

Subclassed by mscclpp::ProxyService

class Fifo

Host-side proxy FIFO for device-produced work elements.

Public Functions

Fifo(int size = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:

size – Number of entries (default: DEFAULT_FIFO_SIZE).

~Fifo()

Destructor.

ProxyTrigger poll()

Poll and get the trigger at the head.

Returns:

ProxyTrigger at the head of the FIFO.

void pop()

Remove the head trigger.

int size() const

Get FIFO size.

Returns:

Number of entries in the FIFO.

FifoDeviceHandle deviceHandle() const

Get device-side FIFO handle.

Returns:

FifoDeviceHandle for device access.

class Proxy

Host-side proxy for PortChannels.

Public Functions

Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:
  • handler – Handler for each FIFO trigger.

  • threadInit – Optional function run in proxy thread before FIFO consumption.

  • fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).

Proxy(ProxyHandler handler, int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:
  • handler – Handler for each FIFO trigger.

  • fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).

~Proxy()

Destructor. Stops proxy if running.

void start()

Start proxy.

void stop()

Stop proxy.

std::shared_ptr<Fifo> fifo()

Get reference to FIFO used by proxy.

Returns:

Shared pointer to FIFO.

class ProxyService : public mscclpp::BaseProxyService

Proxy service implementation.

Public Functions

ProxyService(int fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

Parameters:

fifoSize – Size of the FIFO used by the proxy service (default: DEFAULT_FIFO_SIZE).

SemaphoreId buildAndAddSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Build and add a semaphore to the proxy service.

Parameters:

connection – The connection associated with the semaphore.

Returns:

The ID of the semaphore.

SemaphoreId addSemaphore(const Semaphore &semaphore)

Add a semaphore to the proxy service.

Parameters:

semaphore – The semaphore to be added

Returns:

The ID of the semaphore.

SemaphoreId addSemaphore(std::shared_ptr<Host2DeviceSemaphore> semaphore)

Add a semaphore to the proxy service.

Parameters:

semaphore – The semaphore to be added

Returns:

The ID of the semaphore.

MemoryId addMemory(RegisteredMemory memory)

Register a memory region with the proxy service.

Parameters:

memory – The memory region to register.

Returns:

The ID of the memory region.

std::shared_ptr<Host2DeviceSemaphore> semaphore(SemaphoreId id) const

Get a semaphore by ID.

Parameters:

id – The ID of the semaphore.

Returns:

The semaphore.

BasePortChannel basePortChannel(SemaphoreId id)

Get a base port channel by semaphore ID.

Parameters:

id – The ID of the semaphore.

Returns:

The base port channel.

PortChannel portChannel(SemaphoreId id, MemoryId dst, MemoryId src)

Get a port channel by semaphore ID and memory regions.

Parameters:
  • id – The ID of the semaphore.

  • dst – The destination memory region.

  • src – The source memory region.

Returns:

The port channel.

virtual void startProxy()

Start the proxy service.

virtual void stopProxy()

Stop the proxy service.

using mscclpp::ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>

Handler function type for proxy.

enum class mscclpp::ProxyHandlerResult

Return values for ProxyHandler.

Values:

enumerator Continue

Move to next trigger in FIFO.

enumerator Stop

Stop and exit proxy.

constexpr size_t mscclpp::DEFAULT_FIFO_SIZE = 512

Utilities

struct AvoidCudaGraphCaptureGuard

A RAII guard that will cudaThreadExchangeStreamCaptureMode to cudaStreamCaptureModeRelaxed on construction and restore the previous mode on destruction. This is helpful when we want to avoid CUDA/HIP graph capture.

struct CudaStreamWithFlags

A RAII wrapper around cudaStream_t that will call cudaStreamDestroy on destruction.

Public Functions

CudaStreamWithFlags()

Constructor without flags. This will not create any stream. set() can be called later to create a stream with specified flags.

CudaStreamWithFlags(unsigned int flags)

Constructor with flags. This will create a stream with the specified flags on the current device.

Parameters:

flags – The flags to create the stream with.

~CudaStreamWithFlags()

Destructor. This will destroy the stream if it was created.

void set(unsigned int flags)

Set the stream with the specified flags. The current device at the time of the construction will be used. If the stream was already created, it will raise an error with ErrorCode::InvalidUsage.

Parameters:

flags – The flags to create the stream with.

Throws:

Error – if the stream was already created.

bool empty() const

Check if the stream is empty (not created).

Returns:

true if the stream is empty, false otherwise.

template<class T = char>
class GpuBuffer

Allocates a GPU memory space specialized for communication. The memory is zeroed out. Get the device pointer by GpuBuffer::data().

Use this function for communication buffers, i.e., only when other devices (CPU, GPU, NIC, etc.) may access this memory space at the same time with the local device (GPU). Running heavy computation over this memory space may perform bad and is not recommended in general.

The allocated memory space is managed by the memory_ object, not by the class instance. Which means, the class destructor will NOT free the allocated memory if memory_ is shared with and alive in other contexts.

Template Parameters:

T – Type of each element in the allocated memory. Default is char.

Public Functions

inline GpuBuffer(size_t nelems)

Constructs a GpuBuffer with the specified number of elements.

Parameters:

nelems – Number of elements to allocate. If it is zero, data() will return a null pointer.

inline size_t nelems() const

Returns the number of elements in the allocated memory.

Returns:

The number of elements.

inline size_t bytes() const

Returns the number of bytes that is actually allocated. This may be larger than nelems() * sizeof(T).

Returns:

The number of bytes.

inline std::shared_ptr<T> memory()

Returns the shared pointer to the allocated memory. If nelems() is zero, this function will return an empty shared pointer.

Returns:

A std::shared_ptr to the allocated memory.

inline T *data()

Returns the device pointer to the allocated memory. Equivalent to memory().get(). If nelems() is zero, this function will return a null pointer.

Returns:

A device pointer to the allocated memory.

inline int deviceId() const

Returns the device id of the allocated memory.

Returns:

The device id.

class GpuStream

A managed non-blocking GPU stream object created by GpuStreamPool. This object does not own the stream. When this object is destroyed, it will return the underlying CudaStreamWithFlags to the pool.

Public Functions

~GpuStream()

Destructor. This will return the underlying CudaStreamWithFlags to the pool, not destroy it.

class GpuStreamPool

A pool of managed GPU streams. Only provides non-blocking streams. This is intended to be used for reusing temporal streams.

Public Functions

GpuStream getStream()

Get a non-blocking GPU stream from the pool. If no streams are available, a new one will be created.

Returns:

A GpuStream object.

void clear()

Clear the pool, which will remove all streams from the pool.

int mscclpp::getDeviceNumaNode(int deviceId)

Return the NUMA node ID of the given GPU device ID.

Parameters:

deviceId – The GPU device ID.

Throws:

Error – if the device ID is invalid or if the NUMA node cannot be determined.

Returns:

The NUMA node ID of the device.

std::string mscclpp::getHostName(int maxlen, const char delim)

Get the host name of the system.

Parameters:
  • maxlen – The maximum length of the returned string.

  • delim – The delimiter to use for the host name; if the delimiter is found before maxlen characters, the host name will be truncated at that point.

Throws:

Error – if it fails to retrieve the host name (error code: SystemError).

Returns:

The host name of the system, truncated to maxlen characters if necessary, and split by the specified delimiter.

int mscclpp::getIBDeviceCount()

Get the number of available InfiniBand devices.

Returns:

The number of available InfiniBand devices.

std::string mscclpp::getIBDeviceName(Transport ibTransport)

Get the name of the InfiniBand device associated with the specified transport.

Parameters:

ibTransport – The InfiniBand transport to get the device name for.

Returns:

The name of the InfiniBand device associated with the specified transport.

Transport mscclpp::getIBTransportByDeviceName(const std::string &ibDeviceName)

Get the InfiniBand transport associated with the specified device name.

Parameters:

ibDeviceName – The name of the InfiniBand device to get the transport for.

Returns:

The InfiniBand transport associated with the specified device name.

template<class T = char>
void mscclpp::gpuMemcpy(T *dst, const T *src, size_t nelems, cudaMemcpyKind kind = cudaMemcpyDefault)

Copies memory from src to dst synchronously.

Template Parameters:

T – Type of each element in the memory.

Parameters:
  • dst – Destination address.

  • src – Source address.

  • nelems – Number of elements to copy.

  • kind – The kind of copy operation. Default is cudaMemcpyDefault.

template<class T = char>
void mscclpp::gpuMemcpyAsync(T *dst, const T *src, size_t nelems, cudaStream_t stream, cudaMemcpyKind kind = cudaMemcpyDefault)

Copies memory from src to dst asynchronously.

Template Parameters:

T – Type of each element in the memory.

Parameters:
  • dst – Destination address.

  • src – Source address.

  • nelems – Number of elements to copy.

  • stream – The stream to use for the copy operation.

  • kind – The kind of copy operation. Default is cudaMemcpyDefault.

std::shared_ptr<GpuStreamPool> mscclpp::gpuStreamPool()

Get the singleton instance of GpuStreamPool.

Returns:

A shared pointer to the GpuStreamPool instance.

bool mscclpp::isCuMemMapAllocated(void *ptr)

Check if ptr is allocaed by cuMemMap.

Parameters:

ptr – The pointer to check.

Returns:

True if the pointer is allocated by cuMemMap, false otherwise.

bool mscclpp::isNvlsSupported()

Check if NVLink SHARP (NVLS) is supported.

Returns:

True if NVLink SHARP (NVLS) is supported, false otherwise.

void mscclpp::numaBind(int node)

NUMA bind the current thread to the specified NUMA node.

Parameters:

node – The NUMA node ID to bind to.

Throws:

Error – if the given NUMA node ID is invalid.

Executor Interface

class ExecutionPlan
class Executor
enum class mscclpp::DataType

Values:

enumerator INT32
enumerator UINT32
enumerator FLOAT16
enumerator FLOAT32
enumerator BFLOAT16
enum class mscclpp::PacketType

Values:

enumerator LL8
enumerator LL16

Environment and Configuration

class Env

The MSCCL++ environment. The constructor reads environment variables and sets the corresponding fields. Use the env() function to get the environment object.

Public Members

const std::string debug

Env name: MSCCLPP_DEBUG. The debug flag, one of VERSION, WARN, INFO, ABORT, or TRACE. Unset by default.

const std::string debugSubsys

Env name: MSCCLPP_DEBUG_SUBSYS. The debug subsystem, a comma-separated list of subsystems to enable debug logging for. If the first character is ‘^’, it inverts the mask, i.e., enables all subsystems except those specified. Possible values are INIT, COLL, P2P, SHM, NET, GRAPH, TUNING, ENV, ALLOC, CALL, MSCCLPP_EXECUTOR, ALL. Unset by default.

const std::string debugFile

Env name: MSCCLPP_DEBUG_FILE. A file path to write debug logs to. Unset by default.

const std::string hcaDevices

Env name: MSCCLPP_HCA_DEVICES. A comma-separated list of HCA devices to use for IB transport. i-th device in the list will be used for the i-th GPU in the system. If unset, it will use ibverbs APIs to find the devices automatically.

const std::string hostid

Env name: MSCCLPP_HOSTID. A string that uniquely identifies the host. If unset, it will use the hostname. This is used to determine whether the host is the same across different processes.

const std::string socketFamily

Env name: MSCCLPP_SOCKET_FAMILY. The socket family to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). Possible values are AF_INET (IPv4) and AF_INET6 (IPv6). If unset, it will not force any family and will use the first one found.

const std::string socketIfname

Env name: MSCCLPP_SOCKET_IFNAME. The interface name to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). If unset, it will use the first interface found that matches the socket family.

const std::string commId

Env name: MSCCLPP_COMM_ID. To be deprecated; don’t use this.

const std::string executionPlanDir

Env name: MSCCLPP_EXECUTION_PLAN_DIR. The directory to find execution plans from. This should be set to use execution plans for the NCCL API. Unset by default.

const std::string npkitDumpDir

Env name: MSCCLPP_NPKIT_DUMP_DIR. The directory to dump NPKIT traces to. If this is set, NPKIT will be enabled and will dump traces to this directory. Unset by default.

const bool cudaIpcUseDefaultStream

Env name: MSCCLPP_CUDAIPC_USE_DEFAULT_STREAM. If set to true, the CUDA IPC transport will use the default stream for all operations. If set to false, it will use a separate stream for each operation. This is an experimental feature and should be false in most cases. Default is false.

const std::string ncclSharedLibPath

Env name: MSCCLPP_NCCL_LIB_PATH. The path to the original NCCL/RCCL shared library. If set, it will be used as a fallback for NCCL operations in cases where the MSCCL++ NCCL cannot work.

const std::string forceNcclFallbackOperation

Env name: MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION. A comma-separated list of NCCL operations that should always use the fallback implementation, even if the MSCCL++ NCCL can handle them. This is useful for debugging purposes. Currently supports all, broadcast, allreduce, reducescatter, and allgather.

const bool enableNcclFallback

Env name: MSCCLPP_ENABLE_NCCL_FALLBACK. If set to true, it will enable the fallback implementation for NCCL operations. This is useful for debugging purposes. Default is false.

const bool disableChannelCache

Env name: MSCCLPP_DISABLE_CHANNEL_CACHE. If set to true, it will disable the channel cache for NCCL APIs. Currently, this should be set to true if the application may call NCCL APIs on the same local buffer with different remote buffers, e.g., in the case of a dynamic communicator. If CUDA/HIP graphs are used, disabling the channel cache won’t affect the performance, but otherwise it may lead to performance degradation. Default is false.

const bool forceDisableNvls

Env name: MSCCLPP_FORCE_DISABLE_NVLS. If set to true, it will disable the NVLS support in MSCCL++. Default is false.

std::shared_ptr<Env> mscclpp::env()

Get the MSCCL++ environment.

Returns:

A reference to the global environment object.

Error Handling

class BaseError : public std::runtime_error

Base class for all errors thrown by MSCCL++.

Subclassed by mscclpp::CuError, mscclpp::CudaError, mscclpp::Error, mscclpp::IbError, mscclpp::SysError

Public Functions

BaseError(const std::string &message, int errorCode)

Constructor of BaseError.

Parameters:
  • message – The error message.

  • errorCode – The error code.

explicit BaseError(int errorCode)

Constructor of BaseError.

Parameters:

errorCode – The error code.

virtual ~BaseError() = default

Virtual destructor for BaseError.

int getErrorCode() const

Get the error code.

Returns:

The error code.

const char *what() const noexcept override

Get the error message.

Returns:

The error message.

class CudaError : public mscclpp::BaseError

An error from a CUDA runtime library call.

class CuError : public mscclpp::BaseError

An error from a CUDA driver library call.

class Error : public mscclpp::BaseError

A generic error.

class IbError : public mscclpp::BaseError

An error from an ibverbs library call.

class SysError : public mscclpp::BaseError

An error from a system call that sets errno.

enum class mscclpp::ErrorCode

Enumeration of error codes used by MSCCL++.

Values:

enumerator SystemError
enumerator InternalError
enumerator RemoteError
enumerator InvalidUsage
enumerator Timeout
enumerator Aborted
enumerator ExecutorError
std::string mscclpp::errorToString(enum ErrorCode error)

Convert an error code to a string.

Parameters:

error – The error code to convert.

Returns:

The string representation of the error code.

Version

std::string mscclpp::version()

Return a version string.

Returns:

The MSCCL++ version string in “major.minor.patch” format.

Macro Functions

MSCCLPP_CUDATHROW(cmd)

Throw mscclpp::CudaError if cmd does not return cudaSuccess.

Parameters:
  • cmd – The command to execute.

MSCCLPP_CUTHROW(cmd)

Throw mscclpp::CuError if cmd does not return CUDA_SUCCESS.

Parameters:
  • cmd – The command to execute.

Device-Side Interfaces

These device-side handle structures provide GPU kernel interfaces for MSCCL++ communication primitives. They are designed to be used directly in CUDA/HIP device code.

Channel Device Interfaces

struct BaseMemoryChannelDeviceHandle

Device-side handle of a MemoryChannel without specific source and destination.

Subclassed by mscclpp::MemoryChannelDeviceHandle

Public Functions

inline void signal()

Signal the remote semaphore.

This function guarantees that all the memory operation before this function is completed before the remote semaphore is signaled.

inline void relaxedSignal()

Signal the remote semaphore.

This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.

inline bool poll()

Check if the remote semaphore has signaled.

Returns:

true if the remote semaphore has signaled.

inline void wait(int64_t maxSpinCount = 10000000)

Wait for the remote semaphore to send a signal.

Parameters:

maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

inline void relaxedWait(int64_t maxSpinCount = 10000000)

Wait for the remote semaphore to send a signal.

This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.

Parameters:

maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

struct BasePortChannelDeviceHandle

Subclassed by mscclpp::PortChannelDeviceHandle

Public Functions

inline void put(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • dstOffset – The offset into the destination memory region.

  • srcId – The ID of source memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

inline void put(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • srcId – The ID of source memory region.

  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

inline void signal()

Push a TriggerFlag to the FIFO.

inline void putWithSignal(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • dstOffset – The offset into the destination memory region.

  • srcId – The ID of source memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

inline void putWithSignal(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • srcId – The ID of source memory region.

  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

inline void putWithSignalAndFlush(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • dstOffset – The offset into the destination memory region.

  • srcId – The ID of source memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void putWithSignalAndFlush(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:
  • dstId – The ID of destination memory region.

  • srcId – The ID of source memory region.

  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void flush(int64_t maxSpinCount = 1000000)

Push a TriggerSync to the FIFO.

Parameters:

maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline bool poll()

Check if the port channel has been signaled.

Returns:

true if the port channel has been signaled.

inline void wait(int64_t maxSpinCount = 10000000)

Wait for the port channel to be signaled.

Parameters:

maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Warning

doxygenunion: Cannot find union “mscclpp::ChannelTrigger” in doxygen xml output for project “mscclpp” from directory: ./doxygen/xml

union LL16Packet
#include <packet_device.hpp>

LL (low latency) protocol packet with 8 bytes of data and 8 bytes of flags.

Public Types

using Payload = uint2

Public Functions

LL16Packet() = default
inline LL16Packet(uint2 val, uint32_t flag)
inline void write(uint32_t val1, uint32_t val2, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:
  • val1 – The first 4-byte data to write.

  • val2 – The second 4-byte data to write.

  • flag – The flag to write.

inline void write(uint64_t val, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:
  • val – The 8-byte data to write.

  • flag – The flag to write.

inline void write(uint2 val, uint32_t flag)

Write 8 bytes of data to the packet.

Parameters:
  • val – The 8-byte data to write.

  • flag – The flag to write.

inline bool readOnce(uint32_t flag, uint2 &data) const

Helper of read().

Parameters:
  • flag – The flag to read.

  • data – The 8-byte data read.

Returns:

True if the flag is not equal to the given flag.

inline uint2 read(uint32_t flag, int64_t maxSpinCount = 100000000) const

Read 8 bytes of data from the packet. It will spin until the flag is equal to the given flag.

Parameters:
  • flag – The flag to read.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Returns:

The 8-byte data read.

inline void clear()

Clear the packet.

Public Members

uint32_t data1
uint32_t flag1
uint32_t data2
uint32_t flag2
struct mscclpp::LL16Packet
ulonglong2 raw_
union LL8Packet
#include <packet_device.hpp>

LL (low latency) protocol packet with 4 bytes of data and 4 bytes of flags.

Public Types

using Payload = uint32_t

Public Functions

LL8Packet() = default
inline LL8Packet(uint32_t val, uint32_t flag)
inline void write(uint32_t val, uint32_t flag)

Write 4 bytes of data to the packet.

Parameters:
  • val – The 4-byte data to write.

  • flag – The flag to write.

inline bool readOnce(uint32_t flag, uint32_t &data) const

Helper of read().

Parameters:
  • flag – The flag to read.

  • data – The 4-byte data read.

Returns:

True if the flag is not equal to the given flag.

inline uint32_t read(uint32_t flag, int64_t maxSpinCount = 1000000) const

Read 4 bytes of data from the packet. It will spin until the flag is equal to the given flag.

Parameters:
  • flag – The flag to read.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Returns:

The 4-byte data read.

inline void clear()

Clear the packet.

Public Members

uint32_t data
uint32_t flag
struct mscclpp::LL8Packet
uint64_t raw_
struct MemoryChannelDeviceHandle : public mscclpp::BaseMemoryChannelDeviceHandle

Device-side handle of a MemoryChannel.

Public Functions

template<typename T>
inline T read(uint64_t index)

Load a value from the remote memory.

Template Parameters:

T – The type of the value to be loaded.

Parameters:

index – The index of the value to be loaded. The offset in bytes is calculated as index * sizeof(T).

Returns:

The value loaded.

template<typename T>
inline void write(uint64_t index, const T &v)

Write a value to the remote memory.

Template Parameters:

T – The type of the value to be written.

Parameters:
  • index – The index of the value to be written. The offset in bytes is calculated as index * sizeof(T).

  • v – The value to be written.

template<int Alignment = 16, bool CopyRemainder = true>
inline void put(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Copy data from the local memory (origin) to the remote memory (target).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:
  • Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.

  • CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of Alignment.

Parameters:
  • targetOffset – The offset in bytes of the remote address. Should be a multiple of Alignment.

  • originOffset – The offset in bytes of the local address. Should be a multiple of Alignment.

  • originBytes – Bytes of the origin to be copied. Should be a multiple of Alignment.

  • threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.

  • numThreads – The total number of threads that run this function.

template<int Alignment = 16, bool CopyRemainder = true>
inline void put(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Wrapper of put() with the same offset for target and origin.

template<int Alignment = 16, bool CopyRemainder = true>
inline void get(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Copy data from the remote memory (origin) to the local memory (target).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:
  • Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.

  • CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of Alignment.

Parameters:
  • targetOffset – The offset in bytes of the local address. Should be a multiple of Alignment.

  • originOffset – The offset in bytes of the remote address. Should be a multiple of Alignment.

  • originBytes – Bytes of the origin to be copied. Should be a multiple of Alignment.

  • threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.

  • numThreads – The total number of threads that run this function.

template<int Alignment = 16, bool CopyRemainder = true>
inline void get(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads)

Wrapper of get() with the same offset for target and origin.

template<typename PacketType = LL16Packet>
inline void putPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag)

Copy data from the local memory (origin) to the remote memory (target) using packets.

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:
  • targetOffset – The offset in bytes of the remote address.

  • originOffset – The offset in bytes of the local address.

  • originBytes – Bytes of the origin to be copied.

  • threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.

  • numThreads – The total number of threads that run this function.

  • flag – The flag to write.

template<typename PacketType = LL16Packet>
inline void putPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag)

Wrapper of putPackets() with the same offset for target and origin.

template<typename PacketType = LL16Packet>
inline auto unpackPacket(uint64_t index, uint32_t flag, int64_t maxSpinCount = -1)

Retrieve data from a packet in the local packet buffer.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:
  • index – The index of the packet to be read. The offset in bytes is calculated as index * sizeof(PacketType).

  • flag – The flag to read.

  • maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

Returns:

The value read from the packet. The type of the value depends on the packet type.

template<typename PacketType = LL16Packet>
inline void unpackPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1)

Retrieve data from packets in the local packet buffer (target) and write to the local memory (origin).

This function is intended to be collectively called by multiple threads. Each thread copies a part of data.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:
  • targetOffset – The offset in bytes of the local packet buffer.

  • originOffset – The offset in bytes of the local address.

  • originBytes – Bytes of the origin to be copied.

  • threadId – The index of the current thread among all threads running this function. This is different from the threadIdx in CUDA.

  • numThreads – The total number of threads that run this function.

  • flag – The flag to write.

  • maxSpinCount – The maximum number of spins before asserting. Never assert if negative.

template<typename PacketType = LL16Packet>
inline void unpackPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1)

Wrapper of unpackPackets() with the same offset for target and origin.

struct PortChannelDeviceHandle : public mscclpp::BasePortChannelDeviceHandle

Public Functions

inline void put(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:
  • dstOffset – The offset into the destination memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

inline void put(uint64_t offset, uint64_t size)

Push a TriggerData to the FIFO.

Parameters:
  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

inline void putWithSignal(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:
  • dstOffset – The offset into the destination memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

inline void putWithSignal(uint64_t offset, uint64_t size)

Push a TriggerData and a TriggerFlag at the same time to the FIFO.

Parameters:
  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

inline void putWithSignalAndFlush(uint64_t dstOffset, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:
  • dstOffset – The offset into the destination memory region.

  • srcOffset – The offset into the source memory region.

  • size – The size of the transfer.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void putWithSignalAndFlush(uint64_t offset, uint64_t size)

Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.

Parameters:
  • offset – The common offset into the destination and source memory regions.

  • size – The size of the transfer.

struct SwitchChannelDeviceHandle

Device-side handle for SwitchChannel.

using mscclpp::LLPacket = LL16Packet
using mscclpp::MemoryId = uint32_t

Numeric ID of RegisteredMemory. ProxyService has an internal array indexed by these handles mapping to the actual.

using mscclpp::SemaphoreId = uint32_t

Numeric ID of Semaphore. ProxyService has an internal array indexed by these handles mapping to the actual semaphores.

Semaphore Device Interfaces

struct Host2DeviceSemaphoreDeviceHandle

Device-side handle for Host2DeviceSemaphore.

Public Functions

inline bool poll()

Poll if the host has signaled.

Returns:

true if the host has signaled.

inline void wait(int64_t maxSpinCount = 100000000)

Wait for the host to signal.

inline uint64_t loadExpectedInbound()

Thread-safe read of expected inbound value.

Returns:

The expected inbound value.

inline uint64_t incExpectedInbound()

Thread-safe increment of expected inbound value.

Returns:

The incremented expected inbound value.

inline uint64_t loadInbound()

Thread-safe read of inbound value.

Returns:

The inbound value.

Public Members

uint64_t *inboundToken

A local memory space where a host thread (on behalf of the remote device) will write its semaphore value and the local device will read it.

uint64_t *expectedInboundToken

A local memory space where the local device stores the expected value of the inboundToken to wait for.

struct MemoryDevice2DeviceSemaphoreDeviceHandle

Device-side handle for MemoryDevice2DeviceSemaphore.

Public Functions

inline bool poll()

Poll if remote device has signaled.

Returns:

true if remote device has signaled.

inline void wait(int64_t maxSpinCount = 100000000)

Wait for remote device to signal.

inline void relaxedWait(int64_t maxSpinCount = 100000000)

Relaxed wait; no memory completion guarantee. Use it only for synchronizing execution, not data.

inline void signal()

Signal remote device, ensures prior memory ops complete.

inline void relaxedSignal()

Relaxed signal; no memory completion guarantee. Use it only for synchronizing execution, not data.

inline uint64_t loadExpectedInbound()

Thread-safe read of expected inbound value.

Returns:

The expected inbound value.

inline uint64_t incExpectedInbound()

Thread-safe increment of expected inbound value.

Returns:

The incremented expected inbound value.

inline uint64_t loadInbound()

Thread-safe read of inbound value.

Returns:

The inbound value.

inline uint64_t loadInboundRelaxed()

Thread-safe read of inbound value without memory completion guarantee.

Returns:

The inbound value.

inline uint64_t loadOutbound()

Thread-safe read of outbound value.

Returns:

The outbound value.

inline uint64_t incOutbound()

Thread-safe increment of outbound value.

Returns:

The incremented outbound value.

Public Members

uint64_t *inboundToken

A local memory space where the remote device will write its semaphore value and the local device will read it.

uint64_t *outboundToken

A local memory space where the local device stores the semaphore value to be written to the remote device.

uint64_t *remoteInboundToken

A remote memory space where the local device writes its outboundToken on. This is inboundToken of the remote device.

uint64_t *expectedInboundToken

A local memory space where the local device stores the expected value of the inboundToken to wait for.

FIFO Device Interfaces

struct FifoDeviceHandle

Concurrent FIFO where multiple device threads (the number of threads should not exceed the FIFO size) to push Head pointer is on device, tail pointer is on host (readable by device). The FIFO’s capacity is limited only by MAX_UINT64—effectively infinite for practical use. Exceeding this limit will overflow the counter and lead to undefined behavior.

Public Functions

inline uint64_t push(ProxyTrigger trigger, int64_t maxSpinCount = 1000000)

Push a trigger to the FIFO.

Parameters:
  • trigger – Trigger to push.

  • maxSpinCount – Max spin count before assert. Never assert if negative.

Returns:

Previous head of the FIFO where the trigger was pushed.

inline void sync(uint64_t fifoHead, int64_t maxSpinCount = 1000000)

Wait until a specific trigger is popped from the FIFO.

Parameters:
  • fifoHead – FIFO head where the trigger was pushed.

  • maxSpinCount – Max spin count before assert. Never assert if negative.

Public Members

ProxyTrigger *triggers

FIFO buffer on host.

uint64_t *head

FIFO head on device.

uint64_t *tail

FIFO tail on host.

uint64_t *tailCache

Cached tail value.

int size

FIFO size.

Warning

doxygenstruct: Cannot find class “mscclpp::ProxyTrigger” in doxygen xml output for project “mscclpp” from directory: ./doxygen/xml

constexpr unsigned int mscclpp::TriggerBitsFifoReserved = 1
constexpr unsigned int mscclpp::TriggerBitsMemoryId = 9
constexpr unsigned int mscclpp::TriggerBitsOffset = 32
constexpr unsigned int mscclpp::TriggerBitsSemaphoreId = 10
constexpr unsigned int mscclpp::TriggerBitsSize = 32
constexpr unsigned int mscclpp::TriggerBitsType = 3
using mscclpp::TriggerType = uint64_t
constexpr TriggerType mscclpp::TriggerData = 0x1
constexpr TriggerType mscclpp::TriggerFlag = 0x2
constexpr TriggerType mscclpp::TriggerSync = 0x4

Device Utilities

struct DeviceSemaphore

A device-wide semaphore. This semaphore can be used to control access to a resource across multiple threads or blocks. It uses atomic operations to ensure that the semaphore value is updated correctly across threads. The semaphore value is an integer that can be set, acquired, and released.

Example usage of DeviceSemaphore:

__global__ void myKernel(mscclpp::DeviceSemaphore* semaphore) {
  // Initialize the semaphore (allow up to 3 threads access the resource simultaneously)
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    semaphore->set(3);
  }
  // Each block acquires the semaphore before accessing the shared resource
  if (threadIdx.x == 0) {
    semaphore->acquire();
  }
  __syncthreads();
  // Access the shared resource
  // ...
  __syncthreads();
  // Release the semaphore after accessing the shared resource
  if (threadIdx.x == 0) {
    semaphore->release();
  }
}

Public Functions

DeviceSemaphore() = default

Construct a new DeviceSemaphore object.

inline DeviceSemaphore(int initialValue)
Parameters:

initialValue – The initial value of the semaphore.

~DeviceSemaphore() = default

Destroy the DeviceSemaphore object.

inline void set(int value)

Set the semaphore value. This function is used to initialize or reset the semaphore value. The initial value is typically set to a positive integer to allow acquiring the semaphore.

Parameters:

value – The value to set.

inline void acquire(int maxSpinCount = -1)

Acquire the semaphore (decrease the semaphore value by 1). This function will spin until the semaphore is acquired or the maximum spin count is reached.

Parameters:

maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

inline void release()

Release the semaphore (increase the semaphore value by 1).

struct DeviceSyncer

A device-wide barrier. This barrier can be used to synchronize multiple thread blocks within a kernel. It uses atomic operations to ensure that all threads in the same kernel reach the barrier before proceeding and they can safely read data written by other threads in different blocks.

Example usage of DeviceSyncer:

__global__ void myKernel(mscclpp::DeviceSyncer* syncer, int numBlocks) {
  // Do some work here
  // ...
  // Synchronize all blocks
  syncer->sync(numBlocks);
  // All blocks have reached this point
  // ...
}

Public Functions

DeviceSyncer() = default

Construct a new DeviceSyncer object.

~DeviceSyncer() = default

Destroy the DeviceSyncer object.

inline void sync(int blockNum, int64_t maxSpinCount = 100000000)

Synchronize all threads inside a kernel. Guarantee that all previous work of all threads in cooperating blocks is finished.

Parameters:
  • blockNum – The number of blocks that will synchronize.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

Public Static Attributes

static const unsigned int NumCounters = 3U

The number of sync counters.

union VectorType
#include <gpu_data_types.hpp>

Vector type.

Public Types

using ElementType = T

Public Functions

inline VectorType()
inline operator T*()
inline operator const T*() const
inline T &operator[](int i)
inline const T &operator[](int i) const

Public Members

T data[N]
Words<sizeof(T) * N> words

Public Static Attributes

static constexpr int Size = N
template<int Bytes>
struct Words

Word array.

template<int Alignment = 16, bool CopyRemainder = true>
void mscclpp::copy(void *dst, void *src, uint64_t bytes, uint32_t threadId, uint32_t numThreads)

Copy data from the source memory to the destination memory.

This function is intended to be collectively called by multiple threads. Each thread copies a part of elements.

Note

The source and destination addresses do not have to be aligned to the Alignment value, but the misalignment to Alignment should be multiple of 4 bytes and should be the same for both source and destination addresses. The behavior of this function is undefined otherwise.

Note

The number of bytes to be copied should be a multiple of 4 bytes. If the number of bytes is not a multiple of 4 bytes, the remainder bytes will not be copied.

Template Parameters:
  • Alignment – The alignment of the data to be copied. A larger alignment value is more likely to achieve higher copying throughput. Should be one of 4, 8, or 16.

  • CopyRemainder – If false, the function will not copy data that is unaligned to the Alignment value. If true, the function will try to copy the unaligned data with conditions (see the notes).

Parameters:
  • dst – The destination address.

  • src – The source address.

  • bytes – Bytes of the data to be copied. Should be a multiple of 4 bytes.

  • threadId – The index of the current thread among all threads running this function. Should be less than numThreads.

  • numThreads – The total number of threads that run this function.

template<typename PacketType = LL16Packet>
void mscclpp::copyFromPackets(void *originPtr, const void *targetPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1)

Read packets from the target buffer and write retrieved data to the origin.

This function is intended to be collectively called by multiple threads. Each thread reads a part of packets.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:
  • originPtr – The origin buffer.

  • targetPtr – The target buffer.

  • originBytes – The number of bytes to read from the origin buffer.

  • threadId – The index of the current thread among all threads running this function. Should be less than numThreads.

  • numThreads – The total number of threads that run this function.

  • flag – The flag to write in the packets.

  • maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.

template<typename PacketType = LL16Packet>
void mscclpp::copyToPackets(void *targetPtr, const void *originPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag)

Read data from the origin and write packets to the target buffer.

This function is intended to be collectively called by multiple threads. Each thread copies a part of packets.

Template Parameters:

PacketType – The packet type. It should be either LL16Packet or LL8Packet.

Parameters:
  • targetPtr – The target buffer.

  • originPtr – The origin buffer.

  • originBytes – The number of bytes to write to the target buffer.

  • threadId – The index of the current thread among all threads running this function. Should be less than numThreads.

  • numThreads – The total number of threads that run this function.

  • flag – The flag to write in the packets.

Atomics

constexpr cuda::memory_order mscclpp::memoryOrderAcqRel = cuda::memory_order_acq_rel
constexpr cuda::memory_order mscclpp::memoryOrderAcquire = cuda::memory_order_acquire
constexpr cuda::memory_order mscclpp::memoryOrderRelaxed = cuda::memory_order_relaxed
constexpr cuda::memory_order mscclpp::memoryOrderRelease = cuda::memory_order_release
constexpr cuda::memory_order mscclpp::memoryOrderSeqCst = cuda::memory_order_seq_cst
constexpr cuda::thread_scope mscclpp::scopeDevice = cuda::thread_scope_device
constexpr cuda::thread_scope mscclpp::scopeSystem = cuda::thread_scope_system
template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
T mscclpp::atomicFetchAdd(T *ptr, const T &val, cuda::memory_order memoryOrder)
template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
T mscclpp::atomicLoad(T *ptr, cuda::memory_order memoryOrder)
template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
void mscclpp::atomicStore(T *ptr, const T &val, cuda::memory_order memoryOrder)

Vector Data Types

using mscclpp::bf16x2 = VectorType<__bfloat16, 2>
using mscclpp::bf16x4 = VectorType<__bfloat16, 4>
using mscclpp::bf16x8 = VectorType<__bfloat16, 8>
using mscclpp::f16x2 = VectorType<__half, 2>
using mscclpp::f16x4 = VectorType<__half, 4>
using mscclpp::f16x8 = VectorType<__half, 8>
using mscclpp::f32x1 = VectorType<float, 1>
using mscclpp::f32x2 = VectorType<float, 2>
using mscclpp::f32x4 = VectorType<float, 4>
using mscclpp::f64x1 = VectorType<double, 1>
using mscclpp::i32x1 = VectorType<int32_t, 1>
using mscclpp::i32x2 = VectorType<int32_t, 2>
using mscclpp::i32x4 = VectorType<int32_t, 4>
using mscclpp::u32x1 = VectorType<uint32_t, 1>
using mscclpp::u32x2 = VectorType<uint32_t, 2>
using mscclpp::u32x4 = VectorType<uint32_t, 4>

Macro Functions

MSCCLPP_ASSERT_DEVICE(__cond, __msg)

Assert a condition on the device and print a message if the condition is false. This macro does nothing in a release mode build (when DEBUG_BUILD is undefined).

OR_POLL_MAYBE_JAILBREAK(__cond1, __cond2, __max_spin_cnt)
POLL_MAYBE_JAILBREAK(__cond, __max_spin_cnt)