C++ API Reference
This reference organizes the MSCCL++ C++ API into two main categories: Host-Side Interfaces for CPU code and Device-Side Interfaces for GPU kernels. Components that are used in both host and device code are documented in the Device-Side Interfaces section.
Host-Side Interfaces
These are the interfaces used in CPU code to set up connections, manage memory, and coordinate operations.
Bootstrap and Process Coordination
-
class Bootstrap
Base class for bootstraps.
Subclassed by mscclpp::TcpBootstrap
Public Functions
-
inline Bootstrap()
Constructor.
-
virtual ~Bootstrap() = default
Destructor.
-
virtual int getRank() const = 0
Return the rank of the process.
- Returns:
The rank of the process.
-
virtual int getNranks() const = 0
Return the total number of ranks.
- Returns:
The total number of ranks.
-
virtual int getNranksPerNode() const = 0
Return the total number of ranks per node.
- Returns:
The total number of ranks per node.
-
virtual void send(void *data, int size, int peer, int tag) = 0
Send arbitrary data to another process.
Data sent via
send(senderBuff, size, receiverRank, tag)
can be received viarecv(receiverBuff, size,
senderRank, tag)
peer
andtag
will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using differenttag
values to prevent unexpected behavior.- Parameters:
data – The data to send.
size – The size of the data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.
-
virtual void recv(void *data, int size, int peer, int tag) = 0
Receive data sent from another process by send().
Data sent via
send(senderBuff, size, receiverRank, tag)
can be received viarecv(receiverBuff, size,
senderRank, tag)
peer
andtag
will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using differenttag
values to prevent unexpected behavior.- Parameters:
data – The buffer to write the received data to.
size – The size of the data to receive.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.
-
virtual void allGather(void *allData, int size) = 0
Gather data from all processes.
When called by rank
r
, this sends data fromallData[r * size]
toallData[(r + 1) * size - 1]
to all other ranks. The data sent by rankr
is received intoallData[r * size]
of other ranks.- Parameters:
allData – The buffer to write the received data to.
size – The size of the data each rank sends.
-
virtual void barrier() = 0
Synchronize all processes.
-
void groupBarrier(const std::vector<int> &ranks)
A partial barrier that synchronizes a group of ranks.
- Parameters:
ranks – The ranks to synchronize.
-
void send(const std::vector<char> &data, int peer, int tag)
Wrapper of send() that sends a vector of characters.
- Parameters:
data – The data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.
-
void recv(std::vector<char> &data, int peer, int tag)
Wrapper of recv() that receives a vector of characters.
Note
The data vector will be resized to the size of the received data.
- Parameters:
data – The buffer to write the received data to.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.
-
inline Bootstrap()
-
class TcpBootstrap : public mscclpp::Bootstrap
A native implementation of the bootstrap using TCP sockets.
Public Functions
-
TcpBootstrap(int rank, int nRanks)
Constructor.
- Parameters:
rank – The rank of the process.
nRanks – The total number of ranks.
-
~TcpBootstrap()
Destructor.
-
UniqueId getUniqueId() const
Return the unique ID stored in the TcpBootstrap.
- Returns:
The unique ID stored in the TcpBootstrap.
-
void initialize(UniqueId uniqueId, int64_t timeoutSec = 30)
Initialize the TcpBootstrap with a given unique ID. The unique ID can be generated by any method; it can be created by createUniqueId() or can be any arbitrary bit array provided by the user.
- Parameters:
uniqueId – The unique ID to initialize the TcpBootstrap with.
timeoutSec – The connection timeout in seconds.
-
void initialize(const std::string &ifIpPortTrio, int64_t timeoutSec = 30)
Initialize the TcpBootstrap with a string formatted as “ip:port” or “interface:ip:port”.
- Parameters:
ifIpPortTrio – The string formatted as “ip:port” or “interface:ip:port”.
timeoutSec – The connection timeout in seconds.
-
virtual int getRank() const override
Return the rank of the process.
-
virtual int getNranks() const override
Return the total number of ranks.
-
virtual int getNranksPerNode() const override
Return the total number of ranks per node.
-
virtual void send(void *data, int size, int peer, int tag) override
Send arbitrary data to another process.
Data sent via
send(senderBuff, size, receiverRank, tag)
can be received viarecv(receiverBuff, size,
senderRank, tag)
peer
andtag
will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using differenttag
values to prevent unexpected behavior.- Parameters:
data – The data to send.
size – The size of the data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.
-
virtual void recv(void *data, int size, int peer, int tag) override
Receive data sent from another process by send().
Data sent via
send(senderBuff, size, receiverRank, tag)
can be received viarecv(receiverBuff, size,
senderRank, tag)
peer
andtag
will be ordered by the order of calls, corresponding to the order of recv() calls on the receiving side. In cases where the execution order of multiple send()s or recv()s between two ranks is unknown, they should be differentiated by using differenttag
values to prevent unexpected behavior.- Parameters:
data – The buffer to write the received data to.
size – The size of the data to receive.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.
-
virtual void allGather(void *allData, int size) override
Gather data from all processes.
When called by rank
r
, this sends data fromallData[r * size]
toallData[(r + 1) * size - 1]
to all other ranks. The data sent by rankr
is received intoallData[r * size]
of other ranks.- Parameters:
allData – The buffer to write the received data to.
size – The size of the data each rank sends.
-
void broadcast(void *data, int size, int root)
Broadcast data from the root process to all processes using a ring-based algorithm.
When called by the root rank, this sends the
size
bytes starting at memory locationdata
to all other ranks. Non-root ranks receive these bytes into their owndata
buffer, overwriting its previous contents. The data propagates sequentially through a logical ring of processes until all ranks have received it.- Parameters:
data – Pointer to the send buffer (root) or receive buffer (non-root)
size – Number of bytes to broadcast
root – Rank initiating the broadcast
-
virtual void barrier() override
Synchronize all processes.
-
TcpBootstrap(int rank, int nRanks)
-
using mscclpp::UniqueId = std::array<uint8_t, UniqueIdBytes>
Unique ID for initializing the TcpBootstrap.
-
constexpr unsigned int mscclpp::UniqueIdBytes = 128
Connection Setup and Memory Management
-
class Connection
Connection between two processes.
Public Functions
Constructor.
- Parameters:
context – The context associated with the connection.
localEndpoint – The local endpoint of the connection.
-
virtual ~Connection() = default
Destructor.
-
virtual void write(RegisteredMemory dst, uint64_t dstOffset, RegisteredMemory src, uint64_t srcOffset, uint64_t size) = 0
Write data from a source RegisteredMemory to a destination RegisteredMemory.
- Parameters:
dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – The source RegisteredMemory.
srcOffset – The offset in bytes from the start of the source RegisteredMemory.
size – The number of bytes to write.
-
virtual void updateAndSync(RegisteredMemory dst, uint64_t dstOffset, uint64_t *src, uint64_t newValue) = 0
Update an 8-byte value in a destination RegisteredMemory and synchronize the change with the remote process.
- Parameters:
dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – A pointer to the value to update.
newValue – The new value to write.
-
virtual void flush(int64_t timeoutUsec = -1) = 0
Flush any pending writes to the remote process.
- Parameters:
timeoutUsec – Timeout in microseconds. Default: -1 (no timeout)
-
virtual Transport transport() const = 0
Get the transport used by the local process.
- Returns:
The transport used by the local process.
-
virtual Transport remoteTransport() const = 0
Get the transport used by the remote process.
- Returns:
The transport used by the remote process.
-
std::shared_ptr<Context> context() const
Get the context associated with this connection.
- Returns:
A shared pointer to the context associated with this connection.
-
const Device &localDevice() const
Get the device used by the local endpoint.
- Returns:
The device used by the local endpoint.
-
int getMaxWriteQueueSize() const
Get the maximum write queue size.
- Returns:
The maximum number of write requests that can be queued.
-
class Context : public std::enable_shared_from_this<Context>
Context for communication. This provides a low-level interface for forming connections in use-cases where the process group abstraction offered by Communicator is not suitable, e.g., ephemeral client-server connections. Correct use of this class requires external synchronization when finalizing connections with the connect() method.
As an example, a client-server scenario where the server will write to the client might proceed as follows:
The client creates an endpoint with createEndpoint() and sends it to the server.
The server receives the client endpoint, creates its own endpoint with createEndpoint(), sends it to the client, and creates a connection with connect().
The client receives the server endpoint, creates a connection with connect() and sends a RegisteredMemory to the server.
The server receives the RegisteredMemory and writes to it using the previously created connection. The client waiting to create a connection before sending the RegisteredMemory ensures that the server cannot write to the RegisteredMemory before the connection is established.
While some transports may have more relaxed implementation behavior, this should not be relied upon.
Public Functions
-
~Context()
Destructor.
-
RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)
Register a region of GPU memory for use in this context.
- Parameters:
ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.
- Returns:
A RegisteredMemory object representing the registered memory region.
-
Endpoint createEndpoint(EndpointConfig config)
Create an endpoint for establishing connections.
- Parameters:
config – The configuration for the endpoint.
- Returns:
The newly created endpoint.
-
std::shared_ptr<Connection> connect(const Endpoint &localEndpoint, const Endpoint &remoteEndpoint)
Establish a connection between two endpoints. While this method immediately returns a connection object, the connection is only safe to use after the corresponding connection on the remote endpoint has been established. This method must be called on both endpoints to establish a connection.
- Parameters:
localEndpoint – The local endpoint.
remoteEndpoint – The remote endpoint.
- Returns:
A shared pointer to the connection.
-
class Communicator
A class that sets up all registered memories and connections between processes.
A typical way to use this class:
Call connect() to declare connections between the calling process and other processes.
Call registerMemory() to register memory regions that will be used for communication.
Call sendMemory() or recvMemory() to send/receive registered memory regions to/from other processes.
Call get() on futures returned by connect(). Use the returned connections to create flags.
Call buildSemaphore() to create a Semaphore out of the flags.
Call get() on all futures returned by buildSemaphore() and recvMemory().
All done; use semaphores and registered memories to build channels.
CAUTION: The order of method calls matters when the same remote rank and tags are used. That is, the i-th “sending” method call (sendMemory(), connect(), and buildSemaphore()) on the local rank must be matched by the i-th “receiving” method call (recvMemory(), connect(), and buildSemaphore()) on the remote rank.
Correct Example 1:
// Rank 0 communicator.sendMemory(memory1, 1, tag); communicator.sendMemory(memory2, 1, tag); auto connection = communicator.connect(Transport::CudaIpc, 1, tag); connection.get(); // This will return the connection. // Rank 1 auto mem1 = communicator.recvMemory(0, tag); auto mem2 = communicator.recvMemory(0, tag); auto connection = communicator.connect(Transport::CudaIpc, 0, tag); mem2.get(); // This will return memory2. connection.get(); // This will return the connection. mem1.get(); // This will return memory1.
Correct Example 2:
// Rank 0 communicator.sendMemory(memory0, 1, tag); auto mem1 = communicator.recvMemory(1, tag); auto connection = communicator.connect(Transport::CudaIpc, 1, tag); connection.get(); // This will return the connection. mem1.get(); // This will return memory1. // Rank 1 auto mem0 = communicator.recvMemory(0, tag); communicator.sendMemory(memory1, 0, tag); auto connection = communicator.connect(Transport::CudaIpc, 0, tag); mem0.get(); // This will return memory0. connection.get(); // This will return the connection.
Wrong Example:
In the wrong example, the connection information from rank 1 will be sent to the// Rank 0 communicator.sendMemory(memory0, 1, tag); auto mem1 = communicator.recvMemory(1, tag); auto connection = communicator.connect(Transport::CudaIpc, 1, tag); // Rank 1 auto mem0 = communicator.recvMemory(0, tag); auto connection = communicator.connect(Transport::CudaIpc, 0, tag); // undefined behavior communicator.sendMemory(memory1, 0, tag);
mem1
object on rank 0, where the object type is RegisteredMemory, not Connection.Public Functions
Initializes the communicator with a given bootstrap implementation.
- Parameters:
bootstrap – An implementation of the Bootstrap that the communicator will use.
context – An optional context to use for the communicator. If not provided, a new context will be created.
-
~Communicator()
Destroy the communicator.
-
std::shared_ptr<Bootstrap> bootstrap()
Returns the bootstrap held by this communicator.
- Returns:
The bootstrap held by this communicator.
-
std::shared_ptr<Context> context()
Returns the context held by this communicator.
- Returns:
The context held by this communicator.
-
RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)
Register a region of GPU memory for use in this communicator’s context.
- Parameters:
ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.
- Returns:
A RegisteredMemory object representing the registered memory region.
-
void sendMemory(RegisteredMemory memory, int remoteRank, int tag = 0)
Send information of a registered memory to the remote side.
The send will be started upon calling this function, but this function returns immediately without waiting for the completion of the send. RegisteredMemory sent via
sendMemory(memory, remoteRank, tag)
can be received viarecvMemory(remoteRank, tag)
.Multiple calls to either sendMemory() or connect() with the same
remoteRank
andtag
will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using differenttag
values to prevent unexpected behavior.- Parameters:
memory – The registered memory buffer to send information about.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send.
-
std::shared_future<RegisteredMemory> recvMemory(int remoteRank, int tag = 0)
Receive memory information from a corresponding sendMemory call on the remote side.
This function returns a future immediately. The actual receive will be performed upon calling the first get() on the future. RegisteredMemory sent via
sendMemory(memory, remoteRank, tag)
can be received viarecvMemory(remoteRank, tag)
.Multiple calls to either sendMemory() or connect() with the same
remoteRank
andtag
will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using differenttag
values to prevent unexpected behavior.Note
To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same
remoteRank
andtag
. For example, if we call recvMemory() or connect() five times with the sameremoteRank
andtag
and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.- Parameters:
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the receive.
- Returns:
A future of registered memory.
-
std::shared_future<std::shared_ptr<Connection>> connect(EndpointConfig localConfig, int remoteRank, int tag = 0)
Connect to a remote rank.
This function will start (but not wait for) sending metadata about the local endpoint to the remote rank, and return a future connection without waiting for the remote rank to respond. The connection will be established when the remote rank responds with its own endpoint and the local rank calls the first get() on the future. Note that this function is two-way and a connection from rank
i
to remote rankj
needs to have a counterpart from rankj
to ranki
. Note that with IB, buffers are registered at a page level and if a buffer is spread through multiple pages and does not fully utilize all of them, IB’s QP has to register for all involved pages. This potentially has security risks if the connection’s accesses are given to a malicious process.Multiple calls to either sendMemory() or connect() with the same
remoteRank
andtag
will be ordered by the order of calls, corresponding to the order of recvMemory() or connect() calls on the receiving side. In cases where the execution order is unknown between two ranks, they should be differentiated by using differenttag
values to prevent unexpected behavior.Note
To guarantee the receiving order, calling get() on a future returned by recvMemory() or connect() may start receiving other RegisteredMemory or Connection objects of which futures were returned by an earlier call to recvMemory() or connect() with the same
remoteRank
andtag
. For example, if we call recvMemory() or connect() five times with the sameremoteRank
andtag
and then call get() on the last future, it will start receiving the five RegisteredMemory or Connection objects in order, back to back.- Parameters:
localConfig – The configuration for the local endpoint.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send and receive.
- Returns:
A future of shared pointer to the connection.
Build a semaphore for cross-process synchronization.
- Parameters:
connection – The connection associated with this semaphore.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the operation.
- Returns:
A future of the built semaphore.
-
int remoteRankOf(const Connection &connection)
Get the remote rank a connection is connected to.
- Parameters:
connection – The connection to get the remote rank for.
- Returns:
The remote rank the connection is connected to.
-
int tagOf(const Connection &connection)
Get the tag a connection was made with.
- Parameters:
connection – The connection to get the tag for.
- Returns:
The tag the connection was made with.
-
struct Device
Declaration of a device.
Public Functions
-
Device() = default
Constructor.
-
inline Device(DeviceType type, int id = -1)
Constructor.
-
Device() = default
-
class Endpoint
One end of a connection.
Public Functions
-
Endpoint() = default
Constructor.
-
uint64_t hostHash() const
Get the host hash.
- Returns:
The host hash.
-
uint64_t pidHash() const
Get the process ID hash.
- Returns:
The process ID hash.
-
int maxWriteQueueSize() const
Get the maximum write queue size.
- Returns:
The maximum number of write requests that can be queued.
-
Endpoint() = default
-
struct EndpointConfig
Used to configure an endpoint.
Public Functions
-
inline EndpointConfig(Transport transport = Transport::Unknown, Device device = DeviceType::GPU, int ibMaxCqSize = DefaultMaxCqSize, int ibMaxCqPollNum = DefaultMaxCqPollNum, int ibMaxSendWr = DefaultMaxSendWr, int ibMaxWrPerSend = DefaultMaxWrPerSend, int maxWriteQueueSize = -1)
Constructor that takes a transport and sets the other fields to their default values.
- Parameters:
transport – The transport to use.
device – The device to use.
ibMaxCqSize – The maximum completion queue size.
ibMaxCqPollNum – The maximum completion queue poll number.
ibMaxSendWr – The maximum send work requests.
ibMaxWrPerSend – The maximum work requests per send.
maxWriteQueueSize – The maximum write queue size.
-
inline EndpointConfig(Transport transport = Transport::Unknown, Device device = DeviceType::GPU, int ibMaxCqSize = DefaultMaxCqSize, int ibMaxCqPollNum = DefaultMaxCqPollNum, int ibMaxSendWr = DefaultMaxSendWr, int ibMaxWrPerSend = DefaultMaxWrPerSend, int maxWriteQueueSize = -1)
-
class NvlsConnection
Public Functions
-
SwitchChannel bindAllocatedMemory(CUdeviceptr devicePtr, size_t size)
Bind the memory allocated via mscclpp::GpuBuffer to the multicast handle. The behavior is undefined if the devicePtr is not allocated by mscclpp::GpuBuffer.
- Parameters:
devicePtr – The device pointer returned by
mscclpp::GpuBuffer::data()
.size – The bytes of the memory to bind to the multicast handle.
- Returns:
SwitchChannel with devicePtr, mcPtr and bufferSize
-
SwitchChannel bindAllocatedMemory(CUdeviceptr devicePtr, size_t size)
-
class RegisteredMemory
Block of memory that has been registered to a Context. RegisteredMemory does not own the memory it points to, but it provides a way to transfer metadata about the memory to other processes, hence allowing their access to the memory block.
Public Functions
-
RegisteredMemory() = default
Constructor.
-
~RegisteredMemory()
Destructor.
-
void *data() const
Get a pointer to the memory block.
- Returns:
A pointer to the memory block.
-
void *originalDataPtr() const
Get a pointer to the original memory block.
- Returns:
A pointer to the original memory block.
-
size_t size() const
Get the size of the memory block.
- Returns:
The size of the memory block.
-
TransportFlags transports() const
Get the transport flags associated with the memory block.
- Returns:
The transport flags associated with the memory block.
-
std::vector<char> serialize() const
Serialize the RegisteredMemory object to a vector of characters.
- Returns:
A vector of characters representing the serialized RegisteredMemory object.
Public Static Functions
-
static RegisteredMemory deserialize(const std::vector<char> &data)
Deserialize a RegisteredMemory object from a vector of characters.
- Parameters:
data – A vector of characters representing a serialized RegisteredMemory object.
- Returns:
A deserialized RegisteredMemory object.
-
RegisteredMemory() = default
-
class TransportFlags : private detail::TransportFlagsBase
Stores transport flags.
Public Functions
-
TransportFlags() = default
Constructor.
-
TransportFlags(Transport transport)
Constructor.
- Parameters:
transport – The transport to set the flag for.
-
bool has(Transport transport) const
Check if a specific transport flag is set.
- Parameters:
transport – The transport to check the flag for.
- Returns:
True if the flag is set, false otherwise.
-
bool none() const
Check if no transport flags are set.
- Returns:
True if no flags are set, false otherwise.
-
bool any() const
Check if any transport flags are set.
- Returns:
True if any flags are set, false otherwise.
-
bool all() const
Check if all transport flags are set.
- Returns:
True if all flags are set, false otherwise.
-
size_t count() const
Get the number of transport flags that are set.
- Returns:
The number of flags that are set.
-
TransportFlags &operator|=(TransportFlags other)
Bitwise OR assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the OR operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator|(TransportFlags other) const
Bitwise OR operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the OR operation with.
- Returns:
A new TransportFlags object with the result of the OR operation.
-
TransportFlags operator|(Transport transport) const
Bitwise OR operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the OR operation with.
- Returns:
A new TransportFlags object with the result of the OR operation.
-
TransportFlags &operator&=(TransportFlags other)
Bitwise AND assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the AND operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator&(TransportFlags other) const
Bitwise AND operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the AND operation with.
- Returns:
A new TransportFlags object with the result of the AND operation.
-
TransportFlags operator&(Transport transport) const
Bitwise AND operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the AND operation with.
- Returns:
A new TransportFlags object with the result of the AND operation.
-
TransportFlags &operator^=(TransportFlags other)
Bitwise XOR assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the XOR operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator^(TransportFlags other) const
Bitwise XOR operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the XOR operation with.
- Returns:
A new TransportFlags object with the result of the XOR operation.
-
TransportFlags operator^(Transport transport) const
Bitwise XOR operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the XOR operation with.
- Returns:
A new TransportFlags object with the result of the XOR operation.
-
TransportFlags operator~() const
Bitwise NOT operator for TransportFlags.
- Returns:
A new TransportFlags object with the result of the NOT operation.
-
bool operator==(TransportFlags other) const
Equality comparison operator for TransportFlags.
- Parameters:
other – The other TransportFlags to compare with.
- Returns:
True if the two TransportFlags objects are equal, false otherwise.
-
bool operator!=(TransportFlags other) const
Inequality comparison operator for TransportFlags.
- Parameters:
other – The other TransportFlags to compare with.
- Returns:
True if the two TransportFlags objects are not equal, false otherwise.
-
detail::TransportFlagsBase toBitset() const
Convert the TransportFlags object to a bitset representation.
- Returns:
A detail::TransportFlagsBase object representing the TransportFlags object.
-
TransportFlags() = default
-
enum class mscclpp::DeviceType
Available device types.
Values:
-
enumerator Unknown
-
enumerator CPU
-
enumerator GPU
-
enumerator Unknown
-
enum class mscclpp::Transport
Enumerates the available transport types.
Values:
-
enumerator Unknown
-
enumerator CudaIpc
-
enumerator IB0
-
enumerator IB1
-
enumerator IB2
-
enumerator IB3
-
enumerator IB4
-
enumerator IB5
-
enumerator IB6
-
enumerator IB7
-
enumerator Ethernet
-
enumerator NumTransports
-
enumerator Unknown
Connect to NVLS on setup.
This function used to connect to NVLS on setup. NVLS collective using multicast operations to send/recv data. Here we need to put all involved ranks into the collective group.
- Parameters:
comm – The communicator.
allRanks – The ranks of all processes involved in the collective.
config – The configuration for the local endpoint.
- Returns:
std::shared_ptr<NvlsConnection> A shared pointer to the NVLS connection.
Semaphores
-
class Host2DeviceSemaphore
A semaphore for sending signals from the host to the device.
Public Types
-
using DeviceHandle = Host2DeviceSemaphoreDeviceHandle
Device-side handle for Host2DeviceSemaphore.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore.
-
std::shared_ptr<Connection> connection() const
Returns the connection.
- Returns:
The connection associated with this semaphore.
-
void signal()
Signal the device.
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
-
using DeviceHandle = Host2DeviceSemaphoreDeviceHandle
-
class Host2HostSemaphore
A semaphore for sending signals from the local host to a remote host.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore. Transport::CudaIpc is not allowed for Host2HostSemaphore.
-
std::shared_ptr<Connection> connection() const
Returns the connection.
- Returns:
The connection associated with this semaphore.
-
void signal()
Signal the remote host.
-
bool poll()
Check if the remote host has signaled.
- Returns:
true if the remote host has signaled.
-
void wait(int64_t maxSpinCount = 10000000)
Wait for the remote host to signal.
- Parameters:
maxSpinCount – The maximum number of spin counts before throwing an exception. Never throws if negative.
-
class MemoryDevice2DeviceSemaphore
A semaphore for sending signals from the local device to a peer device via a GPU thread.
Public Types
-
using DeviceHandle = MemoryDevice2DeviceSemaphoreDeviceHandle
Device-side handle for MemoryDevice2DeviceSemaphore.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore.
-
std::shared_ptr<Connection> connection() const
Returns the connection.
- Returns:
The connection associated with this semaphore.
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
-
using DeviceHandle = MemoryDevice2DeviceSemaphoreDeviceHandle
-
class Semaphore
Semaphore used by channels for synchronization.
Public Functions
-
Semaphore() = default
Constructor.
-
Semaphore(const SemaphoreStub &localStub, const SemaphoreStub &remoteStub)
Constructor.
- Parameters:
localStub – SemaphoreStub allocated on the local process.
remoteStub – SemaphoreStub allocated on the remote process.
-
std::shared_ptr<Connection> connection() const
Get the connection associated with this semaphore.
- Returns:
A shared pointer to the connection.
-
const RegisteredMemory &localMemory() const
Get the local memory associated with this semaphore.
- Returns:
A reference to the local registered memory.
-
const RegisteredMemory &remoteMemory() const
Get the remote memory associated with this semaphore.
- Returns:
A reference to the remote registered memory.
-
Semaphore() = default
-
class SemaphoreStub
SemaphoreStub object only used for constructing Semaphore, not for direct use by the user.
Public Functions
Constructor.
- Parameters:
connection – A shared pointer to the connection associated with this semaphore.
-
const RegisteredMemory &memory() const
Get the memory associated with this semaphore.
- Returns:
A reference to the registered memory for this semaphore.
-
std::vector<char> serialize() const
Serialize into a vector of characters.
- Returns:
A vector of characters representing the serialized SemaphoreStub object.
Public Static Functions
-
static SemaphoreStub deserialize(const std::vector<char> &data)
Deserialize a SemaphoreStub object from a vector of characters.
- Parameters:
data – A vector of characters representing a serialized SemaphoreStub object.
- Returns:
A deserialized SemaphoreStub object.
Channels
-
struct BaseMemoryChannel
Memory channel without specifying source/destination memory regions.
Subclassed by mscclpp::MemoryChannel
Public Types
-
using DeviceHandle = BaseMemoryChannelDeviceHandle
Device-side handle for BaseMemoryChannel.
Public Functions
-
BaseMemoryChannel() = default
Constructor.
Constructor.
- Parameters:
semaphore – Semaphore used to synchronize the communication.
-
BaseMemoryChannel(const Semaphore &semaphore)
Constructor.
- Parameters:
semaphore – Semaphore used to synchronize the communication.
-
BaseMemoryChannel(const BaseMemoryChannel &other) = default
Constructor.
- Parameters:
other – Other BaseMemoryChannel to copy from.
-
DeviceHandle deviceHandle() const
Returns the device-side handle. User should make sure the BaseMemoryChannel is not released when using the returned handle.
- Returns:
The device-side handle.
-
using DeviceHandle = BaseMemoryChannelDeviceHandle
-
struct BasePortChannel
Port channel without specifying source/destination memory regions.
Subclassed by mscclpp::PortChannel
Public Types
-
using DeviceHandle = BasePortChannelDeviceHandle
Device-side handle for BasePortChannel.
Public Functions
-
BasePortChannel() = default
Constructor.
Constructor.
- Parameters:
semaphoreId – The ID of the semaphore.
semaphore – The semaphore used to synchronize the communication.
proxy – The proxy used for communication.
Constructor.
- Parameters:
semaphoreId – The ID of the semaphore.
semaphore – The semaphore used to synchronize the communication.
proxy – The proxy used for communication.
-
BasePortChannel(const BasePortChannel &other) = default
Copy constructor.
- Parameters:
other – The other BasePortChannel to copy from.
-
BasePortChannel &operator=(BasePortChannel &other) = default
Assignment operator.
- Parameters:
other – The other BasePortChannel to assign from.
-
DeviceHandle deviceHandle() const
Returns the device-side handle. User should make sure the BasePortChannel is not released when using the returned handle.
- Returns:
The device-side handle.
-
using DeviceHandle = BasePortChannelDeviceHandle
-
struct MemoryChannel : public mscclpp::BaseMemoryChannel
Channel for accessing peer memory directly from GPU threads.
Public Types
-
using DeviceHandle = MemoryChannelDeviceHandle
Device-side handle for MemoryChannel.
Public Functions
-
MemoryChannel() = default
Constructor.
Constructor.
- Parameters:
semaphore – The semaphore used to synchronize the communication.
dst – Registered memory of the destination.
src – Registered memory of the source.
packetBuffer – A buffer used to store packets.
packetBuffer
is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.
-
MemoryChannel(const Semaphore &semaphore, RegisteredMemory dst, RegisteredMemory src, void *packetBuffer = nullptr)
Constructor.
- Parameters:
semaphore – The semaphore used to synchronize the communication.
dst – Registered memory of the destination.
src – Registered memory of the source.
packetBuffer – A buffer used to store packets.
packetBuffer
is optional and if it is nullptr, unpackPacket() and unpackPackets() methods are not available.
-
DeviceHandle deviceHandle() const
Returns the device-side handle. User should make sure the MemoryChannel is not released when using the returned handle.
- Returns:
The device-side handle.
-
using DeviceHandle = MemoryChannelDeviceHandle
-
struct PortChannel : public mscclpp::BasePortChannel
Port channel.
Public Types
-
using DeviceHandle = PortChannelDeviceHandle
Device-side handle for PortChannel.
Public Functions
-
PortChannel() = default
Constructor.
Constructor.
- Parameters:
semaphoreId – The ID of the semaphore.
semaphore – The semaphore.
proxy – The proxy.
dst – The destination memory region.
src – The source memory region.
Constructor.
- Parameters:
semaphoreId – The ID of the semaphore.
semaphore – The semaphore.
proxy – The proxy.
dst – The destination memory region.
src – The source memory region.
-
PortChannel(const PortChannel &other) = default
Copy constructor.
- Parameters:
other – The other PortChannel to copy from.
-
PortChannel &operator=(PortChannel &other) = default
Assignment operator.
- Parameters:
other – The other PortChannel to assign from.
-
DeviceHandle deviceHandle() const
Returns the device-side handle. User should make sure the PortChannel is not released when using the returned handle.
- Returns:
The device-side handle.
-
using DeviceHandle = PortChannelDeviceHandle
-
struct SwitchChannel
Proxy Service and FIFO Management
-
class BaseProxyService
Base class for proxy services. Proxy services are used to proxy data between devices.
Subclassed by mscclpp::ProxyService
-
class Fifo
Host-side proxy FIFO for device-produced work elements.
Public Functions
-
Fifo(int size = DEFAULT_FIFO_SIZE)
Constructor.
- Parameters:
size – Number of entries (default: DEFAULT_FIFO_SIZE).
-
~Fifo()
Destructor.
-
ProxyTrigger poll()
Poll and get the trigger at the head.
- Returns:
ProxyTrigger at the head of the FIFO.
-
void pop()
Remove the head trigger.
-
int size() const
Get FIFO size.
- Returns:
Number of entries in the FIFO.
-
FifoDeviceHandle deviceHandle() const
Get device-side FIFO handle.
- Returns:
FifoDeviceHandle for device access.
-
Fifo(int size = DEFAULT_FIFO_SIZE)
-
class Proxy
Host-side proxy for PortChannels.
Public Functions
-
Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE)
Constructor.
- Parameters:
handler – Handler for each FIFO trigger.
threadInit – Optional function run in proxy thread before FIFO consumption.
fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).
-
Proxy(ProxyHandler handler, int fifoSize = DEFAULT_FIFO_SIZE)
Constructor.
- Parameters:
handler – Handler for each FIFO trigger.
fifoSize – FIFO size (default: DEFAULT_FIFO_SIZE).
-
~Proxy()
Destructor. Stops proxy if running.
-
void start()
Start proxy.
-
void stop()
Stop proxy.
-
Proxy(ProxyHandler handler, std::function<void()> threadInit, int fifoSize = DEFAULT_FIFO_SIZE)
-
class ProxyService : public mscclpp::BaseProxyService
Proxy service implementation.
Public Functions
-
ProxyService(int fifoSize = DEFAULT_FIFO_SIZE)
Constructor.
- Parameters:
fifoSize – Size of the FIFO used by the proxy service (default: DEFAULT_FIFO_SIZE).
Build and add a semaphore to the proxy service.
- Parameters:
connection – The connection associated with the semaphore.
- Returns:
The ID of the semaphore.
-
SemaphoreId addSemaphore(const Semaphore &semaphore)
Add a semaphore to the proxy service.
- Parameters:
semaphore – The semaphore to be added
- Returns:
The ID of the semaphore.
Add a semaphore to the proxy service.
- Parameters:
semaphore – The semaphore to be added
- Returns:
The ID of the semaphore.
-
MemoryId addMemory(RegisteredMemory memory)
Register a memory region with the proxy service.
- Parameters:
memory – The memory region to register.
- Returns:
The ID of the memory region.
-
std::shared_ptr<Host2DeviceSemaphore> semaphore(SemaphoreId id) const
Get a semaphore by ID.
- Parameters:
id – The ID of the semaphore.
- Returns:
The semaphore.
-
BasePortChannel basePortChannel(SemaphoreId id)
Get a base port channel by semaphore ID.
- Parameters:
id – The ID of the semaphore.
- Returns:
The base port channel.
-
PortChannel portChannel(SemaphoreId id, MemoryId dst, MemoryId src)
Get a port channel by semaphore ID and memory regions.
- Parameters:
id – The ID of the semaphore.
dst – The destination memory region.
src – The source memory region.
- Returns:
The port channel.
-
virtual void startProxy()
Start the proxy service.
-
virtual void stopProxy()
Stop the proxy service.
-
ProxyService(int fifoSize = DEFAULT_FIFO_SIZE)
-
using mscclpp::ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>
Handler function type for proxy.
-
enum class mscclpp::ProxyHandlerResult
Return values for ProxyHandler.
Values:
-
enumerator Continue
Move to next trigger in FIFO.
-
enumerator Stop
Stop and exit proxy.
-
enumerator Continue
-
constexpr size_t mscclpp::DEFAULT_FIFO_SIZE = 512
Utilities
-
struct AvoidCudaGraphCaptureGuard
A RAII guard that will cudaThreadExchangeStreamCaptureMode to cudaStreamCaptureModeRelaxed on construction and restore the previous mode on destruction. This is helpful when we want to avoid CUDA/HIP graph capture.
-
struct CudaStreamWithFlags
A RAII wrapper around cudaStream_t that will call cudaStreamDestroy on destruction.
Public Functions
-
CudaStreamWithFlags()
Constructor without flags. This will not create any stream. set() can be called later to create a stream with specified flags.
-
CudaStreamWithFlags(unsigned int flags)
Constructor with flags. This will create a stream with the specified flags on the current device.
- Parameters:
flags – The flags to create the stream with.
-
~CudaStreamWithFlags()
Destructor. This will destroy the stream if it was created.
-
void set(unsigned int flags)
Set the stream with the specified flags. The current device at the time of the construction will be used. If the stream was already created, it will raise an error with ErrorCode::InvalidUsage.
- Parameters:
flags – The flags to create the stream with.
- Throws:
Error – if the stream was already created.
-
bool empty() const
Check if the stream is empty (not created).
- Returns:
true if the stream is empty, false otherwise.
-
CudaStreamWithFlags()
-
template<class T = char>
class GpuBuffer Allocates a GPU memory space specialized for communication. The memory is zeroed out. Get the device pointer by
GpuBuffer::data()
.Use this function for communication buffers, i.e., only when other devices (CPU, GPU, NIC, etc.) may access this memory space at the same time with the local device (GPU). Running heavy computation over this memory space may perform bad and is not recommended in general.
The allocated memory space is managed by the
memory_
object, not by the class instance. Which means, the class destructor will NOT free the allocated memory ifmemory_
is shared with and alive in other contexts.- Template Parameters:
T – Type of each element in the allocated memory. Default is
char
.
Public Functions
-
inline GpuBuffer(size_t nelems)
Constructs a GpuBuffer with the specified number of elements.
- Parameters:
nelems – Number of elements to allocate. If it is zero,
data()
will return a null pointer.
-
inline size_t nelems() const
Returns the number of elements in the allocated memory.
- Returns:
The number of elements.
-
inline size_t bytes() const
Returns the number of bytes that is actually allocated. This may be larger than
nelems() * sizeof(T)
.- Returns:
The number of bytes.
-
inline std::shared_ptr<T> memory()
Returns the shared pointer to the allocated memory. If
nelems()
is zero, this function will return an empty shared pointer.- Returns:
A
std::shared_ptr
to the allocated memory.
-
inline T *data()
Returns the device pointer to the allocated memory. Equivalent to
memory().get()
. Ifnelems()
is zero, this function will return a null pointer.- Returns:
A device pointer to the allocated memory.
-
inline int deviceId() const
Returns the device id of the allocated memory.
- Returns:
The device id.
-
class GpuStream
A managed non-blocking GPU stream object created by GpuStreamPool. This object does not own the stream. When this object is destroyed, it will return the underlying CudaStreamWithFlags to the pool.
Public Functions
-
~GpuStream()
Destructor. This will return the underlying CudaStreamWithFlags to the pool, not destroy it.
-
~GpuStream()
-
class GpuStreamPool
A pool of managed GPU streams. Only provides non-blocking streams. This is intended to be used for reusing temporal streams.
-
int mscclpp::getDeviceNumaNode(int deviceId)
Return the NUMA node ID of the given GPU device ID.
- Parameters:
deviceId – The GPU device ID.
- Throws:
Error – if the device ID is invalid or if the NUMA node cannot be determined.
- Returns:
The NUMA node ID of the device.
-
std::string mscclpp::getHostName(int maxlen, const char delim)
Get the host name of the system.
- Parameters:
maxlen – The maximum length of the returned string.
delim – The delimiter to use for the host name; if the delimiter is found before maxlen characters, the host name will be truncated at that point.
- Throws:
Error – if it fails to retrieve the host name (error code: SystemError).
- Returns:
The host name of the system, truncated to maxlen characters if necessary, and split by the specified delimiter.
-
int mscclpp::getIBDeviceCount()
Get the number of available InfiniBand devices.
- Returns:
The number of available InfiniBand devices.
-
std::string mscclpp::getIBDeviceName(Transport ibTransport)
Get the name of the InfiniBand device associated with the specified transport.
- Parameters:
ibTransport – The InfiniBand transport to get the device name for.
- Returns:
The name of the InfiniBand device associated with the specified transport.
-
Transport mscclpp::getIBTransportByDeviceName(const std::string &ibDeviceName)
Get the InfiniBand transport associated with the specified device name.
- Parameters:
ibDeviceName – The name of the InfiniBand device to get the transport for.
- Returns:
The InfiniBand transport associated with the specified device name.
-
template<class T = char>
void mscclpp::gpuMemcpy(T *dst, const T *src, size_t nelems, cudaMemcpyKind kind = cudaMemcpyDefault) Copies memory from src to dst synchronously.
- Template Parameters:
T – Type of each element in the memory.
- Parameters:
dst – Destination address.
src – Source address.
nelems – Number of elements to copy.
kind – The kind of copy operation. Default is cudaMemcpyDefault.
-
template<class T = char>
void mscclpp::gpuMemcpyAsync(T *dst, const T *src, size_t nelems, cudaStream_t stream, cudaMemcpyKind kind = cudaMemcpyDefault) Copies memory from src to dst asynchronously.
- Template Parameters:
T – Type of each element in the memory.
- Parameters:
dst – Destination address.
src – Source address.
nelems – Number of elements to copy.
stream – The stream to use for the copy operation.
kind – The kind of copy operation. Default is cudaMemcpyDefault.
-
std::shared_ptr<GpuStreamPool> mscclpp::gpuStreamPool()
Get the singleton instance of GpuStreamPool.
- Returns:
A shared pointer to the GpuStreamPool instance.
-
bool mscclpp::isCuMemMapAllocated(void *ptr)
Check if ptr is allocaed by cuMemMap.
- Parameters:
ptr – The pointer to check.
- Returns:
True if the pointer is allocated by cuMemMap, false otherwise.
-
bool mscclpp::isNvlsSupported()
Check if NVLink SHARP (NVLS) is supported.
- Returns:
True if NVLink SHARP (NVLS) is supported, false otherwise.
Executor Interface
-
class ExecutionPlan
-
class Executor
Environment and Configuration
-
class Env
The MSCCL++ environment. The constructor reads environment variables and sets the corresponding fields. Use the env() function to get the environment object.
Public Members
-
const std::string debug
Env name:
MSCCLPP_DEBUG
. The debug flag, one of VERSION, WARN, INFO, ABORT, or TRACE. Unset by default.
-
const std::string debugSubsys
Env name:
MSCCLPP_DEBUG_SUBSYS
. The debug subsystem, a comma-separated list of subsystems to enable debug logging for. If the first character is ‘^’, it inverts the mask, i.e., enables all subsystems except those specified. Possible values are INIT, COLL, P2P, SHM, NET, GRAPH, TUNING, ENV, ALLOC, CALL, MSCCLPP_EXECUTOR, ALL. Unset by default.
-
const std::string debugFile
Env name:
MSCCLPP_DEBUG_FILE
. A file path to write debug logs to. Unset by default.
-
const std::string hcaDevices
Env name:
MSCCLPP_HCA_DEVICES
. A comma-separated list of HCA devices to use for IB transport. i-th device in the list will be used for the i-th GPU in the system. If unset, it will use ibverbs APIs to find the devices automatically.
-
const std::string hostid
Env name:
MSCCLPP_HOSTID
. A string that uniquely identifies the host. If unset, it will use the hostname. This is used to determine whether the host is the same across different processes.
-
const std::string socketFamily
Env name:
MSCCLPP_SOCKET_FAMILY
. The socket family to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). Possible values areAF_INET
(IPv4) andAF_INET6
(IPv6). If unset, it will not force any family and will use the first one found.
-
const std::string socketIfname
Env name:
MSCCLPP_SOCKET_IFNAME
. The interface name to use for TCP sockets (used by TcpBootstrap and the Ethernet transport). If unset, it will use the first interface found that matches the socket family.
-
const std::string executionPlanDir
Env name:
MSCCLPP_EXECUTION_PLAN_DIR
. The directory to find execution plans from. This should be set to use execution plans for the NCCL API. Unset by default.
-
const std::string npkitDumpDir
Env name:
MSCCLPP_NPKIT_DUMP_DIR
. The directory to dump NPKIT traces to. If this is set, NPKIT will be enabled and will dump traces to this directory. Unset by default.
-
const bool cudaIpcUseDefaultStream
Env name:
MSCCLPP_CUDAIPC_USE_DEFAULT_STREAM
. If set to true, the CUDA IPC transport will use the default stream for all operations. If set to false, it will use a separate stream for each operation. This is an experimental feature and should be false in most cases. Default is false.
Env name:
MSCCLPP_NCCL_LIB_PATH
. The path to the original NCCL/RCCL shared library. If set, it will be used as a fallback for NCCL operations in cases where the MSCCL++ NCCL cannot work.
-
const std::string forceNcclFallbackOperation
Env name:
MSCCLPP_FORCE_NCCL_FALLBACK_OPERATION
. A comma-separated list of NCCL operations that should always use the fallback implementation, even if the MSCCL++ NCCL can handle them. This is useful for debugging purposes. Currently supportsall
,broadcast
,allreduce
,reducescatter
, andallgather
.
-
const bool enableNcclFallback
Env name:
MSCCLPP_ENABLE_NCCL_FALLBACK
. If set to true, it will enable the fallback implementation for NCCL operations. This is useful for debugging purposes. Default is false.
-
const bool disableChannelCache
Env name:
MSCCLPP_DISABLE_CHANNEL_CACHE
. If set to true, it will disable the channel cache for NCCL APIs. Currently, this should be set to true if the application may call NCCL APIs on the same local buffer with different remote buffers, e.g., in the case of a dynamic communicator. If CUDA/HIP graphs are used, disabling the channel cache won’t affect the performance, but otherwise it may lead to performance degradation. Default is false.
-
const std::string debug
Error Handling
-
class BaseError : public std::runtime_error
Base class for all errors thrown by MSCCL++.
Subclassed by mscclpp::CuError, mscclpp::CudaError, mscclpp::Error, mscclpp::IbError, mscclpp::SysError
Public Functions
-
BaseError(const std::string &message, int errorCode)
Constructor of BaseError.
- Parameters:
message – The error message.
errorCode – The error code.
-
explicit BaseError(int errorCode)
Constructor of BaseError.
- Parameters:
errorCode – The error code.
-
int getErrorCode() const
Get the error code.
- Returns:
The error code.
-
const char *what() const noexcept override
Get the error message.
- Returns:
The error message.
-
BaseError(const std::string &message, int errorCode)
Version
-
std::string mscclpp::version()
Return a version string.
- Returns:
The MSCCL++ version string in “major.minor.patch” format.
Macro Functions
-
MSCCLPP_CUDATHROW(cmd)
Throw mscclpp::CudaError if
cmd
does not return cudaSuccess.- Parameters:
cmd – The command to execute.
-
MSCCLPP_CUTHROW(cmd)
Throw mscclpp::CuError if
cmd
does not return CUDA_SUCCESS.- Parameters:
cmd – The command to execute.
Device-Side Interfaces
These device-side handle structures provide GPU kernel interfaces for MSCCL++ communication primitives. They are designed to be used directly in CUDA/HIP device code.
Channel Device Interfaces
-
struct BaseMemoryChannelDeviceHandle
Device-side handle of a MemoryChannel without specific source and destination.
Subclassed by mscclpp::MemoryChannelDeviceHandle
Public Functions
-
inline void signal()
Signal the remote semaphore.
This function guarantees that all the memory operation before this function is completed before the remote semaphore is signaled.
-
inline void relaxedSignal()
Signal the remote semaphore.
This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.
-
inline bool poll()
Check if the remote semaphore has signaled.
- Returns:
true if the remote semaphore has signaled.
-
inline void wait(int64_t maxSpinCount = 10000000)
Wait for the remote semaphore to send a signal.
- Parameters:
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.
-
inline void relaxedWait(int64_t maxSpinCount = 10000000)
Wait for the remote semaphore to send a signal.
This function is a relaxed version of signal() and provides no guarantee on the completion of memory operations. User requires to call proper fencing before using this function.
- Parameters:
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.
-
inline void signal()
-
struct BasePortChannelDeviceHandle
Subclassed by mscclpp::PortChannelDeviceHandle
Public Functions
-
inline void put(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)
Push a TriggerData to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
-
inline void put(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)
Push a TriggerData to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
-
inline void signal()
Push a TriggerFlag to the FIFO.
-
inline void putWithSignal(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)
Push a TriggerData and a TriggerFlag at the same time to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
-
inline void putWithSignal(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size)
Push a TriggerData and a TriggerFlag at the same time to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
-
inline void putWithSignalAndFlush(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)
Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
dstOffset – The offset into the destination memory region.
srcId – The ID of source memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline void putWithSignalAndFlush(MemoryId dstId, MemoryId srcId, uint64_t offset, uint64_t size, int64_t maxSpinCount = 1000000)
Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.
- Parameters:
dstId – The ID of destination memory region.
srcId – The ID of source memory region.
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline void flush(int64_t maxSpinCount = 1000000)
Push a TriggerSync to the FIFO.
- Parameters:
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline bool poll()
Check if the port channel has been signaled.
- Returns:
true if the port channel has been signaled.
-
inline void wait(int64_t maxSpinCount = 10000000)
Wait for the port channel to be signaled.
- Parameters:
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline void put(MemoryId dstId, uint64_t dstOffset, MemoryId srcId, uint64_t srcOffset, uint64_t size)
Warning
doxygenunion: Cannot find union “mscclpp::ChannelTrigger” in doxygen xml output for project “mscclpp” from directory: ./doxygen/xml
-
union LL16Packet
- #include <packet_device.hpp>
LL (low latency) protocol packet with 8 bytes of data and 8 bytes of flags.
Public Types
-
using Payload = uint2
Public Functions
-
LL16Packet() = default
-
inline LL16Packet(uint2 val, uint32_t flag)
-
inline void write(uint32_t val1, uint32_t val2, uint32_t flag)
Write 8 bytes of data to the packet.
- Parameters:
val1 – The first 4-byte data to write.
val2 – The second 4-byte data to write.
flag – The flag to write.
-
inline void write(uint64_t val, uint32_t flag)
Write 8 bytes of data to the packet.
- Parameters:
val – The 8-byte data to write.
flag – The flag to write.
-
inline void write(uint2 val, uint32_t flag)
Write 8 bytes of data to the packet.
- Parameters:
val – The 8-byte data to write.
flag – The flag to write.
-
inline bool readOnce(uint32_t flag, uint2 &data) const
Helper of read().
- Parameters:
flag – The flag to read.
data – The 8-byte data read.
- Returns:
True if the flag is not equal to the given flag.
-
inline uint2 read(uint32_t flag, int64_t maxSpinCount = 100000000) const
Read 8 bytes of data from the packet. It will spin until the flag is equal to the given flag.
- Parameters:
flag – The flag to read.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
- Returns:
The 8-byte data read.
-
inline void clear()
Clear the packet.
-
using Payload = uint2
-
union LL8Packet
- #include <packet_device.hpp>
LL (low latency) protocol packet with 4 bytes of data and 4 bytes of flags.
Public Types
-
using Payload = uint32_t
Public Functions
-
LL8Packet() = default
-
inline LL8Packet(uint32_t val, uint32_t flag)
-
inline void write(uint32_t val, uint32_t flag)
Write 4 bytes of data to the packet.
- Parameters:
val – The 4-byte data to write.
flag – The flag to write.
-
inline bool readOnce(uint32_t flag, uint32_t &data) const
Helper of read().
- Parameters:
flag – The flag to read.
data – The 4-byte data read.
- Returns:
True if the flag is not equal to the given flag.
-
inline uint32_t read(uint32_t flag, int64_t maxSpinCount = 1000000) const
Read 4 bytes of data from the packet. It will spin until the flag is equal to the given flag.
- Parameters:
flag – The flag to read.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
- Returns:
The 4-byte data read.
-
inline void clear()
Clear the packet.
-
using Payload = uint32_t
-
struct MemoryChannelDeviceHandle : public mscclpp::BaseMemoryChannelDeviceHandle
Device-side handle of a MemoryChannel.
Public Functions
-
template<typename T>
inline T read(uint64_t index) Load a value from the remote memory.
- Template Parameters:
T – The type of the value to be loaded.
- Parameters:
index – The index of the value to be loaded. The offset in bytes is calculated as index * sizeof(T).
- Returns:
The value loaded.
-
template<typename T>
inline void write(uint64_t index, const T &v) Write a value to the remote memory.
- Template Parameters:
T – The type of the value to be written.
- Parameters:
index – The index of the value to be written. The offset in bytes is calculated as index * sizeof(T).
v – The value to be written.
-
template<int Alignment = 16, bool CopyRemainder = true>
inline void put(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads) Copy data from the local memory (origin) to the remote memory (target).
This function is intended to be collectively called by multiple threads. Each thread copies a part of data.
- Template Parameters:
Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.
CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of
Alignment
.
- Parameters:
targetOffset – The offset in bytes of the remote address. Should be a multiple of
Alignment
.originOffset – The offset in bytes of the local address. Should be a multiple of
Alignment
.originBytes – Bytes of the origin to be copied. Should be a multiple of
Alignment
.threadId – The index of the current thread among all threads running this function. This is different from the
threadIdx
in CUDA.numThreads – The total number of threads that run this function.
-
template<int Alignment = 16, bool CopyRemainder = true>
inline void put(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads) Wrapper of put() with the same offset for target and origin.
-
template<int Alignment = 16, bool CopyRemainder = true>
inline void get(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads) Copy data from the remote memory (origin) to the local memory (target).
This function is intended to be collectively called by multiple threads. Each thread copies a part of data.
- Template Parameters:
Alignment – The alignment of the source and destination addresses. Should be 4, 8, or a multiple of 16.
CopyRemainder – Whether to copy remainder bytes when the number of bytes is not a multiple of
Alignment
.
- Parameters:
targetOffset – The offset in bytes of the local address. Should be a multiple of
Alignment
.originOffset – The offset in bytes of the remote address. Should be a multiple of
Alignment
.originBytes – Bytes of the origin to be copied. Should be a multiple of
Alignment
.threadId – The index of the current thread among all threads running this function. This is different from the
threadIdx
in CUDA.numThreads – The total number of threads that run this function.
-
template<int Alignment = 16, bool CopyRemainder = true>
inline void get(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads) Wrapper of get() with the same offset for target and origin.
-
template<typename PacketType = LL16Packet>
inline void putPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag) Copy data from the local memory (origin) to the remote memory (target) using packets.
This function is intended to be collectively called by multiple threads. Each thread copies a part of data.
- Template Parameters:
PacketType – The packet type. It should be either LL16Packet or LL8Packet.
- Parameters:
targetOffset – The offset in bytes of the remote address.
originOffset – The offset in bytes of the local address.
originBytes – Bytes of the origin to be copied.
threadId – The index of the current thread among all threads running this function. This is different from the
threadIdx
in CUDA.numThreads – The total number of threads that run this function.
flag – The flag to write.
-
template<typename PacketType = LL16Packet>
inline void putPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag) Wrapper of putPackets() with the same offset for target and origin.
-
template<typename PacketType = LL16Packet>
inline auto unpackPacket(uint64_t index, uint32_t flag, int64_t maxSpinCount = -1) Retrieve data from a packet in the local packet buffer.
- Template Parameters:
PacketType – The packet type. It should be either LL16Packet or LL8Packet.
- Parameters:
index – The index of the packet to be read. The offset in bytes is calculated as index * sizeof(PacketType).
flag – The flag to read.
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.
- Returns:
The value read from the packet. The type of the value depends on the packet type.
-
template<typename PacketType = LL16Packet>
inline void unpackPackets(uint64_t targetOffset, uint64_t originOffset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1) Retrieve data from packets in the local packet buffer (target) and write to the local memory (origin).
This function is intended to be collectively called by multiple threads. Each thread copies a part of data.
- Template Parameters:
PacketType – The packet type. It should be either LL16Packet or LL8Packet.
- Parameters:
targetOffset – The offset in bytes of the local packet buffer.
originOffset – The offset in bytes of the local address.
originBytes – Bytes of the origin to be copied.
threadId – The index of the current thread among all threads running this function. This is different from the
threadIdx
in CUDA.numThreads – The total number of threads that run this function.
flag – The flag to write.
maxSpinCount – The maximum number of spins before asserting. Never assert if negative.
-
template<typename PacketType = LL16Packet>
inline void unpackPackets(uint64_t offset, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1) Wrapper of unpackPackets() with the same offset for target and origin.
-
template<typename T>
-
struct PortChannelDeviceHandle : public mscclpp::BasePortChannelDeviceHandle
Public Functions
-
inline void put(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)
Push a TriggerData to the FIFO.
- Parameters:
dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
-
inline void put(uint64_t offset, uint64_t size)
Push a TriggerData to the FIFO.
- Parameters:
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
-
inline void putWithSignal(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)
Push a TriggerData and a TriggerFlag at the same time to the FIFO.
- Parameters:
dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
-
inline void putWithSignal(uint64_t offset, uint64_t size)
Push a TriggerData and a TriggerFlag at the same time to the FIFO.
- Parameters:
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
-
inline void putWithSignalAndFlush(uint64_t dstOffset, uint64_t srcOffset, uint64_t size, int64_t maxSpinCount = 1000000)
Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.
- Parameters:
dstOffset – The offset into the destination memory region.
srcOffset – The offset into the source memory region.
size – The size of the transfer.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline void putWithSignalAndFlush(uint64_t offset, uint64_t size)
Push a TriggerData, a TriggerFlag, and a TriggerSync at the same time to the FIFO.
- Parameters:
offset – The common offset into the destination and source memory regions.
size – The size of the transfer.
-
inline void put(uint64_t dstOffset, uint64_t srcOffset, uint64_t size)
-
struct SwitchChannelDeviceHandle
Device-side handle for SwitchChannel.
-
using mscclpp::LLPacket = LL16Packet
-
using mscclpp::MemoryId = uint32_t
Numeric ID of RegisteredMemory. ProxyService has an internal array indexed by these handles mapping to the actual.
-
using mscclpp::SemaphoreId = uint32_t
Numeric ID of Semaphore. ProxyService has an internal array indexed by these handles mapping to the actual semaphores.
Semaphore Device Interfaces
-
struct Host2DeviceSemaphoreDeviceHandle
Device-side handle for Host2DeviceSemaphore.
Public Functions
-
inline bool poll()
Poll if the host has signaled.
- Returns:
true if the host has signaled.
-
inline void wait(int64_t maxSpinCount = 100000000)
Wait for the host to signal.
-
inline uint64_t loadExpectedInbound()
Thread-safe read of expected inbound value.
- Returns:
The expected inbound value.
-
inline uint64_t incExpectedInbound()
Thread-safe increment of expected inbound value.
- Returns:
The incremented expected inbound value.
-
inline uint64_t loadInbound()
Thread-safe read of inbound value.
- Returns:
The inbound value.
Public Members
-
uint64_t *inboundToken
A local memory space where a host thread (on behalf of the remote device) will write its semaphore value and the local device will read it.
-
uint64_t *expectedInboundToken
A local memory space where the local device stores the expected value of the inboundToken to wait for.
-
inline bool poll()
-
struct MemoryDevice2DeviceSemaphoreDeviceHandle
Device-side handle for MemoryDevice2DeviceSemaphore.
Public Functions
-
inline bool poll()
Poll if remote device has signaled.
- Returns:
true if remote device has signaled.
-
inline void wait(int64_t maxSpinCount = 100000000)
Wait for remote device to signal.
-
inline void relaxedWait(int64_t maxSpinCount = 100000000)
Relaxed wait; no memory completion guarantee. Use it only for synchronizing execution, not data.
-
inline void signal()
Signal remote device, ensures prior memory ops complete.
-
inline void relaxedSignal()
Relaxed signal; no memory completion guarantee. Use it only for synchronizing execution, not data.
-
inline uint64_t loadExpectedInbound()
Thread-safe read of expected inbound value.
- Returns:
The expected inbound value.
-
inline uint64_t incExpectedInbound()
Thread-safe increment of expected inbound value.
- Returns:
The incremented expected inbound value.
-
inline uint64_t loadInbound()
Thread-safe read of inbound value.
- Returns:
The inbound value.
-
inline uint64_t loadInboundRelaxed()
Thread-safe read of inbound value without memory completion guarantee.
- Returns:
The inbound value.
-
inline uint64_t loadOutbound()
Thread-safe read of outbound value.
- Returns:
The outbound value.
-
inline uint64_t incOutbound()
Thread-safe increment of outbound value.
- Returns:
The incremented outbound value.
Public Members
-
uint64_t *inboundToken
A local memory space where the remote device will write its semaphore value and the local device will read it.
-
uint64_t *outboundToken
A local memory space where the local device stores the semaphore value to be written to the remote device.
-
uint64_t *remoteInboundToken
A remote memory space where the local device writes its outboundToken on. This is inboundToken of the remote device.
-
uint64_t *expectedInboundToken
A local memory space where the local device stores the expected value of the inboundToken to wait for.
-
inline bool poll()
FIFO Device Interfaces
-
struct FifoDeviceHandle
Concurrent FIFO where multiple device threads (the number of threads should not exceed the FIFO size) to push Head pointer is on device, tail pointer is on host (readable by device). The FIFO’s capacity is limited only by MAX_UINT64—effectively infinite for practical use. Exceeding this limit will overflow the counter and lead to undefined behavior.
Public Functions
-
inline uint64_t push(ProxyTrigger trigger, int64_t maxSpinCount = 1000000)
Push a trigger to the FIFO.
- Parameters:
trigger – Trigger to push.
maxSpinCount – Max spin count before assert. Never assert if negative.
- Returns:
Previous head of the FIFO where the trigger was pushed.
-
inline void sync(uint64_t fifoHead, int64_t maxSpinCount = 1000000)
Wait until a specific trigger is popped from the FIFO.
- Parameters:
fifoHead – FIFO head where the trigger was pushed.
maxSpinCount – Max spin count before assert. Never assert if negative.
-
inline uint64_t push(ProxyTrigger trigger, int64_t maxSpinCount = 1000000)
Warning
doxygenstruct: Cannot find class “mscclpp::ProxyTrigger” in doxygen xml output for project “mscclpp” from directory: ./doxygen/xml
-
constexpr unsigned int mscclpp::TriggerBitsFifoReserved = 1
-
constexpr unsigned int mscclpp::TriggerBitsMemoryId = 9
-
constexpr unsigned int mscclpp::TriggerBitsOffset = 32
-
constexpr unsigned int mscclpp::TriggerBitsSemaphoreId = 10
-
constexpr unsigned int mscclpp::TriggerBitsSize = 32
-
constexpr unsigned int mscclpp::TriggerBitsType = 3
-
using mscclpp::TriggerType = uint64_t
-
constexpr TriggerType mscclpp::TriggerData = 0x1
-
constexpr TriggerType mscclpp::TriggerFlag = 0x2
-
constexpr TriggerType mscclpp::TriggerSync = 0x4
Device Utilities
-
struct DeviceSemaphore
A device-wide semaphore. This semaphore can be used to control access to a resource across multiple threads or blocks. It uses atomic operations to ensure that the semaphore value is updated correctly across threads. The semaphore value is an integer that can be set, acquired, and released.
Example usage of DeviceSemaphore:
__global__ void myKernel(mscclpp::DeviceSemaphore* semaphore) { // Initialize the semaphore (allow up to 3 threads access the resource simultaneously) if (blockIdx.x == 0 && threadIdx.x == 0) { semaphore->set(3); } // Each block acquires the semaphore before accessing the shared resource if (threadIdx.x == 0) { semaphore->acquire(); } __syncthreads(); // Access the shared resource // ... __syncthreads(); // Release the semaphore after accessing the shared resource if (threadIdx.x == 0) { semaphore->release(); } }
Public Functions
-
DeviceSemaphore() = default
Construct a new DeviceSemaphore object.
-
inline DeviceSemaphore(int initialValue)
- Parameters:
initialValue – The initial value of the semaphore.
-
~DeviceSemaphore() = default
Destroy the DeviceSemaphore object.
-
inline void set(int value)
Set the semaphore value. This function is used to initialize or reset the semaphore value. The initial value is typically set to a positive integer to allow acquiring the semaphore.
- Parameters:
value – The value to set.
-
inline void acquire(int maxSpinCount = -1)
Acquire the semaphore (decrease the semaphore value by 1). This function will spin until the semaphore is acquired or the maximum spin count is reached.
- Parameters:
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
inline void release()
Release the semaphore (increase the semaphore value by 1).
-
DeviceSemaphore() = default
-
struct DeviceSyncer
A device-wide barrier. This barrier can be used to synchronize multiple thread blocks within a kernel. It uses atomic operations to ensure that all threads in the same kernel reach the barrier before proceeding and they can safely read data written by other threads in different blocks.
Example usage of DeviceSyncer:
__global__ void myKernel(mscclpp::DeviceSyncer* syncer, int numBlocks) { // Do some work here // ... // Synchronize all blocks syncer->sync(numBlocks); // All blocks have reached this point // ... }
Public Functions
-
DeviceSyncer() = default
Construct a new DeviceSyncer object.
-
~DeviceSyncer() = default
Destroy the DeviceSyncer object.
-
inline void sync(int blockNum, int64_t maxSpinCount = 100000000)
Synchronize all threads inside a kernel. Guarantee that all previous work of all threads in cooperating blocks is finished.
- Parameters:
blockNum – The number of blocks that will synchronize.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
Public Static Attributes
-
static const unsigned int NumCounters = 3U
The number of sync counters.
-
DeviceSyncer() = default
-
union VectorType
- #include <gpu_data_types.hpp>
Vector type.
Public Types
-
using ElementType = T
Public Functions
-
inline VectorType()
-
inline operator T*()
-
inline operator const T*() const
-
inline T &operator[](int i)
-
inline const T &operator[](int i) const
Public Static Attributes
-
static constexpr int Size = N
-
using ElementType = T
-
template<int Bytes>
struct Words Word array.
-
template<int Alignment = 16, bool CopyRemainder = true>
void mscclpp::copy(void *dst, void *src, uint64_t bytes, uint32_t threadId, uint32_t numThreads) Copy data from the source memory to the destination memory.
This function is intended to be collectively called by multiple threads. Each thread copies a part of elements.
Note
The source and destination addresses do not have to be aligned to the
Alignment
value, but the misalignment toAlignment
should be multiple of 4 bytes and should be the same for both source and destination addresses. The behavior of this function is undefined otherwise.Note
The number of bytes to be copied should be a multiple of 4 bytes. If the number of bytes is not a multiple of 4 bytes, the remainder bytes will not be copied.
- Template Parameters:
Alignment – The alignment of the data to be copied. A larger alignment value is more likely to achieve higher copying throughput. Should be one of 4, 8, or 16.
CopyRemainder – If false, the function will not copy data that is unaligned to the
Alignment
value. If true, the function will try to copy the unaligned data with conditions (see the notes).
- Parameters:
dst – The destination address.
src – The source address.
bytes – Bytes of the data to be copied. Should be a multiple of 4 bytes.
threadId – The index of the current thread among all threads running this function. Should be less than
numThreads
.numThreads – The total number of threads that run this function.
-
template<typename PacketType = LL16Packet>
void mscclpp::copyFromPackets(void *originPtr, const void *targetPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag, int64_t maxSpinCount = -1) Read packets from the target buffer and write retrieved data to the origin.
This function is intended to be collectively called by multiple threads. Each thread reads a part of packets.
- Template Parameters:
PacketType – The packet type. It should be either LL16Packet or LL8Packet.
- Parameters:
originPtr – The origin buffer.
targetPtr – The target buffer.
originBytes – The number of bytes to read from the origin buffer.
threadId – The index of the current thread among all threads running this function. Should be less than
numThreads
.numThreads – The total number of threads that run this function.
flag – The flag to write in the packets.
maxSpinCount – The maximum number of spin counts before asserting. Never assert if negative.
-
template<typename PacketType = LL16Packet>
void mscclpp::copyToPackets(void *targetPtr, const void *originPtr, uint64_t originBytes, uint32_t threadId, uint32_t numThreads, uint32_t flag) Read data from the origin and write packets to the target buffer.
This function is intended to be collectively called by multiple threads. Each thread copies a part of packets.
- Template Parameters:
PacketType – The packet type. It should be either LL16Packet or LL8Packet.
- Parameters:
targetPtr – The target buffer.
originPtr – The origin buffer.
originBytes – The number of bytes to write to the target buffer.
threadId – The index of the current thread among all threads running this function. Should be less than
numThreads
.numThreads – The total number of threads that run this function.
flag – The flag to write in the packets.
Atomics
-
constexpr cuda::memory_order mscclpp::memoryOrderAcqRel = cuda::memory_order_acq_rel
-
constexpr cuda::memory_order mscclpp::memoryOrderAcquire = cuda::memory_order_acquire
-
constexpr cuda::memory_order mscclpp::memoryOrderRelaxed = cuda::memory_order_relaxed
-
constexpr cuda::memory_order mscclpp::memoryOrderRelease = cuda::memory_order_release
-
constexpr cuda::memory_order mscclpp::memoryOrderSeqCst = cuda::memory_order_seq_cst
-
constexpr cuda::thread_scope mscclpp::scopeDevice = cuda::thread_scope_device
-
constexpr cuda::thread_scope mscclpp::scopeSystem = cuda::thread_scope_system
-
template<typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
T mscclpp::atomicFetchAdd(T *ptr, const T &val, cuda::memory_order memoryOrder)
Vector Data Types
-
using mscclpp::bf16x2 = VectorType<__bfloat16, 2>
-
using mscclpp::bf16x4 = VectorType<__bfloat16, 4>
-
using mscclpp::bf16x8 = VectorType<__bfloat16, 8>
-
using mscclpp::f16x2 = VectorType<__half, 2>
-
using mscclpp::f16x4 = VectorType<__half, 4>
-
using mscclpp::f16x8 = VectorType<__half, 8>
-
using mscclpp::f32x1 = VectorType<float, 1>
-
using mscclpp::f32x2 = VectorType<float, 2>
-
using mscclpp::f32x4 = VectorType<float, 4>
-
using mscclpp::f64x1 = VectorType<double, 1>
-
using mscclpp::i32x1 = VectorType<int32_t, 1>
-
using mscclpp::i32x2 = VectorType<int32_t, 2>
-
using mscclpp::i32x4 = VectorType<int32_t, 4>
-
using mscclpp::u32x1 = VectorType<uint32_t, 1>
-
using mscclpp::u32x2 = VectorType<uint32_t, 2>
-
using mscclpp::u32x4 = VectorType<uint32_t, 4>
Macro Functions
-
MSCCLPP_ASSERT_DEVICE(__cond, __msg)
Assert a condition on the device and print a message if the condition is false. This macro does nothing in a release mode build (when DEBUG_BUILD is undefined).
-
OR_POLL_MAYBE_JAILBREAK(__cond1, __cond2, __max_spin_cnt)
-
POLL_MAYBE_JAILBREAK(__cond, __max_spin_cnt)