API Reference

namespace mscclpp

Typedefs

using UniqueId = std::array<uint8_t, MSCCLPP_UNIQUE_ID_BYTES>

Unique ID for a process. This is an array of MSCCLPP_UNIQUE_ID_BYTES bytes that uniquely identifies a process.

template<class T>
using DeviceHandle = typename T::DeviceHandle

A type that can be safely used on the device side.

template<class T>
using PacketPayload = typename T::Payload

Packet value type.

template<class T>
using UniqueCudaPtr = std::unique_ptr<T, CudaDeleter<T>>

Unique device pointer that will call cudaFree on destruction.

Template Parameters:

T – Type of each element in the allocated memory.
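The alias above pairs std::unique_ptr with a custom deleter. The following is a minimal host-only sketch of that pattern, not the library's implementation: std::calloc and std::free stand in for the CUDA allocation and cudaFree, and the names FreeDeleter, UniqueMallocPtr, and allocUniqueZeroed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>

// Hypothetical stand-in for CudaDeleter<T>: releases the buffer with std::free
// (the documented deleter calls cudaFree instead).
template <class T>
struct FreeDeleter {
  void operator()(T* ptr) const { std::free(ptr); }
};

template <class T>
using UniqueMallocPtr = std::unique_ptr<T, FreeDeleter<T>>;

// Allocates `count` zeroed elements and hands ownership to the smart pointer.
template <class T>
UniqueMallocPtr<T> allocUniqueZeroed(std::size_t count = 1) {
  static_assert(std::is_trivial<T>::value, "sketch only supports trivial element types");
  assert(count > 0);
  void* raw = std::calloc(count, sizeof(T));
  if (raw == nullptr) throw std::bad_alloc();
  return UniqueMallocPtr<T>(static_cast<T*>(raw));
}
```

When the returned pointer goes out of scope, the deleter runs automatically, so the caller never frees the buffer by hand.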

template<class T>
using UniqueCudaHostPtr = std::unique_ptr<T, CudaHostDeleter<T>>

Unique CUDA host pointer that will call cudaFreeHost on destruction.

Template Parameters:

T – Type of each element in the allocated memory.

using LLPacket = LL16Packet
using ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>
using SemaphoreId = uint32_t
using MemoryId = uint32_t

Numeric ID of a RegisteredMemory. ProxyService has an internal array, indexed by these IDs, that maps each ID to the actual RegisteredMemory.

using TriggerType = uint64_t

Enums

enum class Transport

Enumerates the available transport types.

Values:

enumerator Unknown
enumerator CudaIpc
enumerator Nvls
enumerator IB0
enumerator IB1
enumerator IB2
enumerator IB3
enumerator IB4
enumerator IB5
enumerator IB6
enumerator IB7
enumerator Ethernet
enumerator NumTransports
enum class ErrorCode

Enumeration of error codes used by MSCCL++.

Values:

enumerator SystemError
enumerator InternalError
enumerator RemoteError
enumerator InvalidUsage
enumerator Timeout
enumerator Aborted
enumerator ExecutorError
enum class DataType

Values:

enumerator INT32
enumerator UINT32
enumerator FLOAT16
enumerator FLOAT32
enumerator BFLOAT16
enum class PacketType

Values:

enumerator LL8
enumerator LL16
enum class ProxyHandlerResult

Values:

enumerator Continue
enumerator FlushFifoTailAndContinue
enumerator Stop

Functions

std::string version()

Return a version string.

inline TransportFlags operator|(Transport transport1, Transport transport2)

Bitwise OR operator for two Transport objects.

Parameters:
  • transport1 – The first Transport to perform the OR operation with.

  • transport2 – The second Transport to perform the OR operation with.

Returns:

A new TransportFlags object with the result of the OR operation.

inline TransportFlags operator&(Transport transport1, Transport transport2)

Bitwise AND operator for two Transport objects.

Parameters:
  • transport1 – The first Transport to perform the AND operation with.

  • transport2 – The second Transport to perform the AND operation with.

Returns:

A new TransportFlags object with the result of the AND operation.

inline TransportFlags operator^(Transport transport1, Transport transport2)

Bitwise XOR operator for two Transport objects.

Parameters:
  • transport1 – The first Transport to perform the XOR operation with.

  • transport2 – The second Transport to perform the XOR operation with.

Returns:

A new TransportFlags object with the result of the XOR operation.

int getIBDeviceCount()

Get the number of available InfiniBand devices.

Returns:

The number of available InfiniBand devices.

std::string getIBDeviceName(Transport ibTransport)

Get the name of the InfiniBand device associated with the specified transport.

Parameters:

ibTransport – The InfiniBand transport to get the device name for.

Returns:

The name of the InfiniBand device associated with the specified transport.

Transport getIBTransportByDeviceName(const std::string &ibDeviceName)

Get the InfiniBand transport associated with the specified device name.

Parameters:

ibDeviceName – The name of the InfiniBand device to get the transport for.

Returns:

The InfiniBand transport associated with the specified device name.

template<typename T>
DeviceHandle<std::remove_reference_t<T>> deviceHandle(T &&t)

Retrieve the DeviceHandle instance from a host object.

std::string errorToString(enum ErrorCode error)

Convert an error code to a string.

Parameters:

error – The error code to convert.

Returns:

The string representation of the error code.

inline void setReadWriteMemoryAccess(void *base, size_t size)

Set the memory access permission to read-write.

Parameters:
  • base – Base memory pointer.

  • size – Size of the memory.

template<class T>
std::shared_ptr<T> allocSharedCuda(size_t count = 1)

Allocates memory on the device and returns a std::shared_ptr to it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

count – Number of elements to allocate.

Returns:

A std::shared_ptr to the allocated memory.

template<class T>
std::shared_ptr<T> allocSharedPhysicalCuda([[maybe_unused]] size_t count, [[maybe_unused]] size_t gran = 0)

Allocates physical memory on the device and returns a std::shared_ptr to it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:
  • count – Number of elements to allocate.

  • gran – The granularity of the allocation.

Returns:

A std::shared_ptr to the allocated memory.

template<class T>
std::shared_ptr<T> allocExtSharedCuda(size_t count = 1)

Allocates memory on the device and returns a std::shared_ptr to it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

count – Number of elements to allocate.

Returns:

A std::shared_ptr to the allocated memory.

template<class T>
UniqueCudaPtr<T> allocUniqueCuda(size_t count = 1)

Allocates memory on the device and returns a std::unique_ptr to it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

count – Number of elements to allocate.

Returns:

A std::unique_ptr to the allocated memory.

template<class T>
UniqueCudaPtr<T> allocExtUniqueCuda(size_t count = 1)

Allocates memory on the device and returns a std::unique_ptr to it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

count – Number of elements to allocate.

Returns:

A std::unique_ptr to the allocated memory.

template<class T, typename ...Args>
std::shared_ptr<T> makeSharedCudaHost(Args&&... args)

Allocates memory with cudaHostAlloc, constructs an object of type T in it and returns a std::shared_ptr to it.

Template Parameters:
  • T – Type of the object to construct.

  • Args – Types of the arguments to pass to the constructor.

Parameters:

args – Arguments to pass to the constructor.

Returns:

A std::shared_ptr to the allocated memory.

template<class T>
std::shared_ptr<T[]> makeSharedCudaHost(size_t count)

Allocates an array of objects of type T with cudaHostAlloc, default constructs each element and returns a std::shared_ptr to it.

Template Parameters:

T – Type of the object to construct.

Parameters:

count – Number of elements to allocate.

Returns:

A std::shared_ptr to the allocated memory.

template<class T, typename ...Args, std::enable_if_t<false == std::is_array_v<T>, bool> = true>
UniqueCudaHostPtr<T> makeUniqueCudaHost(Args&&... args)

Allocates memory with cudaHostAlloc, constructs an object of type T in it and returns a std::unique_ptr to it.

Template Parameters:
  • T – Type of the object to construct.

  • Args – Types of the arguments to pass to the constructor.

Parameters:

args – Arguments to pass to the constructor.

Returns:

A std::unique_ptr to the allocated memory.

template<class T, std::enable_if_t<true == std::is_array_v<T>, bool> = true>
UniqueCudaHostPtr<T> makeUniqueCudaHost(size_t count)

Allocates an array of objects of type T with cudaHostAlloc, default constructs each element and returns a std::unique_ptr to it.

Template Parameters:

T – Type of the object to construct.

Parameters:

count – Number of elements to allocate.

Returns:

A std::unique_ptr to the allocated memory.

template<class T>
std::unique_ptr<T> allocUniquePhysicalCuda([[maybe_unused]] size_t count, [[maybe_unused]] size_t gran = 0)

Allocates physical memory on the device and returns a memory handle along with a virtual memory handle for it. The memory is zeroed out.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:
  • count – Number of elements to allocate.

  • gran – The granularity of the allocation.

Returns:

A std::unique_ptr to the allocated memory.

template<class T>
void memcpyCudaAsync(T *dst, const T *src, size_t count, cudaStream_t stream, cudaMemcpyKind kind = cudaMemcpyDefault)

Asynchronous cudaMemcpy without capture into a CUDA graph.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:
  • dst – Destination pointer.

  • src – Source pointer.

  • count – Number of elements to copy.

  • stream – CUDA stream to use.

  • kind – Type of cudaMemcpy to perform.

template<class T>
void memcpyCuda(T *dst, const T *src, size_t count, cudaMemcpyKind kind = cudaMemcpyDefault)

Synchronous cudaMemcpy without capture into a CUDA graph.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:
  • dst – Destination pointer.

  • src – Source pointer.

  • count – Number of elements to copy.

  • kind – Type of cudaMemcpy to perform.

int getDeviceNumaNode(int cudaDev)
void numaBind(int node)
std::shared_ptr<NvlsConnection> connectNvlsCollective(std::shared_ptr<Communicator> comm, std::vector<int> allRanks, size_t bufferSize = NvlsConnection::DefaultNvlsBufferSize)

Connect to NVLS on setup.

This function connects to NVLS during setup. NVLS collectives use multicast operations to send and receive data, so all participating ranks must be placed in the collective group.

Parameters:
  • comm – The communicator.

  • allRanks – The ranks of all processes involved in the collective.

  • bufferSize – The size of the NVLS buffer. Defaults to NvlsConnection::DefaultNvlsBufferSize.

Returns:

std::shared_ptr<NvlsConnection> A shared pointer to the NVLS connection.

std::string getHostName(int maxlen, const char delim)
bool isNvlsSupported()

Variables

const std::string TransportNames[] = {"UNK", "IPC", "NVLS", "IB0", "IB1", "IB2", "IB3", "IB4", "IB5", "IB6", "IB7", "ETH", "NUM"}
const TransportFlags NoTransports

A constant TransportFlags object representing no transports.

const TransportFlags AllIBTransports

A constant TransportFlags object representing all InfiniBand transports.

const TransportFlags AllTransports

A constant TransportFlags object representing all transports.

constexpr size_t DEFAULT_FIFO_SIZE = 128
template<class>
constexpr bool dependentFalse = false
const TriggerType TriggerData = 0x1
const TriggerType TriggerFlag = 0x2
const TriggerType TriggerSync = 0x4
struct DeviceSyncer
#include <concurrency_device.hpp>

A device-wide barrier.

Public Functions

DeviceSyncer() = default

Construct a new DeviceSyncer object.

~DeviceSyncer() = default

Destroy the DeviceSyncer object.

class Bootstrap
#include <core.hpp>

Base class for bootstraps.

Subclassed by mscclpp::TcpBootstrap

class TcpBootstrap : public mscclpp::Bootstrap
#include <core.hpp>

A native implementation of the bootstrap using TCP sockets.

Public Functions

TcpBootstrap(int rank, int nRanks)

Constructor.

Parameters:
  • rank – The rank of the process.

  • nRanks – The total number of ranks.

~TcpBootstrap()

Destructor.

UniqueId getUniqueId() const

Return the unique ID stored in the TcpBootstrap.

Returns:

The unique ID stored in the TcpBootstrap.

void initialize(UniqueId uniqueId, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a given unique ID.

Parameters:
  • uniqueId – The unique ID to initialize the TcpBootstrap with.

  • timeoutSec – The connection timeout in seconds.

void initialize(const std::string &ifIpPortTrio, int64_t timeoutSec = 30)

Initialize the TcpBootstrap with a string formatted as “ip:port” or “interface:ip:port”.

Parameters:
  • ifIpPortTrio – The string formatted as “ip:port” or “interface:ip:port”.

  • timeoutSec – The connection timeout in seconds.
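The two accepted string forms can be told apart by counting colon-separated fields. The sketch below illustrates one way to split them; it is an assumption about the format, not the library's parser. ParsedAddress and parseIfIpPort are hypothetical names, and only IPv4-style addresses without embedded colons are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type; the library's internal representation is not documented here.
struct ParsedAddress {
  std::string interfaceName;  // empty when the "ip:port" form is used
  std::string ip;
  int port = 0;
};

// Splits "ip:port" or "interface:ip:port" on ':' characters.
inline ParsedAddress parseIfIpPort(const std::string& text) {
  const std::size_t last = text.rfind(':');
  if (last == std::string::npos || last == 0 || last + 1 == text.size()) {
    throw std::invalid_argument("expected ip:port or interface:ip:port");
  }
  ParsedAddress out;
  out.port = std::stoi(text.substr(last + 1));  // throws if the port is not numeric
  const std::string head = text.substr(0, last);
  const std::size_t first = head.find(':');
  if (first == std::string::npos) {
    out.ip = head;  // two fields: ip and port
  } else {
    assert(first < head.size());
    out.interfaceName = head.substr(0, first);  // three fields
    out.ip = head.substr(first + 1);
  }
  if (out.ip.empty()) throw std::invalid_argument("missing ip address");
  return out;
}
```

For example, parseIfIpPort("eth0:10.0.0.2:12345") yields the interface eth0, the address 10.0.0.2, and the port 12345.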

virtual int getRank() override

Return the rank of the process.

virtual int getNranks() override

Return the total number of ranks.

virtual int getNranksPerNode() override

Return the total number of ranks per node.

virtual void send(void *data, int size, int peer, int tag) override

Send data to another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag).

Parameters:
  • data – The data to send.

  • size – The size of the data to send.

  • peer – The rank of the process to send the data to.

  • tag – The tag to send the data with.

virtual void recv(void *data, int size, int peer, int tag) override

Receive data from another process.

Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag).

Parameters:
  • data – The buffer to write the received data to.

  • size – The size of the data to receive.

  • peer – The rank of the process to receive the data from.

  • tag – The tag to receive the data with.

virtual void allGather(void *allData, int size) override

Gather data from all processes.

When called by rank r, this sends the bytes allData[r * size] through allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received at allData[r * size] on the other ranks.

Parameters:
  • allData – The buffer to write the received data to.

  • size – The size of the data each rank sends.
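The buffer layout described for allGather() can be modelled in a single process. The sketch below is only an illustration of the layout, with no networking: buffers[r] plays the role of rank r's allData, and simulateAllGather is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Single-process model of the allGather layout. Each buffer must hold
// nRanks * size bytes, with rank r's own contribution already stored at
// byte offset r * size before the call.
inline void simulateAllGather(std::vector<std::vector<char>>& buffers, int size) {
  const int nRanks = static_cast<int>(buffers.size());
  for (const auto& b : buffers) {
    assert(b.size() == static_cast<std::size_t>(nRanks) * static_cast<std::size_t>(size));
  }
  for (int src = 0; src < nRanks; ++src) {
    for (int dst = 0; dst < nRanks; ++dst) {
      if (dst == src) continue;
      // Rank src's slice lands at the same offset in every other rank's buffer.
      std::memcpy(buffers[dst].data() + src * size, buffers[src].data() + src * size,
                  static_cast<std::size_t>(size));
    }
  }
}
```

After the call, every buffer holds the slices of all ranks in rank order.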

virtual void barrier() override

Synchronize all processes.

Public Static Functions

static UniqueId createUniqueId()

Create a random unique ID.

Returns:

The created unique ID.

class TransportFlags : private detail::TransportFlagsBase
#include <core.hpp>

Stores transport flags.

Public Functions

TransportFlags() = default

Default constructor for TransportFlags.

TransportFlags(Transport transport)

Constructor for TransportFlags that takes a Transport enum value.

Parameters:

transport – The transport to set the flag for.

bool has(Transport transport) const

Check if a specific transport flag is set.

Parameters:

transport – The transport to check the flag for.

Returns:

True if the flag is set, false otherwise.

bool none() const

Check if no transport flags are set.

Returns:

True if no flags are set, false otherwise.

bool any() const

Check if any transport flags are set.

Returns:

True if any flags are set, false otherwise.

bool all() const

Check if all transport flags are set.

Returns:

True if all flags are set, false otherwise.

size_t count() const

Get the number of transport flags that are set.

Returns:

The number of flags that are set.

TransportFlags &operator|=(TransportFlags other)

Bitwise OR assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the OR operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator|(TransportFlags other) const

Bitwise OR operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the OR operation with.

Returns:

A new TransportFlags object with the result of the OR operation.

TransportFlags operator|(Transport transport) const

Bitwise OR operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the OR operation with.

Returns:

A new TransportFlags object with the result of the OR operation.

TransportFlags &operator&=(TransportFlags other)

Bitwise AND assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the AND operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator&(TransportFlags other) const

Bitwise AND operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the AND operation with.

Returns:

A new TransportFlags object with the result of the AND operation.

TransportFlags operator&(Transport transport) const

Bitwise AND operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the AND operation with.

Returns:

A new TransportFlags object with the result of the AND operation.

TransportFlags &operator^=(TransportFlags other)

Bitwise XOR assignment operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the XOR operation with.

Returns:

A reference to the modified TransportFlags.

TransportFlags operator^(TransportFlags other) const

Bitwise XOR operator for TransportFlags.

Parameters:

other – The other TransportFlags to perform the XOR operation with.

Returns:

A new TransportFlags object with the result of the XOR operation.

TransportFlags operator^(Transport transport) const

Bitwise XOR operator for TransportFlags and Transport.

Parameters:

transport – The Transport to perform the XOR operation with.

Returns:

A new TransportFlags object with the result of the XOR operation.

TransportFlags operator~() const

Bitwise NOT operator for TransportFlags.

Returns:

A new TransportFlags object with the result of the NOT operation.

bool operator==(TransportFlags other) const

Equality comparison operator for TransportFlags.

Parameters:

other – The other TransportFlags to compare with.

Returns:

True if the two TransportFlags objects are equal, false otherwise.

bool operator!=(TransportFlags other) const

Inequality comparison operator for TransportFlags.

Parameters:

other – The other TransportFlags to compare with.

Returns:

True if the two TransportFlags objects are not equal, false otherwise.

detail::TransportFlagsBase toBitset() const

Convert the TransportFlags object to a bitset representation.

Returns:

A detail::TransportFlagsBase object representing the TransportFlags object.
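The flag operations documented for TransportFlags behave like set operations on a bitset. The sketch below models a few of them with std::bitset. It is a simplified stand-in, not the library's code: DemoTransport is a reduced, hypothetical copy of the Transport enumeration, and DemoTransportFlags implements only a subset of the members.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical, reduced copy of the Transport enumeration.
enum class DemoTransport { Unknown, CudaIpc, Nvls, IB0, IB1, Ethernet, NumTransports };

constexpr std::size_t kNumTransports = static_cast<std::size_t>(DemoTransport::NumTransports);

class DemoTransportFlags {
 public:
  DemoTransportFlags() = default;
  // Implicit on purpose, so a single transport converts to a one-bit flag set.
  DemoTransportFlags(DemoTransport t) { bits_.set(index(t)); }

  bool has(DemoTransport t) const { return bits_.test(index(t)); }
  bool none() const { return bits_.none(); }
  bool any() const { return bits_.any(); }
  std::size_t count() const { return bits_.count(); }

  DemoTransportFlags operator|(DemoTransportFlags other) const { return DemoTransportFlags(bits_ | other.bits_); }
  DemoTransportFlags operator&(DemoTransportFlags other) const { return DemoTransportFlags(bits_ & other.bits_); }
  bool operator==(DemoTransportFlags other) const { return bits_ == other.bits_; }

 private:
  explicit DemoTransportFlags(std::bitset<kNumTransports> bits) : bits_(bits) {}

  static std::size_t index(DemoTransport t) {
    assert(t != DemoTransport::NumTransports);  // the sentinel is not a real transport
    return static_cast<std::size_t>(t);
  }

  std::bitset<kNumTransports> bits_;
};

// OR of two enum values yields a flag set, like the documented free operator|.
inline DemoTransportFlags operator|(DemoTransport a, DemoTransport b) {
  return DemoTransportFlags(a) | DemoTransportFlags(b);
}
```

With these definitions, DemoTransport::CudaIpc | DemoTransport::IB0 produces a flag set for which has() is true for both transports and count() is 2.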

class RegisteredMemory
#include <core.hpp>

Represents a block of memory that has been registered to a Context.

Public Functions

RegisteredMemory() = default

Default constructor.

~RegisteredMemory()

Destructor.

void *data() const

Get a pointer to the memory block.

Returns:

A pointer to the memory block.

void *originalDataPtr() const

Get a pointer to the original memory block.

Returns:

A pointer to the original memory block.

size_t size()

Get the size of the memory block.

Returns:

The size of the memory block.

TransportFlags transports()

Get the transport flags associated with the memory block.

Returns:

The transport flags associated with the memory block.

std::vector<char> serialize()

Serialize the RegisteredMemory object to a vector of characters.

Returns:

A vector of characters representing the serialized RegisteredMemory object.

Public Static Functions

static RegisteredMemory deserialize(const std::vector<char> &data)

Deserialize a RegisteredMemory object from a vector of characters.

Parameters:

data – A vector of characters representing a serialized RegisteredMemory object.

Returns:

A deserialized RegisteredMemory object.

class Endpoint
#include <core.hpp>

Represents one end of a connection.

Public Functions

Endpoint() = default

Default constructor.

Transport transport()

Get the transport used.

Returns:

The transport used.

std::vector<char> serialize()

Serialize the Endpoint object to a vector of characters.

Returns:

A vector of characters representing the serialized Endpoint object.

Public Static Functions

static Endpoint deserialize(const std::vector<char> &data)

Deserialize an Endpoint object from a vector of characters.

Parameters:

data – A vector of characters representing a serialized Endpoint object.

Returns:

A deserialized Endpoint object.

class Connection
#include <core.hpp>

Represents a connection between two processes.

Public Functions

virtual void write(RegisteredMemory dst, uint64_t dstOffset, RegisteredMemory src, uint64_t srcOffset, uint64_t size) = 0

Write data from a source RegisteredMemory to a destination RegisteredMemory.

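Parameters:
  • dst – The destination RegisteredMemory.

  • dstOffset – The offset in bytes from the start of the destination RegisteredMemory.

  • src – The source RegisteredMemory.

  • srcOffset – The offset in bytes from the start of the source RegisteredMemory.

  • size – The number of bytes to write.
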
virtual void updateAndSync(RegisteredMemory dst, uint64_t dstOffset, uint64_t *src, uint64_t newValue) = 0

Update an 8-byte value in a destination RegisteredMemory and synchronize the change with the remote process.

Parameters:
  • dst – The destination RegisteredMemory.

  • dstOffset – The offset in bytes from the start of the destination RegisteredMemory.

  • src – A pointer to the value to update.

  • newValue – The new value to write.

virtual void flush(int64_t timeoutUsec = 3e7) = 0

Flush any pending writes to the remote process.

virtual Transport transport() = 0

Get the transport used by the local process.

Returns:

The transport used by the local process.

virtual Transport remoteTransport() = 0

Get the transport used by the remote process.

Returns:

The transport used by the remote process.

std::string getTransportName()

Get the name of the transport used for this connection.

Returns:

The name in the form transport() -> remoteTransport().

struct EndpointConfig
#include <core.hpp>

Used to configure an endpoint.

Public Functions

inline EndpointConfig()

Default constructor. Sets transport to Transport::Unknown.

inline EndpointConfig(Transport transport)

Constructor that takes a transport and sets the other fields to their default values.

Parameters:

transport – The transport to use.

class Context
#include <core.hpp>

Represents a context for communication. This provides a low-level interface for forming connections in use-cases where the process group abstraction offered by Communicator is not suitable, e.g., ephemeral client-server connections. Correct use of this class requires external synchronization when finalizing connections with the connect() method.

As an example, a client-server scenario where the server will write to the client might proceed as follows:

  1. The client creates an endpoint with createEndpoint() and sends it to the server.

  2. The server receives the client endpoint, creates its own endpoint with createEndpoint(), sends it to the client, and creates a connection with connect().

  3. The client receives the server endpoint, creates a connection with connect() and sends a RegisteredMemory to the server.

  4. The server receives the RegisteredMemory and writes to it using the previously created connection. Because the client waits to create its connection before sending the RegisteredMemory, the server cannot write to the RegisteredMemory before the connection is established.

While some transports may have more relaxed implementation behavior, this should not be relied upon.

Public Functions

Context()

Create a context.

~Context()

Destroy the context.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this context.

Parameters:
  • ptr – Base pointer to the memory.

  • size – Size of the memory region in bytes.

  • transports – Transport flags.

Returns:

RegisteredMemory A handle to the buffer.

Endpoint createEndpoint(EndpointConfig config)

Create an endpoint for establishing connections.

Parameters:

config – The configuration for the endpoint.

Returns:

The newly created endpoint.

std::shared_ptr<Connection> connect(Endpoint localEndpoint, Endpoint remoteEndpoint)

Establish a connection between two endpoints. While this method immediately returns a connection object, the connection is only safe to use after the corresponding connection on the remote endpoint has been established. This method must be called on both endpoints to establish a connection.

Parameters:
  • localEndpoint – The local endpoint.

  • remoteEndpoint – The remote endpoint.

Returns:

std::shared_ptr<Connection> A shared pointer to the connection.

struct Setuppable
#include <core.hpp>

A base class for objects that can be set up during Communicator::setup().

Public Functions

virtual void beginSetup(std::shared_ptr<Bootstrap> bootstrap)

Called inside Communicator::setup() before any call to endSetup() of any Setuppable object that is being set up within the same Communicator::setup() call.

Parameters:

bootstrap – A shared pointer to the bootstrap implementation.

virtual void endSetup(std::shared_ptr<Bootstrap> bootstrap)

Called inside Communicator::setup() after all calls to beginSetup() of all Setuppable objects that are being set up within the same Communicator::setup() call.

Parameters:

bootstrap – A shared pointer to the bootstrap implementation.

template<typename T>
class NonblockingFuture
#include <core.hpp>

A non-blocking future that can be used to check if a value is ready and retrieve it.

Public Functions

NonblockingFuture() = default

Default constructor.

inline NonblockingFuture(std::shared_future<T> &&future)

Constructor that takes a shared future and moves it into the NonblockingFuture.

Parameters:

future – The shared future to move.

inline bool ready() const

Check if the value is ready to be retrieved.

Returns:

True if the value is ready, false otherwise.

inline T get() const

Get the value.

Throws:

Error – if the value is not ready.

Returns:

The value.
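The ready() and get() behavior described above can be approximated on top of std::shared_future by polling with a zero timeout. The sketch below is an assumption about how such a wrapper might look, not the library's implementation: DemoNonblockingFuture is a hypothetical name, and std::runtime_error stands in for the library's Error type.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <stdexcept>
#include <utility>

template <typename T>
class DemoNonblockingFuture {
 public:
  DemoNonblockingFuture() = default;
  DemoNonblockingFuture(std::shared_future<T>&& future) : future_(std::move(future)) {}

  // Polls with a zero timeout, so this call never blocks.
  bool ready() const {
    return future_.valid() &&
           future_.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
  }

  // Throws instead of blocking when the value is not available yet.
  T get() const {
    if (!ready()) throw std::runtime_error("value is not ready");
    return future_.get();
  }

 private:
  std::shared_future<T> future_;
};
```

A caller would typically check ready() in a loop or after a synchronization point, and call get() only once ready() returns true.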

class Communicator
#include <core.hpp>

A class that sets up all registered memories and connections between processes.

A typical way to use this class:

  1. Call connectOnSetup() to declare connections between the calling process and other processes.

  2. Call registerMemory() to register memory regions that will be used for communication.

  3. Call sendMemoryOnSetup() or recvMemoryOnSetup() to send/receive registered memory regions to/from other processes.

  4. Call setup() to set up all registered memories and connections declared in the previous steps.

  5. Call NonblockingFuture<RegisteredMemory>::get() to get the registered memory regions received from other processes.

  6. All done; use connections and registered memories to build channels.

Public Functions

Communicator(std::shared_ptr<Bootstrap> bootstrap, std::shared_ptr<Context> context = nullptr)

Initializes the communicator with a given bootstrap implementation.

Parameters:
  • bootstrap – An implementation of the Bootstrap that the communicator will use.

  • context – An optional context to use for the communicator. If not provided, a new context will be created.

~Communicator()

Destroy the communicator.

std::shared_ptr<Bootstrap> bootstrap()

Returns the bootstrap held by this communicator.

Returns:

std::shared_ptr<Bootstrap> The bootstrap held by this communicator.

std::shared_ptr<Context> context()

Returns the context held by this communicator.

Returns:

std::shared_ptr<Context> The context held by this communicator.

RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)

Register a region of GPU memory for use in this communicator’s context.

Parameters:
  • ptr – Base pointer to the memory.

  • size – Size of the memory region in bytes.

  • transports – Transport flags.

Returns:

RegisteredMemory A handle to the buffer.

void sendMemoryOnSetup(RegisteredMemory memory, int remoteRank, int tag)

Send information of a registered memory to the remote side on setup.

This function registers a send to a remote process that will happen during a subsequent call to setup(). The send carries information about a registered memory on the local process.

Parameters:
  • memory – The registered memory buffer to send information about.

  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the send.

NonblockingFuture<RegisteredMemory> recvMemoryOnSetup(int remoteRank, int tag)

Receive memory on setup.

This function registers a receive from a remote process that will happen during a subsequent call to setup(). The receive carries information about a registered memory on the remote process.

Parameters:
  • remoteRank – The rank of the remote process.

  • tag – The tag to use for identifying the receive.

Returns:

NonblockingFuture<RegisteredMemory> A non-blocking future of registered memory.

NonblockingFuture<std::shared_ptr<Connection>> connectOnSetup(int remoteRank, int tag, EndpointConfig localConfig)

Connect to a remote rank on setup.

This function only prepares metadata for the connection. The actual connection is made by a subsequent call to setup(). Note that this function is two-way: a connection from rank i to remote rank j needs a counterpart from rank j to rank i. Note also that with IB, buffers are registered at page granularity, so if a buffer spans multiple pages without fully using all of them, the IB QP has to register every page involved. This is a potential security risk if access to the connection is given to a malicious process.

Parameters:
  • remoteRank – The rank of the remote process.

  • tag – The tag of the connection for identifying it.

  • localConfig – The configuration for the local endpoint.

Returns:

NonblockingFuture<std::shared_ptr<Connection>> A non-blocking future of a shared pointer to the connection.

int remoteRankOf(const Connection &connection)

Get the remote rank a connection is connected to.

Parameters:

connection – The connection to get the remote rank for.

Returns:

The remote rank the connection is connected to.

int tagOf(const Connection &connection)

Get the tag a connection was made with.

Parameters:

connection – The connection to get the tag for.

Returns:

The tag the connection was made with.

void onSetup(std::shared_ptr<Setuppable> setuppable)

Add a custom Setuppable object to the list of objects to be set up later, when setup() is called.

Parameters:

setuppable – A shared pointer to the Setuppable object.

void setup()

Set up all objects that have registered for setup.

This includes previous calls of sendMemoryOnSetup(), recvMemoryOnSetup(), connectOnSetup(), and onSetup(). This function may be called multiple times; the n-th call only sets up objects that were registered after the (n-1)-th call.
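The ordering guarantees described for Setuppable and setup() (every beginSetup() before any endSetup(), and each call handling only newly registered objects) can be sketched with a simple pending list. This is an illustrative model under those assumptions, not the library's implementation. DemoSetuppable and DemoCommunicator are hypothetical names, and the bootstrap argument is omitted.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Setuppable that records callback order in a shared log.
class DemoSetuppable {
 public:
  DemoSetuppable(std::string name, std::vector<std::string>* log)
      : name_(std::move(name)), log_(log) {}
  void beginSetup() { log_->push_back("begin:" + name_); }
  void endSetup() { log_->push_back("end:" + name_); }

 private:
  std::string name_;
  std::vector<std::string>* log_;
};

class DemoCommunicator {
 public:
  // Defers the object until the next setup() call.
  void onSetup(std::shared_ptr<DemoSetuppable> setuppable) {
    assert(setuppable != nullptr);
    pending_.push_back(std::move(setuppable));
  }

  // Phase 1 runs every beginSetup(); phase 2 runs every endSetup().
  // Clearing the list means a later call only sees newer registrations.
  void setup() {
    for (auto& s : pending_) s->beginSetup();
    for (auto& s : pending_) s->endSetup();
    pending_.clear();
  }

 private:
  std::vector<std::shared_ptr<DemoSetuppable>> pending_;
};
```

Registering two objects and calling setup() produces both begin callbacks followed by both end callbacks; an object registered afterwards is handled only by the next setup() call.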

class BaseError : public std::runtime_error
#include <errors.hpp>

Base class for all errors thrown by MSCCL++.

Subclassed by mscclpp::CuError, mscclpp::CudaError, mscclpp::Error, mscclpp::IbError, mscclpp::SysError

Public Functions

BaseError(const std::string &message, int errorCode)

Constructor for BaseError.

Parameters:
  • message – The error message.

  • errorCode – The error code.

explicit BaseError(int errorCode)

Constructor for BaseError.

Parameters:

errorCode – The error code.

virtual ~BaseError() = default

Virtual destructor for BaseError.

int getErrorCode() const

Get the error code.

Returns:

The error code.

const char *what() const noexcept override

Get the error message.

Returns:

The error message.
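Because every error type derives from BaseError, a caller can catch the base class and still read the error code. The sketch below illustrates that pattern with hypothetical stand-in classes (DemoBaseError, DemoSysError) and a hypothetical helper, runAndGetErrorCode; it does not reproduce the library's classes.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical mirror of BaseError: a runtime_error that also carries an integer code.
class DemoBaseError : public std::runtime_error {
 public:
  DemoBaseError(const std::string& message, int errorCode)
      : std::runtime_error(message), errorCode_(errorCode) {}
  int getErrorCode() const { return errorCode_; }

 private:
  int errorCode_;
};

// Hypothetical subclass standing in for one of the specific error types.
class DemoSysError : public DemoBaseError {
 public:
  using DemoBaseError::DemoBaseError;
};

// Runs fn and returns the code of any DemoBaseError it throws, or -1 if it does not throw.
inline int runAndGetErrorCode(void (*fn)()) {
  assert(fn != nullptr);
  try {
    fn();
  } catch (const DemoBaseError& e) {
    return e.getErrorCode();
  }
  return -1;
}
```

Catching by const reference to the base class preserves the dynamic type, so what() and getErrorCode() report the values set by the subclass.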

class Error : public mscclpp::BaseError
#include <errors.hpp>

A generic error.

class SysError : public mscclpp::BaseError
#include <errors.hpp>

An error from a system call that sets errno.

class CudaError : public mscclpp::BaseError
#include <errors.hpp>

An error from a CUDA runtime library call.

class CuError : public mscclpp::BaseError
#include <errors.hpp>

An error from a CUDA driver library call.

class IbError : public mscclpp::BaseError
#include <errors.hpp>

An error from an ibverbs library call.

class ExecutionPlan
#include <executor.hpp>
class Executor
#include <executor.hpp>
class Fifo
#include <fifo.hpp>

A class representing a host proxy FIFO that can consume work elements pushed by device threads.

Public Functions

Fifo(int size = DEFAULT_FIFO_SIZE)

Constructs a new Fifo object.

Parameters:

size – The number of entries in the FIFO.

~Fifo()

Destroys the Fifo object.

ProxyTrigger poll()

Polls the FIFO for a trigger.

Returns:

The ProxyTrigger at the head of the FIFO.

void pop()

Pops a trigger from the FIFO.

void flushTail(bool sync = false)

Flushes the tail of the FIFO.

Parameters:

sync – If true, waits for the flush to complete before returning.

int size() const

Return the FIFO size.

Returns:

The FIFO size.

FifoDeviceHandle deviceHandle()

Returns a FifoDeviceHandle object representing the device FIFO.

Returns:

A FifoDeviceHandle object representing the device FIFO.

struct ProxyTrigger
#include <fifo_device.hpp>

A struct representing a pair of 64-bit unsigned integers used as a trigger for the proxy.

This struct is used as a work element in the concurrent FIFO where multiple device threads can push ProxyTrigger elements and a single host proxy thread consumes these work elements.

Do not use the most significant bit of snd, as it is reserved for memory consistency purposes.
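
The reserved-bit rule can be checked on the host before a trigger is handed to the FIFO. This is a sketch under stated assumptions: DemoTrigger is a hypothetical stand-in for ProxyTrigger, the field name fst is assumed (the text above only names snd), and both helper functions are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ProxyTrigger: a pair of 64-bit unsigned integers.
// The name `fst` is an assumption; only `snd` is named in the reference.
struct DemoTrigger {
  uint64_t fst;
  uint64_t snd;
};

// The most significant bit of snd, which user code must leave clear.
constexpr uint64_t kReservedBit = 1ULL << 63;

// Returns true if the trigger touches the reserved bit.
inline bool usesReservedBit(const DemoTrigger& t) { return (t.snd & kReservedBit) != 0; }

// Builds a trigger and enforces the rule as a debug-time precondition.
inline DemoTrigger makeTrigger(uint64_t fst, uint64_t snd) {
  assert((snd & kReservedBit) == 0 && "MSB of snd is reserved");
  return DemoTrigger{fst, snd};
}
```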

struct FifoDeviceHandle
#include <fifo_device.hpp>

A concurrent FIFO where multiple device threads (the number of threads should not exceed the fifo size) can push work elements and a single host proxy thread consumes them.

The FIFO has a head pointer allocated on the device, which starts at 0 and can grow up to 2^64-1, a bound that is unreachable in practice. There are two copies of the tail: one on the device, FifoDeviceHandle::tailReplica, and one on the host, hostTail. The host always holds the “true” tail and occasionally pushes it to the copy on the device, so most of the time the device has a stale version. The invariant is tailReplica <= hostTail <= head. The push() function increments head, Fifo::pop() updates hostTail, and Fifo::flushTail() copies hostTail to tailReplica from time to time.

Duplicating the tail works well because the FIFO is large enough that the tail does not need frequent updates: there is usually enough space for device threads to push their work.
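
The head and tail bookkeeping described above can be modeled on the host alone. The sketch below is a single-threaded illustration, not the real implementation. DemoFifo and its members are hypothetical. The real push() runs on the device and is concurrent, whereas here push() simply reports failure when the replica shows no room. Letting pop() compare against head is a simplification, since the documented head is only accessed by the device.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded model of the head/tail bookkeeping.
class DemoFifo {
 public:
  explicit DemoFifo(uint64_t size) : slots_(size, 0) {}

  // "Device" side: take a slot only if the (possibly stale) replica shows room.
  bool push(uint64_t value) {
    if (head_ - tailReplica_ >= slots_.size()) return false;  // looks full
    slots_[head_ % slots_.size()] = value;
    ++head_;
    assert(invariantHolds());
    return true;
  }

  // "Host" side: consume one element and advance the true tail.
  bool pop(uint64_t& out) {
    if (hostTail_ == head_) return false;  // nothing to consume
    out = slots_[hostTail_ % slots_.size()];
    ++hostTail_;
    assert(invariantHolds());
    return true;
  }

  // "Host" side: publish the true tail to the replica read by push().
  void flushTail() { tailReplica_ = hostTail_; }

  // tailReplica <= hostTail <= head
  bool invariantHolds() const { return tailReplica_ <= hostTail_ && hostTail_ <= head_; }

 private:
  std::vector<uint64_t> slots_;
  uint64_t head_ = 0;
  uint64_t hostTail_ = 0;
  uint64_t tailReplica_ = 0;
};

// Fills a 2-slot FIFO and shows that a stale replica keeps rejecting
// pushes until flushTail() is called.
bool staleReplicaBlocksPush() {
  DemoFifo fifo(2);
  uint64_t v = 0;
  bool ok = fifo.push(10) && fifo.push(11);  // FIFO is now full
  ok = ok && !fifo.push(12);                 // rejected: no room
  ok = ok && fifo.pop(v) && v == 10;         // host frees one slot
  ok = ok && !fifo.push(12);                 // still rejected: replica is stale
  fifo.flushTail();
  ok = ok && fifo.push(12);                  // accepted after the flush
  return ok && fifo.invariantHolds();
}
```

The model makes the trade-off visible: a stale replica can only make the FIFO look fuller than it is, so it costs throughput but never correctness.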

Public Members

ProxyTrigger *triggers

The FIFO buffer that is allocated on the host via cudaHostAlloc().

uint64_t *tailReplica

Replica of the FIFO tail that is allocated on device.

uint64_t *head

The FIFO head. Allocated on the device and only accessed by the device.

int size

The FIFO size.

struct AvoidCudaGraphCaptureGuard
#include <gpu_utils.hpp>

A RAII guard that calls cudaThreadExchangeStreamCaptureMode to switch the calling thread to cudaStreamCaptureModeRelaxed on construction and restores the previous mode on destruction. This is helpful when we want to avoid CUDA graph capture.
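
The exchange-and-restore pattern can be illustrated without a GPU. In this sketch, DemoCaptureMode, gDemoMode and demoExchangeCaptureMode are hypothetical host-only stand-ins for the CUDA enum, the per-thread capture state, and cudaThreadExchangeStreamCaptureMode; only the swap-in, swap-out behavior is modeled.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for the CUDA capture mode and per-thread state.
enum class DemoCaptureMode { Global, ThreadLocal, Relaxed };
thread_local DemoCaptureMode gDemoMode = DemoCaptureMode::Global;

// Mimics the exchange semantics: installs *mode for the calling thread
// and writes the previously active mode back into *mode.
void demoExchangeCaptureMode(DemoCaptureMode* mode) { std::swap(gDemoMode, *mode); }

// RAII guard: switches to Relaxed on construction, restores on destruction.
struct DemoCaptureGuard {
  DemoCaptureGuard() : saved_(DemoCaptureMode::Relaxed) { demoExchangeCaptureMode(&saved_); }
  ~DemoCaptureGuard() { demoExchangeCaptureMode(&saved_); }
  DemoCaptureGuard(const DemoCaptureGuard&) = delete;
  DemoCaptureGuard& operator=(const DemoCaptureGuard&) = delete;

 private:
  DemoCaptureMode saved_;  // holds the previous mode while the guard is alive
};

// Returns true if the mode is Relaxed inside the scope and restored after it.
bool guardRestoresMode() {
  assert(gDemoMode == DemoCaptureMode::Global);  // precondition for this demo
  bool relaxedInside = false;
  {
    DemoCaptureGuard guard;
    relaxedInside = (gDemoMode == DemoCaptureMode::Relaxed);
  }
  return relaxedInside && gDemoMode == DemoCaptureMode::Global;
}
```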

struct CudaStreamWithFlags
#include <gpu_utils.hpp>

A RAII wrapper around cudaStream_t that will call cudaStreamDestroy on destruction.

template<class T>
struct CudaDeleter
#include <gpu_utils.hpp>

A deleter that calls cudaFree for use with std::unique_ptr or std::shared_ptr.

Template Parameters:

T – Type of each element in the allocated memory.
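
A deleter of this shape plugs into both smart pointer types. The sketch below shows the mechanics with host memory only: DemoDeleter is a hypothetical stand-in in which std::free takes the place of cudaFree, and gDemoFreeCalls exists purely to make the deleter calls observable.

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>

int gDemoFreeCalls = 0;  // counts deleter invocations for the demo

// Hypothetical stand-in for CudaDeleter<T>; std::free replaces cudaFree.
template <class T>
struct DemoDeleter {
  void operator()(T* ptr) const {
    std::free(ptr);
    ++gDemoFreeCalls;
  }
};

// Owns one allocation through each smart pointer type and returns how
// many times the deleter ran when both went out of scope.
int freeCallsAfterScope() {
  gDemoFreeCalls = 0;
  {
    // unique_ptr carries the deleter in its type.
    std::unique_ptr<int, DemoDeleter<int>> owner(static_cast<int*>(std::malloc(sizeof(int))));
    // shared_ptr takes the deleter as a constructor argument.
    std::shared_ptr<int> shared(static_cast<int*>(std::malloc(sizeof(int))), DemoDeleter<int>());
    assert(owner && shared);
    *owner = 1;
    *shared = 2;
  }
  return gDemoFreeCalls;
}
```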

template<class T>
struct CudaPhysicalDeleter
#include <gpu_utils.hpp>
template<class T>
struct CudaHostDeleter
#include <gpu_utils.hpp>

A deleter that calls cudaFreeHost for use with std::unique_ptr or std::shared_ptr.

Template Parameters:

T – Type of each element in the allocated memory.

class NvlsConnection
#include <nvls.hpp>

Public Functions

DeviceMulticastPointer bindAllocatedMemory(CUdeviceptr devicePtr, size_t size)

Binds memory allocated via mscclpp::allocSharedPhysicalCuda to the multicast handle. The behavior is undefined if devicePtr was not allocated by mscclpp::allocSharedPhysicalCuda.

Parameters:
  • devicePtr

  • size

Returns:

DeviceMulticastPointer with devicePtr, mcPtr and bufferSize

struct DeviceMulticastPointer
#include <nvls.hpp>
struct DeviceMulticastPointerDeviceHandle
#include <nvls_device.hpp>

Device-side handle for DeviceMulticastPointer.

union LL16Packet
#include <packet_device.hpp>

LL (low latency) protocol packet.

Public Types

using Payload = uint2

Public Members

uint32_t data1
uint32_t flag1
uint32_t data2
uint32_t flag2
struct mscclpp::LL16Packet::[anonymous] [anonymous]
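
The member layout above pairs each 32-bit data word with a 32-bit flag. A host-only sketch of how such a packet is commonly used follows. It assumes the usual low-latency protocol convention that a reader accepts the payload only when both flags equal the expected value. DemoLL16Packet, write() and tryRead() are hypothetical names, and a plain struct is used in place of the union.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in with the same member order as LL16Packet.
struct DemoLL16Packet {
  uint32_t data1 = 0, flag1 = 0, data2 = 0, flag2 = 0;

  // Writer: store both payload words together with the round's flag.
  void write(uint32_t val1, uint32_t val2, uint32_t flag) {
    data1 = val1;
    flag1 = flag;
    data2 = val2;
    flag2 = flag;
  }

  // Reader: accept the payload only if both flags match the expected flag
  // (assumed convention; a real reader would spin until this succeeds).
  bool tryRead(uint32_t flag, uint32_t& val1, uint32_t& val2) const {
    if (flag1 != flag || flag2 != flag) return false;
    val1 = data1;
    val2 = data2;
    return true;
  }
};

// Writes one packet and checks that only the matching flag can read it.
bool packetRoundTrip() {
  DemoLL16Packet pkt;
  uint32_t a = 0, b = 0;
  bool ok = !pkt.tryRead(1, a, b);  // nothing written for flag 1 yet
  pkt.write(7, 9, 1);
  ok = ok && pkt.tryRead(1, a, b) && a == 7 && b == 9;
  ok = ok && !pkt.tryRead(2, a, b);  // a different round's flag is not accepted
  return ok;
}
```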
union LL8Packet
#include <packet_device.hpp>

Public Types

using Payload = uint32_t

Public Members

uint32_t data
uint32_t flag
struct mscclpp::LL8Packet::[anonymous] [anonymous]
uint64_t raw_
class Proxy
#include <proxy.hpp>

Public Functions

Fifo &fifo()

Returns the concurrent FIFO that multiple device threads can push to and that the sole proxy thread consumes.

Returns:

The FIFO.

class BaseProxyService
#include <proxy_channel.hpp>

Base class for proxy services. Proxy services are used to proxy data between devices.

Subclassed by mscclpp::ProxyService

class ProxyService : public mscclpp::BaseProxyService
#include <proxy_channel.hpp>

Proxy service implementation.

Public Functions

ProxyService(size_t fifoSize = DEFAULT_FIFO_SIZE)

Constructor.

SemaphoreId buildAndAddSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Build and add a semaphore to the proxy service.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with the semaphore.

Returns:

The ID of the semaphore.

SemaphoreId addSemaphore(std::shared_ptr<Host2DeviceSemaphore> semaphore)

Add a semaphore to the proxy service.

Parameters:

semaphore – The semaphore to be added.

Returns:

The ID of the semaphore.

MemoryId addMemory(RegisteredMemory memory)

Register a memory region with the proxy service.

Parameters:

memory – The memory region to register.

Returns:

The ID of the memory region.

std::shared_ptr<Host2DeviceSemaphore> semaphore(SemaphoreId id) const

Get a semaphore by ID.

Parameters:

id – The ID of the semaphore.

Returns:

The semaphore.

ProxyChannel proxyChannel(SemaphoreId id)

Get a proxy channel by semaphore ID.

Parameters:

id – The ID of the semaphore.

Returns:

The proxy channel.

virtual void startProxy()

Start the proxy service.

virtual void stopProxy()

Stop the proxy service.

struct ProxyChannel
#include <proxy_channel.hpp>

Proxy channel.

Public Types

using DeviceHandle = ProxyChannelDeviceHandle

Device-side handle for ProxyChannel.

Public Functions

DeviceHandle deviceHandle() const

Returns the device-side handle.

User should make sure the ProxyChannel is not released when using the returned handle.

struct SimpleProxyChannel
#include <proxy_channel.hpp>

Simple proxy channel with a single destination and source memory region.

Public Types

using DeviceHandle = SimpleProxyChannelDeviceHandle

Device-side handle for SimpleProxyChannel.

Public Functions

SimpleProxyChannel() = default

Default constructor.

SimpleProxyChannel(ProxyChannel proxyChan, MemoryId dst, MemoryId src)

Constructor.

Parameters:
  • proxyChan – The proxy channel.

  • dst – The destination memory region.

  • src – The source memory region.

inline SimpleProxyChannel(ProxyChannel proxyChan)

Constructor.

Parameters:

proxyChan – The proxy channel.

SimpleProxyChannel(const SimpleProxyChannel &other) = default

Copy constructor.

SimpleProxyChannel &operator=(SimpleProxyChannel &other) = default

Assignment operator.

DeviceHandle deviceHandle() const

Returns the device-side handle.

User should make sure the SimpleProxyChannel is not released when using the returned handle.

union ChannelTrigger
#include <proxy_channel_device.hpp>

Basic structure of each work element in the FIFO.

Public Members

ProxyTrigger value
uint64_t size
uint64_t srcOffset
uint64_t __pad0__
uint64_t dstOffset
uint64_t srcMemoryId
uint64_t dstMemoryId
uint64_t type
uint64_t chanId
uint64_t __pad1__
uint64_t reserved
struct mscclpp::ChannelTrigger::[anonymous] fields
struct ProxyChannelDeviceHandle
#include <proxy_channel_device.hpp>
struct SimpleProxyChannelDeviceHandle
#include <proxy_channel_device.hpp>
template<template<typename> typename InboundDeleter, template<typename> typename OutboundDeleter>
class BaseSemaphore
#include <semaphore.hpp>

A base class for semaphores.

A semaphore is a synchronization mechanism that allows the local peer to wait for the remote peer to complete a data transfer. The local peer signals the remote peer that it has completed a data transfer by incrementing the outbound semaphore ID. The incremented outbound semaphore ID is copied to the remote peer’s inbound semaphore ID so that the remote peer can wait for the local peer to complete a data transfer. Vice versa, the remote peer signals the local peer that it has completed a data transfer by incrementing the remote peer’s outbound semaphore ID and copying the incremented value to the local peer’s inbound semaphore ID.

Template Parameters:
  • InboundDeleter – The deleter for inbound semaphore IDs. This is either std::default_delete for host memory or CudaDeleter for device memory.

  • OutboundDeleter – The deleter for outbound semaphore IDs. This is either std::default_delete for host memory or CudaDeleter for device memory.
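
The signaling scheme described above amounts to three counters per peer. The sketch below models it in plain host memory: DemoPeer is hypothetical, the "copy to the remote peer" is a direct assignment rather than a transfer over a connection, and the poll() rule (consume one signal when the inbound ID is ahead of the expected ID) is an assumption about how the expected inbound ID is used.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one side of a semaphore.
struct DemoPeer {
  uint64_t outbound = 0;         // incremented on every signal
  uint64_t inbound = 0;          // written by the remote peer's signal
  uint64_t expectedInbound = 0;  // number of signals already consumed

  // Increment our outbound ID and copy it into the remote peer's inbound ID.
  void signal(DemoPeer& remote) {
    ++outbound;
    remote.inbound = outbound;
  }

  // Consume one pending signal if the inbound ID is ahead of what we expect.
  bool poll() {
    if (inbound > expectedInbound) {
      ++expectedInbound;
      return true;
    }
    return false;
  }
};

// Sends two signals and counts how many the remote side can consume.
int countDeliveredSignals() {
  DemoPeer local, remote;
  assert(!remote.poll());  // nothing signaled yet
  local.signal(remote);
  local.signal(remote);
  int delivered = 0;
  while (remote.poll()) ++delivered;
  return delivered;
}
```

Because the IDs only ever increase, a receiver that falls behind still observes every signal once it catches up.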

Public Functions

inline BaseSemaphore(std::unique_ptr<uint64_t, InboundDeleter<uint64_t>> localInboundSemaphoreId, std::unique_ptr<uint64_t, InboundDeleter<uint64_t>> expectedInboundSemaphoreId, std::unique_ptr<uint64_t, OutboundDeleter<uint64_t>> outboundSemaphoreId)

Constructs a BaseSemaphore.

Parameters:
  • localInboundSemaphoreId – The inbound semaphore ID

  • expectedInboundSemaphoreId – The expected inbound semaphore ID

  • outboundSemaphoreId – The outbound semaphore ID

class Host2DeviceSemaphore : public mscclpp::BaseSemaphore<CudaDeleter, std::default_delete>
#include <semaphore.hpp>

A semaphore for sending signals from the host to the device.

Public Types

using DeviceHandle = Host2DeviceSemaphoreDeviceHandle

Device-side handle for Host2DeviceSemaphore.

Public Functions

Host2DeviceSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with this semaphore.

std::shared_ptr<Connection> connection()

Returns the connection.

Returns:

The connection associated with this semaphore.

void signal()

Signal the device.

DeviceHandle deviceHandle()

Returns the device-side handle.

class Host2HostSemaphore : public mscclpp::BaseSemaphore<std::default_delete, std::default_delete>
#include <semaphore.hpp>

A semaphore for sending signals from the local host to a remote host.

Public Functions

Host2HostSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with this semaphore.

std::shared_ptr<Connection> connection()

Returns the connection.

Returns:

The connection associated with this semaphore.

void signal()

Signal the remote host.

bool poll()

Check if the remote host has signaled.

Returns:

true if the remote host has signaled.

void wait(int64_t maxSpinCount = 10000000)

Wait for the remote host to signal.

Parameters:

maxSpinCount – The maximum number of spin counts before throwing an exception. Never throws if negative.
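
The maxSpinCount contract (throw after too many unsuccessful polls, never throw when negative) can be sketched as a bounded spin loop. Everything here is illustrative: demoWait is a hypothetical free function, the poll callback stands in for Host2HostSemaphore::poll(), and std::runtime_error stands in for whatever exception the library actually throws.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>

// Spins until poll() returns true. Throws once maxSpinCount unsuccessful
// polls have been exceeded; a negative maxSpinCount disables the limit.
void demoWait(const std::function<bool()>& poll, int64_t maxSpinCount = 10000000) {
  int64_t spins = 0;
  while (!poll()) {
    if (maxSpinCount >= 0 && spins++ >= maxSpinCount) {
      throw std::runtime_error("demoWait: spin limit exceeded");  // stand-in exception type
    }
  }
}

// Returns true if waiting on a signal that never arrives hits the limit.
bool waitTimesOut(int64_t maxSpinCount) {
  assert(maxSpinCount >= 0);  // a negative limit would spin forever here
  try {
    demoWait([] { return false; }, maxSpinCount);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}

// Waits without a limit on a signal that arrives after `signalAfter`
// unsuccessful polls, and returns the total number of polls made.
int pollsUntilSignaled(int signalAfter) {
  assert(signalAfter >= 0);
  int polls = 0;
  demoWait([&] { return ++polls > signalAfter; }, -1);
  return polls;
}
```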

class SmDevice2DeviceSemaphore : public mscclpp::BaseSemaphore<CudaDeleter, CudaDeleter>
#include <semaphore.hpp>

A semaphore for sending signals from the local device to a peer device via SM.

Public Types

using DeviceHandle = SmDevice2DeviceSemaphoreDeviceHandle

Device-side handle for SmDevice2DeviceSemaphore.

Public Functions

SmDevice2DeviceSemaphore(Communicator &communicator, std::shared_ptr<Connection> connection)

Constructor.

Parameters:
  • communicator – The communicator.

  • connection – The connection associated with this semaphore.

SmDevice2DeviceSemaphore() = delete

Deleted default constructor.

DeviceHandle deviceHandle() const

Returns the device-side handle.

struct Host2DeviceSemaphoreDeviceHandle
#include <semaphore_device.hpp>

Device-side handle for Host2DeviceSemaphore.

struct SmDevice2DeviceSemaphoreDeviceHandle
#include <semaphore_device.hpp>

Device-side handle for SmDevice2DeviceSemaphore.

struct SmChannel
#include <sm_channel.hpp>

Channel for accessing peer memory directly from SM.

Public Types

using DeviceHandle = SmChannelDeviceHandle

Device-side handle for SmChannel.

Public Functions

SmChannel() = default

Default constructor.

SmChannel(std::shared_ptr<SmDevice2DeviceSemaphore> semaphore, RegisteredMemory dst, void *src, void *getPacketBuffer = nullptr)

Constructor.

Parameters:
  • semaphore – The semaphore used to synchronize the communication.

  • dst – Registered memory of the destination.

  • src – The source memory address.

  • getPacketBuffer – The optional buffer used for getPackets().

DeviceHandle deviceHandle() const

Returns the device-side handle.

User should make sure the SmChannel is not released when using the returned handle.

struct SmChannelDeviceHandle
#include <sm_channel_device.hpp>

Channel for accessing peer memory directly from SM.

struct Timer
#include <utils.hpp>

Subclassed by mscclpp::ScopedTimer

Public Functions

int64_t elapsed() const

Returns the elapsed time in microseconds.
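
A timer with this interface can be built on std::chrono. The sketch below is a hypothetical DemoTimer, not the library's Timer: it assumes the clock starts at construction and uses std::chrono::steady_clock, neither of which is stated in this reference.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for Timer; the start point is taken at construction.
struct DemoTimer {
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

  // Elapsed time since construction, in microseconds.
  int64_t elapsed() const {
    auto delta = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::microseconds>(delta).count();
  }
};

// Two consecutive readings from a steady clock never go backwards.
bool elapsedIsMonotonic() {
  DemoTimer timer;
  int64_t first = timer.elapsed();
  int64_t second = timer.elapsed();
  assert(first >= 0);
  return second >= first;
}
```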

struct ScopedTimer : public mscclpp::Timer
#include <utils.hpp>
namespace detail

Typedefs

using TransportFlagsBase = std::bitset<TransportFlagsSize>

Bitset for storing transport flags.
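
A bitset of this size gives each transport one bit, so sets of transports combine with bitwise operators. The sketch below assumes that bit positions follow the enumerator order of Transport (Unknown = 0 through Ethernet = 11). DemoTransport lists only a few enumerators, and all names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDemoFlagsSize = 12;  // same value as TransportFlagsSize
using DemoFlags = std::bitset<kDemoFlagsSize>;

// Assumption: bit positions follow the enumerator order of Transport.
enum class DemoTransport : std::size_t { Unknown = 0, CudaIpc = 1, Nvls = 2, IB0 = 3, Ethernet = 11 };

// Returns a flag set with only the bit for `t` set.
DemoFlags flagOf(DemoTransport t) {
  DemoFlags flags;
  flags.set(static_cast<std::size_t>(t));
  return flags;
}

// Tests whether `flags` includes transport `t`.
bool hasTransport(const DemoFlags& flags, DemoTransport t) {
  return flags.test(static_cast<std::size_t>(t));
}

// Combines two transports and returns how many bits are set.
std::size_t demoCombinedCount() {
  DemoFlags both = flagOf(DemoTransport::CudaIpc) | flagOf(DemoTransport::IB0);
  assert(hasTransport(both, DemoTransport::IB0));
  assert(!hasTransport(both, DemoTransport::Ethernet));
  return both.count();
}
```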

Functions

template<class T>
T *cudaCalloc(size_t nelem)

A wrapper of cudaMalloc that sets the allocated memory to zero.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

nelem – Number of elements to allocate.

Returns:

A pointer to the allocated memory.

template<class T>
T *cudaExtCalloc(size_t nelem)
template<class T>
T *cudaHostCalloc(size_t nelem)

A wrapper of cudaHostAlloc that sets the allocated memory to zero.

Template Parameters:

T – Type of each element in the allocated memory.

Parameters:

nelem – Number of elements to allocate.

Returns:

A pointer to the allocated memory.

template<class T, T*, class Deleter, class Memory>
Memory safeAlloc(size_t nelem)

A template function that allocates memory while ensuring that the memory will be freed when the returned object is destroyed.

Template Parameters:
  • T – Type of each element in the allocated memory.

  • alloc – A function that allocates memory.

  • Deleter – A deleter that will be used to free the allocated memory.

  • Memory – The type of the returned object.

Parameters:

nelem – Number of elements to allocate.

Returns:

An object of type Memory that will free the allocated memory when destroyed.
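
The four template parameters fit together as sketched below. This is an illustration of the pattern, not the library's code: demoSafeAlloc, demoCalloc and DemoFree are hypothetical host-memory stand-ins, std::calloc and std::free replace the CUDA calls, and throwing std::bad_alloc on a null result is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical allocator: zero-initialized host memory for nelem elements.
template <class T>
T* demoCalloc(std::size_t nelem) {
  return static_cast<T*>(std::calloc(nelem, sizeof(T)));
}

// Hypothetical deleter matching demoCalloc.
template <class T>
struct DemoFree {
  void operator()(T* ptr) const { std::free(ptr); }
};

// Sketch of the pattern: the allocation function is a template parameter,
// and the raw pointer is wrapped in an owning Memory object before it
// leaves the function.
template <class T, T*(alloc)(std::size_t), class Deleter, class Memory>
Memory demoSafeAlloc(std::size_t nelem) {
  T* ptr = alloc(nelem);
  if (ptr == nullptr) throw std::bad_alloc();  // assumption: fail loudly on null
  return Memory(ptr, Deleter());
}

// Allocates eight zeroed ints through the helper and sums them.
int sumZeroedBuffer() {
  using Buffer = std::unique_ptr<int, DemoFree<int>>;
  Buffer buf = demoSafeAlloc<int, demoCalloc<int>, DemoFree<int>, Buffer>(8);
  assert(buf != nullptr);
  int sum = 0;
  for (int i = 0; i < 8; ++i) sum += buf.get()[i];
  return sum;  // the buffer is freed when buf goes out of scope
}
```

Choosing Memory at the call site lets the same helper return either a std::unique_ptr or a std::shared_ptr, as long as the type can be constructed from a raw pointer and a deleter.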

template<class T, T*, class Deleter, class Memory>
Memory safeAlloc(size_t nelem, size_t gran)

Variables

const size_t TransportFlagsSize = 12