API Reference
-
namespace mscclpp
Typedefs
-
using UniqueId = std::array<uint8_t, MSCCLPP_UNIQUE_ID_BYTES>
Unique ID for a process. This is an MSCCLPP_UNIQUE_ID_BYTES-byte array that uniquely identifies a process.
-
template<class T>
using DeviceHandle = typename T::DeviceHandle
A type that can be safely used on the device side.
-
template<class T>
using UniqueCudaPtr = std::unique_ptr<T, CudaDeleter<T>>
Unique device pointer that will call cudaFree on destruction.
- Template Parameters:
T – Type of each element in the allocated memory.
-
template<class T>
using UniqueCudaHostPtr = std::unique_ptr<T, CudaHostDeleter<T>>
Unique CUDA host pointer that will call cudaFreeHost on destruction.
- Template Parameters:
T – Type of each element in the allocated memory.
-
using LLPacket = LL16Packet
-
using ProxyHandler = std::function<ProxyHandlerResult(ProxyTrigger)>
-
using SemaphoreId = uint32_t
-
using MemoryId = uint32_t
Numeric ID of RegisteredMemory. ProxyService has an internal array, indexed by these IDs, that maps each ID to the actual RegisteredMemory.
-
using TriggerType = uint64_t
Enums
-
enum class Transport
Enumerates the available transport types.
Values:
-
enumerator Unknown
-
enumerator CudaIpc
-
enumerator Nvls
-
enumerator IB0
-
enumerator IB1
-
enumerator IB2
-
enumerator IB3
-
enumerator IB4
-
enumerator IB5
-
enumerator IB6
-
enumerator IB7
-
enumerator Ethernet
-
enumerator NumTransports
-
enum class ErrorCode
Enumeration of error codes used by MSCCL++.
Values:
-
enumerator SystemError
-
enumerator InternalError
-
enumerator RemoteError
-
enumerator InvalidUsage
-
enumerator Timeout
-
enumerator Aborted
-
enumerator ExecutorError
Functions
-
std::string version()
Return a version string.
-
inline TransportFlags operator|(Transport transport1, Transport transport2)
Bitwise OR operator for two Transport objects.
- Parameters:
transport1 – The first Transport to perform the OR operation with.
transport2 – The second Transport to perform the OR operation with.
- Returns:
A new TransportFlags object with the result of the OR operation.
-
inline TransportFlags operator&(Transport transport1, Transport transport2)
Bitwise AND operator for two Transport objects.
- Parameters:
transport1 – The first Transport to perform the AND operation with.
transport2 – The second Transport to perform the AND operation with.
- Returns:
A new TransportFlags object with the result of the AND operation.
-
inline TransportFlags operator^(Transport transport1, Transport transport2)
Bitwise XOR operator for two Transport objects.
- Parameters:
transport1 – The first Transport to perform the XOR operation with.
transport2 – The second Transport to perform the XOR operation with.
- Returns:
A new TransportFlags object with the result of the XOR operation.
-
int getIBDeviceCount()
Get the number of available InfiniBand devices.
- Returns:
The number of available InfiniBand devices.
-
std::string getIBDeviceName(Transport ibTransport)
Get the name of the InfiniBand device associated with the specified transport.
- Parameters:
ibTransport – The InfiniBand transport to get the device name for.
- Returns:
The name of the InfiniBand device associated with the specified transport.
-
Transport getIBTransportByDeviceName(const std::string &ibDeviceName)
Get the InfiniBand transport associated with the specified device name.
- Parameters:
ibDeviceName – The name of the InfiniBand device to get the transport for.
- Returns:
The InfiniBand transport associated with the specified device name.
-
template<typename T>
DeviceHandle<std::remove_reference_t<T>> deviceHandle(T &&t)
Retrieve the DeviceHandle instance from a host object.
-
std::string errorToString(enum ErrorCode error)
Convert an error code to a string.
- Parameters:
error – The error code to convert.
- Returns:
The string representation of the error code.
Allocates memory on the device and returns a std::shared_ptr to it. The memory is zeroed out.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::shared_ptr to the allocated memory.
Allocates physical memory on the device and returns a memory handle along with a virtual memory handle for it. Deallocation happens only when the PhysicalCudaMemory goes out of scope.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
gran – the granularity of the allocation.
- Returns:
A std::shared_ptr to the memory handle and a device pointer for that memory.
Allocates memory on the device and returns a std::shared_ptr to it. The memory is zeroed out.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::shared_ptr to the allocated memory.
-
template<class T>
UniqueCudaPtr<T> allocUniqueCuda(size_t count = 1)
Allocates memory on the device and returns a std::unique_ptr to it. The memory is zeroed out.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::unique_ptr to the allocated memory.
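The ownership pattern behind UniqueCudaPtr and allocUniqueCuda can be illustrated without a GPU. The following is a minimal host-only sketch, not the library's implementation: HostDeleter, UniqueHostPtr and allocUniqueHost are hypothetical names, and std::calloc and std::free stand in for the device allocation and cudaFree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical stand-in for CudaDeleter<T>: releases the buffer with std::free
// where the real deleter would call cudaFree.
template <class T>
struct HostDeleter {
  void operator()(T* ptr) const { std::free(ptr); }
};

// Hypothetical stand-in for UniqueCudaPtr<T>.
template <class T>
using UniqueHostPtr = std::unique_ptr<T, HostDeleter<T>>;

// Allocates `count` zero-initialized elements and hands ownership to the caller.
// The buffer is released automatically when the returned pointer is destroyed.
template <class T>
UniqueHostPtr<T> allocUniqueHost(std::size_t count = 1) {
  assert(count > 0);
  void* raw = std::calloc(count, sizeof(T));
  if (raw == nullptr) throw std::bad_alloc();
  return UniqueHostPtr<T>(static_cast<T*>(raw));
}
```

Because the deleter is part of the pointer type, the caller never has to pair the allocation with an explicit free call.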
-
template<class T>
std::unique_ptr<PhysicalCudaMemory<T>> allocUniquePhysicalCuda(size_t count, size_t gran)
Allocates physical memory on the device and returns a memory handle along with a virtual memory handle for it. The memory is zeroed out.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
gran – the granularity of the allocation.
- Returns:
A std::unique_ptr to the memory handle and a device pointer for that memory.
-
template<class T>
UniqueCudaPtr<T> allocExtUniqueCuda(size_t count = 1)
Allocates memory on the device and returns a std::unique_ptr to it. The memory is zeroed out.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::unique_ptr to the allocated memory.
Allocates memory with cudaHostAlloc, constructs an object of type T in it and returns a std::shared_ptr to it.
- Template Parameters:
T – Type of the object to construct.
Args – Types of the arguments to pass to the constructor.
- Parameters:
args – Arguments to pass to the constructor.
- Returns:
A std::shared_ptr to the allocated memory.
Allocates an array of objects of type T with cudaHostAlloc, default constructs each element and returns a std::shared_ptr to it.
- Template Parameters:
T – Type of the object to construct.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::shared_ptr to the allocated memory.
-
template<class T, typename ...Args, std::enable_if_t<false == std::is_array_v<T>, bool> = true>
UniqueCudaHostPtr<T> makeUniqueCudaHost(Args&&... args)
Allocates memory with cudaHostAlloc, constructs an object of type T in it and returns a std::unique_ptr to it.
- Template Parameters:
T – Type of the object to construct.
Args – Types of the arguments to pass to the constructor.
- Parameters:
args – Arguments to pass to the constructor.
- Returns:
A std::unique_ptr to the allocated memory.
-
template<class T, std::enable_if_t<true == std::is_array_v<T>, bool> = true>
UniqueCudaHostPtr<T> makeUniqueCudaHost(size_t count)
Allocates an array of objects of type T with cudaHostAlloc, default constructs each element and returns a std::unique_ptr to it.
- Template Parameters:
T – Type of the object to construct.
- Parameters:
count – Number of elements to allocate.
- Returns:
A std::unique_ptr to the allocated memory.
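The two makeUniqueCudaHost overloads are selected by whether T is an array type. The sketch below reproduces that dispatch on the host only, assuming std::malloc and std::free in place of cudaHostAlloc and cudaFreeHost. FreeDeleter, UniqueMallocPtr and makeUniqueMalloc are hypothetical names, and the sketch is restricted to trivially destructible types because the deleter does not run destructors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

// Hypothetical deleter: std::free stands in for cudaFreeHost.
template <class T>
struct FreeDeleter {
  void operator()(std::remove_extent_t<T>* ptr) const { std::free(ptr); }
};

template <class T>
using UniqueMallocPtr = std::unique_ptr<T, FreeDeleter<T>>;

// Single-object form: chosen when T is not an array type.
template <class T, typename... Args, std::enable_if_t<false == std::is_array_v<T>, bool> = true>
UniqueMallocPtr<T> makeUniqueMalloc(Args&&... args) {
  static_assert(std::is_trivially_destructible_v<T>, "sketch does not run destructors");
  void* raw = std::malloc(sizeof(T));
  if (raw == nullptr) throw std::bad_alloc();
  // Construct the object in place, forwarding the constructor arguments.
  return UniqueMallocPtr<T>(new (raw) T(std::forward<Args>(args)...));
}

// Array form: chosen when T is an array type such as int[].
template <class T, std::enable_if_t<true == std::is_array_v<T>, bool> = true>
UniqueMallocPtr<T> makeUniqueMalloc(std::size_t count) {
  using Elem = std::remove_extent_t<T>;
  static_assert(std::is_trivially_destructible_v<Elem>, "sketch does not run destructors");
  assert(count > 0);
  void* raw = std::malloc(sizeof(Elem) * count);
  if (raw == nullptr) throw std::bad_alloc();
  Elem* elems = static_cast<Elem*>(raw);
  // Value-initialize each element in place.
  for (std::size_t i = 0; i < count; ++i) new (elems + i) Elem();
  return UniqueMallocPtr<T>(elems);
}
```

With this shape, makeUniqueMalloc<int>(42) builds one int holding 42, while makeUniqueMalloc<int[]>(4) builds four value-initialized ints.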
-
template<class T>
void memcpyCudaAsync(T *dst, const T *src, size_t count, cudaStream_t stream, cudaMemcpyKind kind = cudaMemcpyDefault)
Asynchronous cudaMemcpy without capture into a CUDA graph.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
dst – Destination pointer.
src – Source pointer.
count – Number of elements to copy.
stream – CUDA stream to use.
kind – Type of cudaMemcpy to perform.
-
template<class T>
void memcpyCuda(T *dst, const T *src, size_t count, cudaMemcpyKind kind = cudaMemcpyDefault)
Synchronous cudaMemcpy without capture into a CUDA graph.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
dst – Destination pointer.
src – Source pointer.
count – Number of elements to copy.
kind – Type of cudaMemcpy to perform.
-
int getDeviceNumaNode(int cudaDev)
-
void numaBind(int node)
Connect to NVLS on setup.
This function is used to connect to NVLS on setup. NVLS collectives use multicast operations to send and receive data, so all involved ranks must be placed into the collective group.
- Parameters:
comm – The communicator.
allRanks – The ranks of all processes involved in the collective.
config – The configuration for the local endpoint.
- Returns:
std::shared_ptr<NvlsConnection> A shared pointer to the NVLS connection.
-
std::string getHostName(int maxlen, const char delim)
-
bool isNvlsSupported()
Variables
-
const std::string TransportNames[] = {"UNK", "IPC", "NVLS", "IB0", "IB1", "IB2", "IB3", "IB4", "IB5", "IB6", "IB7", "ETH", "NUM"}
-
const TransportFlags NoTransports
A constant TransportFlags object representing no transports.
-
const TransportFlags AllIBTransports
A constant TransportFlags object representing all InfiniBand transports.
-
const TransportFlags AllTransports
A constant TransportFlags object representing all transports.
-
constexpr size_t DEFAULT_FIFO_SIZE = 128
-
template<class>
constexpr bool dependentFalse = false
-
const TriggerType TriggerData = 0x1
-
const TriggerType TriggerFlag = 0x2
-
const TriggerType TriggerSync = 0x4
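TriggerData, TriggerFlag and TriggerSync are distinct single-bit values, so they can be combined and tested with bitwise operators. The sketch below redefines the three constants locally to stay self-contained; hasTrigger is a hypothetical helper, and how the library interprets a combined value is not specified in this section.

```cpp
#include <cassert>
#include <cstdint>

using TriggerType = std::uint64_t;

// Values copied from the listing above; redefined here so the sketch compiles on its own.
constexpr TriggerType TriggerData = 0x1;
constexpr TriggerType TriggerFlag = 0x2;
constexpr TriggerType TriggerSync = 0x4;

// Hypothetical helper: true if every bit of `wanted` is set in `mask`.
constexpr bool hasTrigger(TriggerType mask, TriggerType wanted) {
  return (mask & wanted) == wanted;
}
```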
-
struct DeviceSyncer
- #include <concurrency_device.hpp>
A device-wide barrier.
Public Functions
-
DeviceSyncer() = default
Construct a new DeviceSyncer object.
-
~DeviceSyncer() = default
Destroy the DeviceSyncer object.
-
class Bootstrap
- #include <core.hpp>
Base class for bootstraps.
Subclassed by mscclpp::TcpBootstrap
-
class TcpBootstrap : public mscclpp::Bootstrap
- #include <core.hpp>
A native implementation of the bootstrap using TCP sockets.
Public Functions
-
TcpBootstrap(int rank, int nRanks)
Constructor.
- Parameters:
rank – The rank of the process.
nRanks – The total number of ranks.
-
~TcpBootstrap()
Destructor.
-
UniqueId getUniqueId() const
Return the unique ID stored in the TcpBootstrap.
- Returns:
The unique ID stored in the TcpBootstrap.
-
void initialize(UniqueId uniqueId, int64_t timeoutSec = 30)
Initialize the TcpBootstrap with a given unique ID.
- Parameters:
uniqueId – The unique ID to initialize the TcpBootstrap with.
timeoutSec – The connection timeout in seconds.
-
void initialize(const std::string &ifIpPortTrio, int64_t timeoutSec = 30)
Initialize the TcpBootstrap with a string formatted as “ip:port” or “interface:ip:port”.
- Parameters:
ifIpPortTrio – The string formatted as “ip:port” or “interface:ip:port”.
timeoutSec – The connection timeout in seconds.
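The two accepted string shapes can be told apart by counting separators. The following is a sketch of such a split, not the library's parser: IfIpPort and parseIfIpPortTrio are hypothetical names, and the sketch assumes the ip part contains no ':' of its own, so IPv6 literals are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical result type; not part of the documented API.
struct IfIpPort {
  std::string ifName;  // empty when the "ip:port" form is used
  std::string ip;
  int port = 0;
};

// Hypothetical parser for "ip:port" or "interface:ip:port".
IfIpPort parseIfIpPortTrio(const std::string& text) {
  const std::size_t last = text.rfind(':');
  if (last == std::string::npos) {
    throw std::invalid_argument("expected ip:port or interface:ip:port");
  }
  const std::size_t first = text.find(':');
  IfIpPort out;
  if (first != last) {
    // Two separators are allowed; a third one is rejected.
    if (text.find(':', first + 1) != last) {
      throw std::invalid_argument("too many ':' separators");
    }
    out.ifName = text.substr(0, first);
    out.ip = text.substr(first + 1, last - first - 1);
  } else {
    out.ip = text.substr(0, last);
  }
  // std::stoi throws std::invalid_argument if the port is missing or not numeric.
  out.port = std::stoi(text.substr(last + 1));
  return out;
}
```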
-
virtual int getRank() override
Return the rank of the process.
-
virtual int getNranks() override
Return the total number of ranks.
-
virtual int getNranksPerNode() override
Return the total number of ranks per node.
-
virtual void send(void *data, int size, int peer, int tag) override
Send data to another process.
Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag).
- Parameters:
data – The data to send.
size – The size of the data to send.
peer – The rank of the process to send the data to.
tag – The tag to send the data with.
-
virtual void recv(void *data, int size, int peer, int tag) override
Receive data from another process.
Data sent via send(senderBuff, size, receiverRank, tag) can be received via recv(receiverBuff, size, senderRank, tag).
- Parameters:
data – The buffer to write the received data to.
size – The size of the data to receive.
peer – The rank of the process to receive the data from.
tag – The tag to receive the data with.
-
virtual void allGather(void *allData, int size) override
Gather data from all processes.
When called by rank r, this sends data from allData[r * size] to allData[(r + 1) * size - 1] to all other ranks. The data sent by rank r is received into allData[r * size] of other ranks.
- Parameters:
allData – The buffer to write the received data to.
size – The size of the data each rank sends.
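The buffer layout can be checked with a small in-process model. This is a sketch under stated assumptions, not the TCP implementation: simulateAllGather is a hypothetical function, buffers[r] plays the role of rank r's allData, and each rank is assumed to have already written its own contribution at byte offset r * size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical model of the allGather data movement across nRanks buffers.
void simulateAllGather(std::vector<std::vector<char>>& buffers, std::size_t size) {
  const std::size_t nRanks = buffers.size();
  for (const auto& buf : buffers) {
    // Every rank needs room for one slice per rank.
    assert(buf.size() >= nRanks * size);
  }
  for (std::size_t src = 0; src < nRanks; ++src) {
    for (std::size_t dst = 0; dst < nRanks; ++dst) {
      if (dst == src) continue;
      // Rank src's slice lands at the same offset, src * size, on every other rank.
      std::memcpy(buffers[dst].data() + src * size, buffers[src].data() + src * size, size);
    }
  }
}
```

After the call, every buffer holds the same nRanks * size bytes, with slice r contributed by rank r.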
-
virtual void barrier() override
Synchronize all processes.
-
class TransportFlags : private detail::TransportFlagsBase
- #include <core.hpp>
Stores transport flags.
Public Functions
-
TransportFlags() = default
Default constructor for TransportFlags.
-
TransportFlags(Transport transport)
Constructor for TransportFlags that takes a Transport enum value.
- Parameters:
transport – The transport to set the flag for.
-
bool has(Transport transport) const
Check if a specific transport flag is set.
- Parameters:
transport – The transport to check the flag for.
- Returns:
True if the flag is set, false otherwise.
-
bool none() const
Check if no transport flags are set.
- Returns:
True if no flags are set, false otherwise.
-
bool any() const
Check if any transport flags are set.
- Returns:
True if any flags are set, false otherwise.
-
bool all() const
Check if all transport flags are set.
- Returns:
True if all flags are set, false otherwise.
-
size_t count() const
Get the number of transport flags that are set.
- Returns:
The number of flags that are set.
-
TransportFlags &operator|=(TransportFlags other)
Bitwise OR assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the OR operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator|(TransportFlags other) const
Bitwise OR operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the OR operation with.
- Returns:
A new TransportFlags object with the result of the OR operation.
-
TransportFlags operator|(Transport transport) const
Bitwise OR operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the OR operation with.
- Returns:
A new TransportFlags object with the result of the OR operation.
-
TransportFlags &operator&=(TransportFlags other)
Bitwise AND assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the AND operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator&(TransportFlags other) const
Bitwise AND operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the AND operation with.
- Returns:
A new TransportFlags object with the result of the AND operation.
-
TransportFlags operator&(Transport transport) const
Bitwise AND operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the AND operation with.
- Returns:
A new TransportFlags object with the result of the AND operation.
-
TransportFlags &operator^=(TransportFlags other)
Bitwise XOR assignment operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the XOR operation with.
- Returns:
A reference to the modified TransportFlags.
-
TransportFlags operator^(TransportFlags other) const
Bitwise XOR operator for TransportFlags.
- Parameters:
other – The other TransportFlags to perform the XOR operation with.
- Returns:
A new TransportFlags object with the result of the XOR operation.
-
TransportFlags operator^(Transport transport) const
Bitwise XOR operator for TransportFlags and Transport.
- Parameters:
transport – The Transport to perform the XOR operation with.
- Returns:
A new TransportFlags object with the result of the XOR operation.
-
TransportFlags operator~() const
Bitwise NOT operator for TransportFlags.
- Returns:
A new TransportFlags object with the result of the NOT operation.
-
bool operator==(TransportFlags other) const
Equality comparison operator for TransportFlags.
- Parameters:
other – The other TransportFlags to compare with.
- Returns:
True if the two TransportFlags objects are equal, false otherwise.
-
bool operator!=(TransportFlags other) const
Inequality comparison operator for TransportFlags.
- Parameters:
other – The other TransportFlags to compare with.
- Returns:
True if the two TransportFlags objects are not equal, false otherwise.
-
detail::TransportFlagsBase toBitset() const
Convert the TransportFlags object to a bitset representation.
- Returns:
A detail::TransportFlagsBase object representing the TransportFlags object.
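The operators above behave like set operations over Transport values. The sketch below models that behavior with one bit per enumerator; it is an illustration, not the library header. Flags is a hypothetical stand-in for TransportFlags, and sizing the bitset by NumTransports is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Local copy of the enumerators listed for Transport.
enum class Transport { Unknown, CudaIpc, Nvls, IB0, IB1, IB2, IB3, IB4, IB5, IB6, IB7, Ethernet, NumTransports };

// Hypothetical stand-in for TransportFlags: one bit per Transport value.
class Flags {
 public:
  Flags() = default;
  Flags(Transport t) { bits_.set(static_cast<std::size_t>(t)); }

  bool has(Transport t) const { return bits_.test(static_cast<std::size_t>(t)); }
  bool none() const { return bits_.none(); }
  std::size_t count() const { return bits_.count(); }

  Flags operator|(Flags other) const {
    Flags result;
    result.bits_ = bits_ | other.bits_;
    return result;
  }
  Flags operator&(Flags other) const {
    Flags result;
    result.bits_ = bits_ & other.bits_;
    return result;
  }
  bool operator==(Flags other) const { return bits_ == other.bits_; }

 private:
  std::bitset<static_cast<std::size_t>(Transport::NumTransports)> bits_;
};

// Mirrors the free operator| on two Transport values, which yields a flag set.
inline Flags operator|(Transport a, Transport b) { return Flags(a) | Flags(b); }
```

A caller can then write Transport::CudaIpc | Transport::IB0 to request both transports and use has() to test for either one.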
-
class RegisteredMemory
- #include <core.hpp>
Represents a block of memory that has been registered to a Context.
Public Functions
-
RegisteredMemory() = default
Default constructor.
-
~RegisteredMemory()
Destructor.
-
void *data() const
Get a pointer to the memory block.
- Returns:
A pointer to the memory block.
-
void *originalDataPtr() const
Get a pointer to the original memory block.
- Returns:
A pointer to the original memory block.
-
size_t size()
Get the size of the memory block.
- Returns:
The size of the memory block.
-
TransportFlags transports()
Get the transport flags associated with the memory block.
- Returns:
The transport flags associated with the memory block.
-
std::vector<char> serialize()
Serialize the RegisteredMemory object to a vector of characters.
- Returns:
A vector of characters representing the serialized RegisteredMemory object.
Public Static Functions
-
static RegisteredMemory deserialize(const std::vector<char> &data)
Deserialize a RegisteredMemory object from a vector of characters.
- Parameters:
data – A vector of characters representing a serialized RegisteredMemory object.
- Returns:
A deserialized RegisteredMemory object.
-
class Endpoint
- #include <core.hpp>
Represents one end of a connection.
Public Functions
-
Endpoint() = default
Default constructor.
-
class Connection
- #include <core.hpp>
Represents a connection between two processes.
Public Functions
-
virtual void write(RegisteredMemory dst, uint64_t dstOffset, RegisteredMemory src, uint64_t srcOffset, uint64_t size) = 0
Write data from a source RegisteredMemory to a destination RegisteredMemory.
- Parameters:
dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – The source RegisteredMemory.
srcOffset – The offset in bytes from the start of the source RegisteredMemory.
size – The number of bytes to write.
-
virtual void updateAndSync(RegisteredMemory dst, uint64_t dstOffset, uint64_t *src, uint64_t newValue) = 0
Update an 8-byte value in a destination RegisteredMemory and synchronize the change with the remote process.
- Parameters:
dst – The destination RegisteredMemory.
dstOffset – The offset in bytes from the start of the destination RegisteredMemory.
src – A pointer to the value to update.
newValue – The new value to write.
-
virtual void flush(int64_t timeoutUsec = 3e7) = 0
Flush any pending writes to the remote process.
-
virtual Transport transport() = 0
Get the transport used by the local process.
- Returns:
The transport used by the local process.
-
virtual Transport remoteTransport() = 0
Get the transport used by the remote process.
- Returns:
The transport used by the remote process.
-
std::string getTransportName()
Get the name of the transport used for this connection.
- Returns:
The transport names in the form "<name of transport()> -> <name of remoteTransport()>".
-
struct EndpointConfig
- #include <core.hpp>
Used to configure an endpoint.
Public Functions
-
inline EndpointConfig()
Default constructor. Sets transport to Transport::Unknown.
-
class Context
- #include <core.hpp>
Represents a context for communication. This provides a low-level interface for forming connections in use-cases where the process group abstraction offered by Communicator is not suitable, e.g., ephemeral client-server connections. Correct use of this class requires external synchronization when finalizing connections with the connect() method.
As an example, a client-server scenario where the server will write to the client might proceed as follows:
The client creates an endpoint with createEndpoint() and sends it to the server.
The server receives the client endpoint, creates its own endpoint with createEndpoint(), sends it to the client, and creates a connection with connect().
The client receives the server endpoint, creates a connection with connect() and sends a RegisteredMemory to the server.
The server receives the RegisteredMemory and writes to it using the previously created connection. Because the client creates its connection before sending the RegisteredMemory, the server cannot write to the RegisteredMemory before the connection is established.
While some transports may have more relaxed implementation behavior, this should not be relied upon.
Public Functions
-
Context()
Create a context.
-
~Context()
Destroy the context.
-
RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)
Register a region of GPU memory for use in this context.
- Parameters:
ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.
- Returns:
RegisteredMemory A handle to the buffer.
-
Endpoint createEndpoint(EndpointConfig config)
Create an endpoint for establishing connections.
- Parameters:
config – The configuration for the endpoint.
- Returns:
The newly created endpoint.
-
std::shared_ptr<Connection> connect(Endpoint localEndpoint, Endpoint remoteEndpoint)
Establish a connection between two endpoints. While this method immediately returns a connection object, the connection is only safe to use after the corresponding connection on the remote endpoint has been established. This method must be called on both endpoints to establish a connection.
- Parameters:
localEndpoint – The local endpoint.
remoteEndpoint – The remote endpoint.
- Returns:
std::shared_ptr<Connection> A shared pointer to the connection.
-
struct Setuppable
- #include <core.hpp>
A base class for objects that can be set up during Communicator::setup().
Public Functions
Called inside Communicator::setup() before any call to endSetup() of any Setuppable object that is being set up within the same Communicator::setup() call.
- Parameters:
bootstrap – A shared pointer to the bootstrap implementation.
Called inside Communicator::setup() after all calls to beginSetup() of all Setuppable objects that are being set up within the same Communicator::setup() call.
- Parameters:
bootstrap – A shared pointer to the bootstrap implementation.
-
template<typename T>
class NonblockingFuture
- #include <core.hpp>
A non-blocking future that can be used to check if a value is ready and retrieve it.
Public Functions
-
NonblockingFuture() = default
Default constructor.
Constructor that takes a shared future and moves it into the NonblockingFuture.
- Parameters:
future – The shared future to move.
-
inline bool ready() const
Check if the value is ready to be retrieved.
- Returns:
True if the value is ready, false otherwise.
-
class Communicator
- #include <core.hpp>
A class that sets up all registered memories and connections between processes.
A typical way to use this class:
Call connectOnSetup() to declare connections between the calling process with other processes.
Call registerMemory() to register memory regions that will be used for communication.
Call sendMemoryOnSetup() or recvMemoryOnSetup() to send/receive registered memory regions to/from other processes.
Call setup() to set up all registered memories and connections declared in the previous steps.
Call NonblockingFuture<RegisteredMemory>::get() to get the registered memory regions received from other processes.
All done; use connections and registered memories to build channels.
Public Functions
Initializes the communicator with a given bootstrap implementation.
- Parameters:
bootstrap – An implementation of the Bootstrap that the communicator will use.
context – An optional context to use for the communicator. If not provided, a new context will be created.
-
~Communicator()
Destroy the communicator.
-
std::shared_ptr<Bootstrap> bootstrap()
Returns the bootstrap held by this communicator.
- Returns:
std::shared_ptr<Bootstrap> The bootstrap held by this communicator.
-
std::shared_ptr<Context> context()
Returns the context held by this communicator.
- Returns:
std::shared_ptr<Context> The context held by this communicator.
-
RegisteredMemory registerMemory(void *ptr, size_t size, TransportFlags transports)
Register a region of GPU memory for use in this communicator’s context.
- Parameters:
ptr – Base pointer to the memory.
size – Size of the memory region in bytes.
transports – Transport flags.
- Returns:
RegisteredMemory A handle to the buffer.
-
void sendMemoryOnSetup(RegisteredMemory memory, int remoteRank, int tag)
Send information of a registered memory to the remote side on setup.
This function registers a send to a remote process that will happen by a following call of setup(). The send will carry information about a registered memory on the local process.
- Parameters:
memory – The registered memory buffer to send information about.
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the send.
-
NonblockingFuture<RegisteredMemory> recvMemoryOnSetup(int remoteRank, int tag)
Receive memory on setup.
This function registers a receive from a remote process that will happen by a following call of setup(). The receive will carry information about a registered memory on the remote process.
- Parameters:
remoteRank – The rank of the remote process.
tag – The tag to use for identifying the receive.
- Returns:
NonblockingFuture<RegisteredMemory> A non-blocking future of registered memory.
-
NonblockingFuture<std::shared_ptr<Connection>> connectOnSetup(int remoteRank, int tag, EndpointConfig localConfig)
Connect to a remote rank on setup.
This function only prepares metadata for the connection. The actual connection is made by a following call of setup(). Note that this function is two-way: a connection from rank i to remote rank j needs a counterpart from rank j to rank i. Note also that with IB, buffers are registered at page granularity: if a buffer is spread across multiple pages and does not fully utilize all of them, IB’s QP has to register all involved pages. This is a potential security risk if the connection’s access is given to a malicious process.
- Parameters:
remoteRank – The rank of the remote process.
tag – The tag of the connection for identifying it.
localConfig – The configuration for the local endpoint.
- Returns:
NonblockingFuture<std::shared_ptr<Connection>> A non-blocking future of a shared pointer to the connection.
-
int remoteRankOf(const Connection &connection)
Get the remote rank a connection is connected to.
- Parameters:
connection – The connection to get the remote rank for.
- Returns:
The remote rank the connection is connected to.
-
int tagOf(const Connection &connection)
Get the tag a connection was made with.
- Parameters:
connection – The connection to get the tag for.
- Returns:
The tag the connection was made with.
Add a custom Setuppable object to a list of objects to be setup later, when setup() is called.
- Parameters:
setuppable – A shared pointer to the Setuppable object.
-
void setup()
Set up all objects that have registered for setup.
This includes previous calls of sendMemoryOnSetup(), recvMemoryOnSetup(), connectOnSetup(), and onSetup(). This function may be called multiple times; the n-th call only sets up objects that have been registered after the (n-1)-th call.
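The "only objects registered since the previous call" behavior can be modeled with a pending list that is drained on each call. This is a sketch of that pattern, not the Communicator implementation: DeferredSetup is a hypothetical class, and its onSetup takes a plain callable for brevity rather than a Setuppable object.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of deferred setup: work is queued first, then run by setup().
class DeferredSetup {
 public:
  // Queue an action for the next setup() call.
  void onSetup(std::function<void()> action) { pending_.push_back(std::move(action)); }

  // Run everything queued since the previous setup() call, then forget it,
  // so a later setup() call does not repeat the same work.
  void setup() {
    std::vector<std::function<void()>> batch;
    batch.swap(pending_);
    for (auto& action : batch) action();
  }

 private:
  std::vector<std::function<void()>> pending_;
};
```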
-
class BaseError : public std::runtime_error
- #include <errors.hpp>
Base class for all errors thrown by MSCCL++.
Subclassed by mscclpp::CuError, mscclpp::CudaError, mscclpp::Error, mscclpp::IbError, mscclpp::SysError
Public Functions
-
BaseError(const std::string &message, int errorCode)
Constructor for BaseError.
- Parameters:
message – The error message.
errorCode – The error code.
-
explicit BaseError(int errorCode)
Constructor for BaseError.
- Parameters:
errorCode – The error code.
-
int getErrorCode() const
Get the error code.
- Returns:
The error code.
-
const char *what() const noexcept override
Get the error message.
- Returns:
The error message.
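Because BaseError derives from std::runtime_error, a handler written against the standard exception types still catches it, and the numeric code remains reachable through the derived type. The sketch below shows that shape with a hypothetical SketchError class; the message and the code value 5 are made up for illustration and are not library values.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in mirroring the documented shape of BaseError.
class SketchError : public std::runtime_error {
 public:
  SketchError(const std::string& message, int errorCode)
      : std::runtime_error(message), errorCode_(errorCode) {}
  int getErrorCode() const { return errorCode_; }

 private:
  int errorCode_;
};

// Hypothetical helper: throws, catches through the standard base class,
// and reports the numeric code seen by the handler (-1 if unavailable).
int codeFromFailure() {
  try {
    throw SketchError("connection timed out", 5);
  } catch (const std::runtime_error& e) {
    const auto* err = dynamic_cast<const SketchError*>(&e);
    return err != nullptr ? err->getErrorCode() : -1;
  }
}
```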
-
class SysError : public mscclpp::BaseError
- #include <errors.hpp>
An error from a system call that sets errno.
-
class CudaError : public mscclpp::BaseError
- #include <errors.hpp>
An error from a CUDA runtime library call.
-
class CuError : public mscclpp::BaseError
- #include <errors.hpp>
An error from a CUDA driver library call.
-
class IbError : public mscclpp::BaseError
- #include <errors.hpp>
An error from an ibverbs library call.
-
class ExecutionPlan
- #include <executor.hpp>
-
class Executor
- #include <executor.hpp>
-
class Fifo
- #include <fifo.hpp>
A class representing a host proxy FIFO that can consume work elements pushed by device threads.
Public Functions
-
Fifo(int size = DEFAULT_FIFO_SIZE)
Constructs a new Fifo object.
- Parameters:
size – The number of entries in the FIFO.
-
ProxyTrigger poll()
Polls the FIFO for a trigger.
- Returns:
The ProxyTrigger at the head of the FIFO.
-
void pop()
Pops a trigger from the FIFO.
-
void flushTail(bool sync = false)
Flushes the tail of the FIFO.
- Parameters:
sync – If true, waits for the flush to complete before returning.
-
int size() const
Return the FIFO size.
- Returns:
The FIFO size.
-
FifoDeviceHandle deviceHandle()
Returns a FifoDeviceHandle object representing the device FIFO.
- Returns:
A FifoDeviceHandle object representing the device FIFO.
-
struct ProxyTrigger
- #include <fifo_device.hpp>
A struct representing a pair of 64-bit unsigned integers used as a trigger for the proxy.
This struct is used as a work element in the concurrent FIFO where multiple device threads can push ProxyTrigger elements and a single host proxy thread consumes these work elements.
Do not use the most significant bit of snd, as it is reserved for memory consistency purposes.
-
struct FifoDeviceHandle
- #include <fifo_device.hpp>
A concurrent FIFO where multiple device threads (the number of threads should not exceed the fifo size) can push work elements and a single host proxy thread consumes them.
The FIFO has a head pointer allocated on the device which starts at 0 and goes up to 2^64-1, which is effectively unbounded. There are two copies of the tail, one on the device, FifoDeviceHandle::tailReplica, and another on the host, namely, hostTail. The host always has the “true” tail and occasionally pushes it to the copy on the device. Therefore, most of the time, the device has a stale version. The invariant is tailReplica <= hostTail <= head. The push() function increments head, Fifo::pop() updates hostTail, and the host occasionally flushes hostTail to tailReplica via Fifo::flushTail().
Duplicating the tail is a good idea because the FIFO is large enough, and we do not need frequent updates for the tail as there is usually enough space for device threads to push their work into.
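The counter relationships described above can be exercised with a single-threaded model. This is a sketch under stated assumptions, not the device code: FifoCounters is a hypothetical struct that tracks only the three counters, stores no triggers, and reports a full FIFO by returning false where a real pusher would presumably wait.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical single-threaded model of head, hostTail and tailReplica.
struct FifoCounters {
  std::uint64_t head = 0;         // advanced by pushes from the device side
  std::uint64_t hostTail = 0;     // authoritative tail, advanced by pops on the host
  std::uint64_t tailReplica = 0;  // possibly stale device-visible copy of the tail
  std::uint64_t size;

  explicit FifoCounters(std::uint64_t fifoSize) : size(fifoSize) {}

  // A pusher only sees tailReplica, so it must treat the FIFO as being
  // as full as the stale copy implies.
  bool tryPush() {
    if (head - tailReplica >= size) return false;
    ++head;
    return true;
  }

  // The host consumes one element, if any is available.
  bool pop() {
    if (hostTail == head) return false;
    ++hostTail;
    return true;
  }

  // Publish the authoritative tail to the device-visible copy.
  void flushTail() { tailReplica = hostTail; }

  bool invariantHolds() const { return tailReplica <= hostTail && hostTail <= head; }
};
```

In this model a pop alone does not free a slot for pushers; the slot becomes visible only after flushTail(), which matches the stale-replica behavior described above.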
Public Members
-
ProxyTrigger *triggers
The FIFO buffer that is allocated on the host via cudaHostAlloc().
-
uint64_t *tailReplica
Replica of the FIFO tail that is allocated on device.
-
uint64_t *head
The FIFO head. Allocated on the device and only accessed by the device.
-
int size
The FIFO size.
-
struct AvoidCudaGraphCaptureGuard
- #include <gpu_utils.hpp>
A RAII guard that calls cudaThreadExchangeStreamCaptureMode to switch to cudaStreamCaptureModeRelaxed on construction and restores the previous mode on destruction. This is helpful when we want to avoid CUDA graph capture.
-
struct CudaStreamWithFlags
- #include <gpu_utils.hpp>
An RAII wrapper around cudaStream_t that calls cudaStreamDestroy on destruction.
-
template<class T>
struct CudaDeleter
- #include <gpu_utils.hpp>
A deleter that calls cudaFree for use with std::unique_ptr or std::shared_ptr.
- Template Parameters:
T – Type of each element in the allocated memory.
-
template<class T>
struct PhysicalCudaMemory
- #include <gpu_utils.hpp>
-
template<class T>
struct CudaPhysicalDeleter
- #include <gpu_utils.hpp>
-
template<class T>
struct CudaHostDeleter
- #include <gpu_utils.hpp>
A deleter that calls cudaFreeHost for use with std::unique_ptr or std::shared_ptr.
- Template Parameters:
T – Type of each element in the allocated memory.
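How such a deleter plugs into std::unique_ptr can be shown with a host-only analog. This is a sketch under assumptions: std::free stands in for cudaFree or cudaFreeHost, and FreeDeleterSketch, UniqueFreePtrSketch, makeZeroed, and the counter gFreeCount are hypothetical names that exist only to make the effect observable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Counts deleter invocations so the release can be observed.
static int gFreeCount = 0;

// Host-only analog of a CUDA deleter: std::free stands in for cudaFree.
template <class T>
struct FreeDeleterSketch {
  void operator()(T* ptr) const {
    std::free(ptr);
    ++gFreeCount;
  }
};

// Analog of an alias such as UniqueCudaPtr<T>.
template <class T>
using UniqueFreePtrSketch = std::unique_ptr<T, FreeDeleterSketch<T>>;

// Hypothetical helper: allocates nelem zero-initialized elements and hands
// ownership to the smart pointer.
template <class T>
UniqueFreePtrSketch<T> makeZeroed(std::size_t nelem) {
  return UniqueFreePtrSketch<T>(static_cast<T*>(std::calloc(nelem, sizeof(T))));
}
```

The deleter type is part of the unique_ptr type, so the pointer stays the size of a raw pointer when the deleter is an empty struct like this one.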
-
class NvlsConnection
- #include <nvls.hpp>
Public Functions
-
std::shared_ptr<char> bindAllocatedCuda(CUmemGenericAllocationHandle memHandle, size_t size)
The handle to the allocation (its lifetime is managed by the caller) and the size of the allocation.
-
struct DeviceMulticastPointer
- #include <nvls.hpp>
-
struct DeviceMulticastPointerDeviceHandle
- #include <nvls_device.hpp>
Device-side handle for DeviceMulticastPointer.
-
union LL16Packet
- #include <packet_device.hpp>
LL (low latency) protocol packet.
Public Types
-
using Payload = uint2
-
class Proxy
- #include <proxy.hpp>
-
class BaseProxyService
- #include <proxy_channel.hpp>
Base class for proxy services. Proxy services are used to proxy data between devices.
Subclassed by mscclpp::ProxyService
-
class ProxyService : public mscclpp::BaseProxyService
- #include <proxy_channel.hpp>
Proxy service implementation.
Public Functions
-
ProxyService(size_t fifoSize = DEFAULT_FIFO_SIZE)
Constructor.
Build and add a semaphore to the proxy service.
- Parameters:
connection – The connection associated with the semaphore.
- Returns:
The ID of the semaphore.
Add a semaphore to the proxy service.
- Parameters:
semaphore – The semaphore to be added
- Returns:
The ID of the semaphore.
-
MemoryId addMemory(RegisteredMemory memory)
Register a memory region with the proxy service.
- Parameters:
memory – The memory region to register.
- Returns:
The ID of the memory region.
-
std::shared_ptr<Host2DeviceSemaphore> semaphore(SemaphoreId id) const
Get a semaphore by ID.
- Parameters:
id – The ID of the semaphore.
- Returns:
The semaphore.
-
ProxyChannel proxyChannel(SemaphoreId id)
Get a proxy channel by semaphore ID.
- Parameters:
id – The ID of the semaphore.
- Returns:
The proxy channel.
-
virtual void startProxy()
Start the proxy service.
-
virtual void stopProxy()
Stop the proxy service.
-
struct ProxyChannel
- #include <proxy_channel.hpp>
Proxy channel.
Public Types
-
using DeviceHandle = ProxyChannelDeviceHandle
Device-side handle for ProxyChannel.
Public Functions
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
The user must ensure that the ProxyChannel is not released while the returned handle is in use.
-
struct SimpleProxyChannel
- #include <proxy_channel.hpp>
Simple proxy channel with a single destination and source memory region.
Public Types
-
using DeviceHandle = SimpleProxyChannelDeviceHandle
Device-side handle for SimpleProxyChannel.
Public Functions
-
SimpleProxyChannel() = default
Default constructor.
-
SimpleProxyChannel(ProxyChannel proxyChan, MemoryId dst, MemoryId src)
Constructor.
- Parameters:
proxyChan – The proxy channel.
dst – The destination memory region.
src – The source memory region.
-
inline SimpleProxyChannel(ProxyChannel proxyChan)
Constructor.
- Parameters:
proxyChan – The proxy channel.
-
SimpleProxyChannel(const SimpleProxyChannel &other) = default
Copy constructor.
-
SimpleProxyChannel &operator=(SimpleProxyChannel &other) = default
Assignment operator.
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
The user must ensure that the SimpleProxyChannel is not released while the returned handle is in use.
-
union ChannelTrigger
- #include <proxy_channel_device.hpp>
Basic structure of each work element in the FIFO.
-
struct ProxyChannelDeviceHandle
- #include <proxy_channel_device.hpp>
-
struct SimpleProxyChannelDeviceHandle
- #include <proxy_channel_device.hpp>
-
template<template<typename> typename InboundDeleter, template<typename> typename OutboundDeleter>
class BaseSemaphore
- #include <semaphore.hpp>
A base class for semaphores.
A semaphore is a synchronization mechanism that allows the local peer to wait for the remote peer to complete a data transfer. The local peer signals the remote peer that it has completed a data transfer by incrementing the outbound semaphore ID. The incremented value is copied to the remote peer’s inbound semaphore ID, so the remote peer can wait for the local peer to complete a data transfer. Conversely, the remote peer signals the local peer by incrementing its own outbound semaphore ID and copying the incremented value to the local peer’s inbound semaphore ID.
- Template Parameters:
InboundDeleter – The deleter for inbound semaphore IDs. This is either std::default_delete for host memory or CudaDeleter for device memory.
OutboundDeleter – The deleter for outbound semaphore IDs. This is either std::default_delete for host memory or CudaDeleter for device memory.
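The counter exchange described above can be modeled with two peers in a single process. This is a sketch under assumptions, not the library's code: PeerSketch is a hypothetical type, the copy to the remote side is a plain assignment rather than a transfer over a connection, and comparing the inbound value against the expected value is inferred from the parameter names rather than stated in the text.

```cpp
#include <cstdint>

// In-process model of one side of a semaphore pair.
struct PeerSketch {
  uint64_t inbound = 0;          // last value copied in by the remote peer
  uint64_t expectedInbound = 0;  // number of signals already consumed locally
  uint64_t outbound = 0;         // number of signals sent so far
  PeerSketch* remote = nullptr;  // the other side of the pair

  // Increments the outbound counter and copies it to the remote inbound slot.
  void signal() {
    ++outbound;
    remote->inbound = outbound;
  }

  // Consumes one pending signal if the remote peer is ahead of what this
  // peer has already observed.
  bool poll() {
    if (inbound > expectedInbound) {
      ++expectedInbound;
      return true;
    }
    return false;
  }
};
```

Because the counters only grow, several signals sent back to back are not lost in this model: each one can still be consumed by a later poll().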
Public Functions
-
inline BaseSemaphore(std::unique_ptr<uint64_t, InboundDeleter<uint64_t>> localInboundSemaphoreId, std::unique_ptr<uint64_t, InboundDeleter<uint64_t>> expectedInboundSemaphoreId, std::unique_ptr<uint64_t, OutboundDeleter<uint64_t>> outboundSemaphoreId)
Constructs a BaseSemaphore.
- Parameters:
localInboundSemaphoreId – The inbound semaphore ID
expectedInboundSemaphoreId – The expected inbound semaphore ID
outboundSemaphoreId – The outbound semaphore ID
-
class Host2DeviceSemaphore : public mscclpp::BaseSemaphore<CudaDeleter, std::default_delete>
- #include <semaphore.hpp>
A semaphore for sending signals from the host to the device.
Public Types
-
using DeviceHandle = Host2DeviceSemaphoreDeviceHandle
Device-side handle for Host2DeviceSemaphore.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore.
-
std::shared_ptr<Connection> connection()
Returns the connection.
- Returns:
The connection associated with this semaphore.
-
void signal()
Signal the device.
-
DeviceHandle deviceHandle()
Returns the device-side handle.
-
class Host2HostSemaphore : public mscclpp::BaseSemaphore<std::default_delete, std::default_delete>
- #include <semaphore.hpp>
A semaphore for sending signals from the local host to a remote host.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore. Transport::CudaIpc is not allowed for Host2HostSemaphore.
-
std::shared_ptr<Connection> connection()
Returns the connection.
- Returns:
The connection associated with this semaphore.
-
void signal()
Signal the remote host.
-
bool poll()
Check if the remote host has signaled.
- Returns:
true if the remote host has signaled.
-
void wait(int64_t maxSpinCount = 10000000)
Wait for the remote host to signal.
- Parameters:
maxSpinCount – The maximum number of spin counts before throwing an exception. Never throws if negative.
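The bounded spin described for wait() can be sketched as a free function over any poll callback. The name spinWaitSketch, the use of std::runtime_error, and the exact point at which the count is checked are assumptions for illustration; only the default of 10000000 and the rule that a negative limit never throws come from the text above.

```cpp
#include <cstdint>
#include <functional>
#include <stdexcept>

// Spins on `poll` until it returns true. A negative maxSpinCount disables
// the limit. std::runtime_error is a stand-in for the library's own error.
inline void spinWaitSketch(const std::function<bool()>& poll,
                           int64_t maxSpinCount = 10000000) {
  int64_t spinCount = 0;
  while (!poll()) {
    // Give up once the number of failed polls exceeds the limit.
    if (maxSpinCount >= 0 && spinCount++ == maxSpinCount) {
      throw std::runtime_error("spinWaitSketch: spin limit exceeded");
    }
  }
}
```

A finite limit turns a peer that never signals into an error the caller can handle, while a negative limit suits cases where blocking indefinitely is acceptable.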
-
class SmDevice2DeviceSemaphore : public mscclpp::BaseSemaphore<CudaDeleter, CudaDeleter>
- #include <semaphore.hpp>
A semaphore for sending signals from the local device to a peer device via SM.
Public Types
-
using DeviceHandle = SmDevice2DeviceSemaphoreDeviceHandle
Device-side handle for SmDevice2DeviceSemaphore.
Public Functions
Constructor.
- Parameters:
communicator – The communicator.
connection – The connection associated with this semaphore.
-
SmDevice2DeviceSemaphore() = delete
Deleted default constructor.
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
-
struct Host2DeviceSemaphoreDeviceHandle
- #include <semaphore_device.hpp>
Device-side handle for Host2DeviceSemaphore.
-
struct SmDevice2DeviceSemaphoreDeviceHandle
- #include <semaphore_device.hpp>
Device-side handle for SmDevice2DeviceSemaphore.
-
struct SmChannel
- #include <sm_channel.hpp>
Channel for accessing peer memory directly from SM.
Public Types
-
using DeviceHandle = SmChannelDeviceHandle
Device-side handle for SmChannel.
Public Functions
-
SmChannel() = default
Constructor.
Constructor.
- Parameters:
semaphore – The semaphore used to synchronize the communication.
dst – Registered memory of the destination.
src – The source memory address.
getPacketBuffer – The optional buffer used for getPackets().
-
DeviceHandle deviceHandle() const
Returns the device-side handle.
The user must ensure that the SmChannel is not released while the returned handle is in use.
-
struct SmChannelDeviceHandle
- #include <sm_channel_device.hpp>
Channel for accessing peer memory directly from SM.
-
struct Timer
- #include <utils.hpp>
Subclassed by mscclpp::ScopedTimer
Public Functions
-
int64_t elapsed() const
Returns the elapsed time in microseconds.
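A timer with this interface can be sketched on top of std::chrono. TimerSketch is a hypothetical name, and the choice of std::chrono::steady_clock and of starting the clock in the constructor are assumptions; the reference text states only that elapsed() returns microseconds.

```cpp
#include <chrono>
#include <cstdint>

// Minimal timer sketch: records a start point on construction and reports
// the time since then in microseconds.
class TimerSketch {
 public:
  TimerSketch() : start_(std::chrono::steady_clock::now()) {}

  int64_t elapsed() const {
    auto delta = std::chrono::steady_clock::now() - start_;
    return std::chrono::duration_cast<std::chrono::microseconds>(delta).count();
  }

 private:
  std::chrono::steady_clock::time_point start_;
};
```

A monotonic clock is used here so that successive readings never go backwards, even if the system time is adjusted.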
-
namespace detail
Typedefs
-
using TransportFlagsBase = std::bitset<TransportFlagsSize>
Bitset for storing transport flags.
Functions
-
template<class T>
T *cudaCalloc(size_t nelem)
A wrapper of cudaMalloc that sets the allocated memory to zero.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
nelem – Number of elements to allocate.
- Returns:
A pointer to the allocated memory.
-
template<class T>
PhysicalCudaMemory<T> *cudaPhysicalCalloc(size_t nelem, size_t gran)
-
template<class T>
T *cudaHostCalloc(size_t nelem)
A wrapper of cudaHostAlloc that sets the allocated memory to zero.
- Template Parameters:
T – Type of each element in the allocated memory.
- Parameters:
nelem – Number of elements to allocate.
- Returns:
A pointer to the allocated memory.
-
template<class T, T*, class Deleter, class Memory>
Memory safeAlloc(size_t nelem)
A template function that allocates memory while ensuring that the memory will be freed when the returned object is destroyed.
- Template Parameters:
T – Type of each element in the allocated memory.
alloc – A function that allocates memory.
Deleter – A deleter that will be used to free the allocated memory.
Memory – The type of the returned object.
- Parameters:
nelem – Number of elements to allocate.
- Returns:
An object of type Memory that will free the allocated memory when destroyed.
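The shape of such a helper can be sketched on the host. Everything below is an illustration under assumptions: the allocator is taken as a function-pointer template parameter because the parameter list names an alloc function, std::calloc and std::free stand in for the CUDA calls, and hostCallocSketch, HostFreeSketch, and safeAllocSketch are hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Host stand-in for a zeroing allocator such as cudaCalloc.
template <class T>
T* hostCallocSketch(std::size_t nelem) {
  T* ptr = static_cast<T*>(std::calloc(nelem, sizeof(T)));
  if (ptr == nullptr) throw std::bad_alloc();
  return ptr;
}

// Host stand-in for a deleter such as CudaDeleter.
template <class T>
struct HostFreeSketch {
  void operator()(T* ptr) const { std::free(ptr); }
};

// Allocates with Alloc and immediately wraps the result in an owning object
// of type Memory, so the memory is released when that object is destroyed.
template <class T, T* (*Alloc)(std::size_t), class Deleter, class Memory>
Memory safeAllocSketch(std::size_t nelem) {
  T* ptr = Alloc(nelem);
  return Memory(ptr, Deleter());
}
```

With this shape, the same function can return either a std::unique_ptr or a std::shared_ptr, since both can be constructed from a raw pointer and a deleter.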
Variables
-
const size_t TransportFlagsSize = 12
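The size of 12 matches the number of Transport enumerators that precede NumTransports. A sketch of how a bitset of that width could back the flag operations follows. The mapping of one bit per enumerator is an assumption, and TransportSketch, FlagsSketch, toFlags, and has are hypothetical names that mirror, but are not, the library's types.

```cpp
#include <bitset>
#include <cstddef>

// Mirror of the Transport enumerators, in the order listed in the reference.
enum class TransportSketch {
  Unknown, CudaIpc, Nvls,
  IB0, IB1, IB2, IB3, IB4, IB5, IB6, IB7,
  Ethernet, NumTransports
};

constexpr std::size_t kFlagsSize = 12;
static_assert(static_cast<std::size_t>(TransportSketch::NumTransports) == kFlagsSize,
              "one bit per transport");

using FlagsSketch = std::bitset<kFlagsSize>;

// Assumed mapping: each transport owns the bit at its enumerator value.
inline FlagsSketch toFlags(TransportSketch t) {
  FlagsSketch flags;
  flags.set(static_cast<std::size_t>(t));
  return flags;
}

// Combining two transports yields a flag set with both bits set.
inline FlagsSketch operator|(TransportSketch a, TransportSketch b) {
  return toFlags(a) | toFlags(b);
}

inline bool has(const FlagsSketch& flags, TransportSketch t) {
  return flags.test(static_cast<std::size_t>(t));
}
```

Deriving the width from the enumerator count, as the static_assert checks here, keeps the bitset and the enum from drifting apart when a transport is added.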