mscclpp.language.channel

Classes

MemoryChannel(dst_rank, src_rank)

A memory channel for direct memory access communication between GPUs.

PortChannel(dst_rank, src_rank)

A port channel for communication using port-based mechanisms between GPUs.

SwitchChannel(rank_list, buffer_type)

A switch channel for collective communication operations among multiple GPUs.

class mscclpp.language.channel.MemoryChannel(dst_rank, src_rank)

Bases: object

A memory channel for direct memory access communication between GPUs.

MemoryChannel enables direct memory access between GPUs through memory mapping, providing high-performance communication for operations like put, get, and reduce. Each channel connects a source rank to a destination rank.

Parameters:
  • dst_rank (int)

  • src_rank (int)

channel_id

Unique identifier for this channel within the source rank.

Type:

int

dst_rank

The destination rank for communication operations.

Type:

int

src_rank

The source rank for communication operations.

Type:

int

channel_type

The type of channel (memory).

Type:

ChannelType

get(dst_chunk, src_chunk, tb=None, tb_group=None)

Retrieve data from remote memory to local memory.

Performs a get operation to copy data from the destination rank’s memory to the source rank’s local memory through the memory channel.

Parameters:
  • dst_chunk (Chunk) – The destination chunk on the source rank where data will be stored.

  • src_chunk (Chunk) – The source chunk on the destination rank to retrieve data from.

  • tb (int, optional) – The thread block ID that will execute this get operation. Defaults to None.

  • tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this get operation. Defaults to None.

Raises:

RuntimeError – If chunk ranks don’t match the channel configuration.

Example

>>> channel.get(dst_chunk, src_chunk, tb=0)
put(dst_chunk, src_chunk, tb=None, tb_group=None)

Send data from local memory to remote memory.

Performs a put operation to copy data from the source rank’s local memory to the destination rank’s memory through the memory channel.

Parameters:
  • dst_chunk (Chunk) – The destination chunk on the destination rank where data will be stored.

  • src_chunk (Chunk) – The source chunk on the source rank to send data from.

  • tb (int, optional) – The thread block ID that will execute this put operation. Defaults to None.

  • tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this put operation. Defaults to None.

Raises:

RuntimeError – If chunk ranks don’t match the channel configuration or if chunk sizes don’t match.

Example

>>> channel.put(dst_chunk, src_chunk, tb=0)
put_packets(dst_chunk, src_chunk, tb=None, tb_group=None)

Transfer data from local buffer to remote scratch buffer in packet format.

Performs a put operation that transfers data from the source rank’s buffer to the destination rank’s scratch buffer in packet format. The destination chunk must be a scratch buffer.

Parameters:
  • dst_chunk (Chunk) – The destination scratch chunk on the destination rank.

  • src_chunk (Chunk) – The source chunk on the source rank.

  • tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.

  • tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.

Raises:

RuntimeError – If chunk ranks don’t match channel configuration, if destination chunk is not a scratch buffer, or if chunk sizes don’t match.

Example

>>> channel.put_packets(dst_chunk, src_chunk, tb=0)
read_put_packets(dst_chunk, src_chunk, tb=None, tb_group=None)

Transfer data in packet format from local to remote scratch buffer.

Performs a specialized put operation that transfers data in packet format from the source rank’s scratch buffer to the destination rank’s scratch buffer. Both source and destination chunks must be scratch buffers.

Parameters:
  • dst_chunk (Chunk) – The destination scratch chunk on the destination rank.

  • src_chunk (Chunk) – The source scratch chunk on the source rank.

  • tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.

  • tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.

Raises:

RuntimeError – If chunk ranks don’t match channel configuration, if chunks are not scratch buffers, or if chunk sizes don’t match.

Example

>>> channel.read_put_packets(dst_chunk, src_chunk, tb=0)
reduce(local_src_chunk, remote_src_chunks, tb=None, tb_group=None, local_dst_chunk=None, reduce_op=ReduceOperationType.sum)

Perform a reduction operation combining local and remote data.

Reduces data from multiple remote source chunks with a local source chunk, storing the result in a local destination chunk. If no destination chunk is specified, the result is stored in the local source chunk.

Parameters:
  • local_src_chunk (Chunk) – The local source chunk on the source rank.

  • remote_src_chunks (List[Chunk]) – List of remote source chunks to reduce with.

  • tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.

  • tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.

  • local_dst_chunk (Chunk, optional) – The local destination chunk. If None, uses local_src_chunk as destination. Defaults to None.

  • reduce_op (ReduceOperationType, optional) – The reduction operation to perform. Defaults to ReduceOperationType.sum.

Raises:

RuntimeError – If chunk ranks don’t match channel configuration or if chunk sizes are inconsistent.

Example

>>> channel.reduce(local_chunk, remote_chunks, tb=0)
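The reduce semantics above can be sketched as a plain-Python function: combine a local source chunk with remote source chunks element-wise, writing back into the local source when no destination is given. This is only an illustration, not the mscclpp API — toy_reduce and its plain-list "chunks" are invented here, and only addition (the documented default, ReduceOperationType.sum) is modeled.

```python
def toy_reduce(local_src, remote_srcs, local_dst=None):
    # Model of MemoryChannel.reduce with the default sum operation.
    # If no destination chunk is given, the result overwrites local_src,
    # mirroring the documented default behavior.
    if local_dst is None:
        local_dst = local_src
    for i in range(len(local_src)):
        local_dst[i] = local_src[i] + sum(r[i] for r in remote_srcs)
    return local_dst

a = [1, 2]
out = toy_reduce(a, [[10, 20], [100, 200]])
```

Note that because the destination defaults to the local source chunk, `a` itself holds the reduced values after the call.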
signal(tb, data_sync=SyncType.both, relaxed=False)

Send a signal through the memory channel.

Signals notify the destination that data is ready or an operation has completed. This is used for synchronization between ranks.

Parameters:
  • tb (int) – The thread block ID that will execute this signal operation.

  • data_sync (SyncType, optional) – Specifies whether the threads in the thread block are synchronized (equivalent to __syncthreads()) before the signal operation, after it, or both. Defaults to SyncType.both.

  • relaxed (bool, optional) – Whether to use relaxed memory ordering. Defaults to False.

Example

>>> channel.signal(tb=0, data_sync=SyncType.before)
wait(tb, data_sync=SyncType.both, relaxed=False)

Wait for a signal through the memory channel.

Waits for a signal from the destination rank, typically used for synchronization to ensure operations are completed before proceeding.

Parameters:
  • tb (int) – The thread block ID that will execute this wait operation.

  • data_sync (SyncType, optional) – Specifies whether the threads in the thread block are synchronized (equivalent to __syncthreads()) before the wait operation, after it, or both. Defaults to SyncType.both.

  • relaxed (bool, optional) – Whether to use relaxed memory ordering. Defaults to False.

Example

>>> channel.wait(tb=0, data_sync=SyncType.after)
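The put/signal/wait handshake used throughout this class can be modeled in plain Python with threading primitives. The sketch below only illustrates the ordering semantics; remote_mem, ready, sender, and receiver are invented for this model and are not part of the mscclpp API.

```python
import threading

# Plain-Python model of the MemoryChannel put/signal/wait handshake.
remote_mem = {}              # stands in for the destination rank's memory
ready = threading.Event()    # stands in for the channel's signal state
result = []

def sender():
    # channel.put(dst_chunk, src_chunk, tb=...): copy local data to the peer
    remote_mem["chunk"] = [1, 2, 3]
    # channel.signal(tb=...): notify the peer that the data is ready
    ready.set()

def receiver():
    # channel.wait(tb=...): block until the peer signals
    ready.wait()
    # only after the wait completes is it safe to read the transferred data
    result.append(remote_mem["chunk"])

t_recv = threading.Thread(target=receiver)
t_send = threading.Thread(target=sender)
t_recv.start()
t_send.start()
t_send.join()
t_recv.join()
```

In a real program, put writes into the peer's mapped memory, signal publishes completion, and the peer's wait blocks until that signal arrives; the model above preserves only that ordering guarantee.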
class mscclpp.language.channel.PortChannel(dst_rank, src_rank)

Bases: object

A port channel for communication using port-based mechanisms between GPUs.

PortChannel enables communication between GPUs through interconnect ports, supporting operations such as put, signal, wait, and flush. Each channel connects a source rank to a destination rank and is suited to scenarios that require port-mapped data copies and explicit synchronization.

Parameters:
  • dst_rank (int)

  • src_rank (int)

channel_id

Unique identifier for this channel within the source rank.

Type:

int

dst_rank

The destination rank for communication operations.

Type:

int

src_rank

The source rank for communication operations.

Type:

int

channel_type

The type of channel (port).

Type:

ChannelType

flush(tb, data_sync=SyncType.both)

Flush pending operations through the port channel.

Forces completion of all pending operations on the port channel, ensuring data consistency. This operation is only supported for port channels.

Parameters:
  • tb (int) – The thread block ID that will execute this flush operation.

  • data_sync (SyncType, optional) – Specifies whether the threads in the thread block are synchronized (equivalent to __syncthreads()) before the flush operation, after it, or both. Defaults to SyncType.both.

Example

>>> channel.flush(tb=0, data_sync=SyncType.after)
put(dst_chunk, src_chunk, tb)

Send data from local memory to remote memory through the port channel.

Performs a put operation to copy data from the source rank’s local memory to the destination rank’s memory through the port channel.

Parameters:
  • dst_chunk (Chunk) – The destination chunk on the destination rank where data will be stored.

  • src_chunk (Chunk) – The source chunk on the source rank to send data from.

  • tb (int) – The thread block ID that will execute this put operation.

Raises:

RuntimeError – If chunk ranks don’t match the channel configuration or if chunk sizes don’t match.

Example

>>> channel.put(dst_chunk, src_chunk, tb=0)
put_with_signal(dst_chunk, src_chunk, tb)

Send data from local memory to remote memory with automatic signaling.

Performs a put operation that transfers data and automatically sends a signal to notify the destination that the data transfer is complete. This combines data transfer and synchronization in a single operation.

Parameters:
  • dst_chunk (Chunk) – The destination chunk on the destination rank where data will be stored.

  • src_chunk (Chunk) – The source chunk on the source rank to send data from.

  • tb (int) – The thread block ID that will execute this put operation.

Raises:

RuntimeError – If chunk ranks don’t match the channel configuration or if chunk sizes don’t match.

Example

>>> channel.put_with_signal(dst_chunk, src_chunk, tb=0)
put_with_signal_and_flush(dst_chunk, src_chunk, tb)

Send data from local memory to remote memory with signal and flush.

Performs a put operation that transfers data, automatically sends a signal, and flushes the channel, guaranteeing that the data transfer has completed.

Parameters:
  • dst_chunk (Chunk) – The destination chunk on the destination rank where data will be stored.

  • src_chunk (Chunk) – The source chunk on the source rank to send data from.

  • tb (int) – The thread block ID that will execute this put operation.

Raises:

RuntimeError – If chunk ranks don’t match the channel configuration or if chunk sizes don’t match.

Example

>>> channel.put_with_signal_and_flush(dst_chunk, src_chunk, tb=0)
read_put_packets(dst_chunk, src_chunk, tb)

Transfer data in packet format from local to remote scratch buffer.

Performs a specialized put operation that transfers data in packet format from the source rank’s scratch buffer to the destination rank’s scratch buffer through the port channel. Both source and destination chunks must be scratch buffers.

Parameters:
  • dst_chunk (Chunk) – The destination scratch chunk on the destination rank.

  • src_chunk (Chunk) – The source scratch chunk on the source rank.

  • tb (int) – The thread block ID that will execute this operation.

Raises:

RuntimeError – If chunk ranks don’t match channel configuration, if chunks are not scratch buffers, or if chunk sizes don’t match.

Example

>>> channel.read_put_packets(dst_chunk, src_chunk, tb=0)
signal(tb, data_sync=SyncType.both)

Send a signal through the port channel.

Signals notify the destination that data is ready or an operation has completed. This is used for synchronization between ranks through port-based mechanisms.

Parameters:
  • tb (int) – The thread block ID that will execute this signal operation.

  • data_sync (SyncType, optional) – Specifies whether the threads in the thread block are synchronized (equivalent to __syncthreads()) before the signal operation, after it, or both. Defaults to SyncType.both.

Example

>>> channel.signal(tb=0, data_sync=SyncType.before)
wait(tb, data_sync=SyncType.both)

Wait for a signal through the port channel.

Waits for a signal from the destination rank, typically used for synchronization to ensure operations are completed before proceeding through port-based mechanisms.

Parameters:
  • tb (int) – The thread block ID that will execute this wait operation.

  • data_sync (SyncType, optional) – Specifies whether the threads in the thread block are synchronized (equivalent to __syncthreads()) before the wait operation, after it, or both. Defaults to SyncType.both.

Example

>>> channel.wait(tb=0, data_sync=SyncType.after)
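How put, signal, and flush relate, and how put_with_signal and put_with_signal_and_flush compose them, can be sketched with a toy queue-based model. ToyPortChannel and all of its fields are invented for illustration and are not the mscclpp API; in this model a posted transfer completes only at flush time, purely to make flush's "force completion of pending operations" role visible.

```python
from collections import deque

class ToyPortChannel:
    """Toy model of a PortChannel; not the real mscclpp API."""

    def __init__(self):
        self.pending = deque()   # transfers posted but not yet completed
        self.dst_mem = {}        # stands in for the destination rank's memory
        self.signaled = False

    def put(self, key, data):
        self.pending.append((key, data))   # post the transfer

    def signal(self):
        self.signaled = True               # notify the destination

    def flush(self):
        while self.pending:                # force completion of pending ops
            key, data = self.pending.popleft()
            self.dst_mem[key] = data

    # The fused operations are simply compositions of the primitives:
    def put_with_signal(self, key, data):
        self.put(key, data)
        self.signal()

    def put_with_signal_and_flush(self, key, data):
        self.put(key, data)
        self.signal()
        self.flush()                       # transfer guaranteed complete

ch = ToyPortChannel()
ch.put_with_signal_and_flush("a", [1, 2])
```

After put_with_signal_and_flush returns, the model's destination memory holds the data and the signal has been raised, mirroring the completion guarantee described above.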
class mscclpp.language.channel.SwitchChannel(rank_list, buffer_type)

Bases: object

A switch channel for collective communication operations among multiple GPUs.

SwitchChannel enables collective operations like reduce and broadcast among a group of ranks through a switch-based communication mechanism. It supports operations on shared buffers across multiple ranks in the group.

Parameters:
  • rank_list (List[int])

  • buffer_type (BufferType)

channel_ids

Dictionary mapping ranks to their channel IDs.

Type:

dict

channel_type

The type of channel (switch).

Type:

ChannelType

buffer_type

The type of buffer used for operations.

Type:

BufferType

rank_group

The group of ranks participating in this channel.

Type:

RankGroup

class SwitchChannelRankView(channel, rank)

Bases: object

A rank-specific view of a SwitchChannel for performing operations.

This class provides a convenient interface for performing switch channel operations from the perspective of a specific rank, automatically passing the rank parameter to the underlying channel methods.

_channel

The underlying switch channel.

Type:

SwitchChannel

_rank

The rank this view represents.

Type:

int

broadcast(src_chunk, buffer_offset, size, tb)

Perform a broadcast operation from this rank’s perspective.

Convenience method that calls the underlying channel’s broadcast method with this view’s rank automatically provided.

Parameters:
  • src_chunk (Chunk) – The source chunk containing data to broadcast.

  • buffer_offset (int) – The offset in the destination buffer where data will be stored.

  • size (int) – The size of data to broadcast.

  • tb (int) – The thread block ID that will execute this operation.

Returns:

The result of the underlying channel’s broadcast operation.

Example

>>> rank_view.broadcast(src_chunk=chunk, buffer_offset=0, size=1, tb=0)
reduce(buffer_offset, size, dst_chunk, tb, reduce_op=ReduceOperationType.sum)

Perform a reduction operation from this rank’s perspective.

Convenience method that calls the underlying channel’s reduce method with this view’s rank automatically provided.

Parameters:
  • buffer_offset (int) – The offset in the buffer where reduction data starts.

  • size (int) – The size of data to reduce.

  • dst_chunk (Chunk) – The destination chunk where the result will be stored.

  • tb (int) – The thread block ID that will execute this operation.

  • reduce_op (ReduceOperationType, optional) – The reduction operation to perform. Defaults to ReduceOperationType.sum.

Returns:

The result of the underlying channel’s reduce operation.

Example

>>> rank_view.reduce(buffer_offset=0, size=1, dst_chunk=chunk, tb=0)
at_rank(rank)

Get a rank-specific view of this switch channel.

Returns a SwitchChannelRankView that provides rank-specific operations for reduce and broadcast on this switch channel.

Parameters:

rank (int) – The rank to create a view for.

Returns:

A rank-specific view of this channel.

Return type:

SwitchChannelRankView

Raises:

RuntimeError – If rank is not part of this channel’s rank group.

Example

>>> channel.at_rank(0)
broadcast(rank, src_chunk, buffer_offset, size, tb)

Broadcast data from source chunk to all ranks in the switch channel.

Broadcasts data from the source chunk to the specified buffer region across all ranks in the rank group.

Parameters:
  • rank (int) – The rank that will execute this broadcast operation.

  • src_chunk (Chunk) – The source chunk containing data to broadcast.

  • buffer_offset (int) – The offset in the destination buffer where data will be stored.

  • size (int) – The size of data to broadcast.

  • tb (int) – The thread block ID that will execute this operation.

Raises:

RuntimeError – If src_chunk rank is not in the rank group, if chunk size doesn’t match the required size, or if buffer size is insufficient.

Example

>>> channel.broadcast(rank=0, src_chunk=chunk, buffer_offset=0, size=1, tb=0)
reduce(rank, buffer_offset, size, dst_chunk, tb, reduce_op=ReduceOperationType.sum)

Perform a reduction operation across all ranks in the switch channel.

Reduces data from the specified buffer region across all ranks in the rank group, storing the result in the destination chunk.

Parameters:
  • rank (int) – The rank that will execute this reduction operation.

  • buffer_offset (int) – The offset in the buffer where reduction data starts.

  • size (int) – The size of data to reduce.

  • dst_chunk (Chunk) – The destination chunk where the result will be stored.

  • tb (int) – The thread block ID that will execute this operation.

  • reduce_op (ReduceOperationType, optional) – The reduction operation to perform. Defaults to ReduceOperationType.sum.

Raises:

RuntimeError – If dst_chunk rank is not in the rank group, if chunk size doesn’t match the required size, or if buffer size is insufficient.

Example

>>> channel.reduce(rank=0, buffer_offset=0, size=1, dst_chunk=chunk, tb=0)
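The collective semantics above can be modeled in plain Python: broadcast writes a source chunk into the same buffer region on every rank in the group, and reduce sums that region across all ranks into a destination chunk. ToySwitchChannel is invented for illustration (the real channel operates on Chunk objects within a BufferType-selected buffer), and only the default sum reduction is modeled.

```python
class ToySwitchChannel:
    """Toy model of SwitchChannel collectives; not the real mscclpp API."""

    def __init__(self, rank_list, buffer_size):
        # one buffer per participating rank in the rank group
        self.buffers = {r: [0] * buffer_size for r in rank_list}

    def broadcast(self, rank, src, buffer_offset, size):
        # copy `size` elements of `src` into every rank's buffer region
        for buf in self.buffers.values():
            buf[buffer_offset:buffer_offset + size] = src[:size]

    def reduce(self, rank, buffer_offset, size, dst):
        # sum the same buffer region across all ranks into `dst`
        for i in range(size):
            dst[i] = sum(buf[buffer_offset + i]
                         for buf in self.buffers.values())

ch = ToySwitchChannel([0, 1, 2, 3], buffer_size=4)
ch.broadcast(rank=0, src=[5, 7], buffer_offset=1, size=2)
out = [0, 0]
ch.reduce(rank=0, buffer_offset=1, size=2, dst=out)
```

After the broadcast, every rank's buffer holds the same values at the given offset, so the reduce over four ranks yields four times each broadcast element.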