mscclpp.language.rank
Classes
|
Base class for buffer objects in MSCCL++ programs. |
|
A scratch buffer for temporary data storage during communication operations. |
|
Represents a single rank (GPU) in an MSCCL++ program. |
|
A semaphore for asynchronus synchronization between thread blocks. |
- class mscclpp.language.rank.BaseBuffer(rank, buffer_type, offset, size)
Bases:
object
Base class for buffer objects in MSCCL++ programs.
BaseBuffer represents a memory buffer associated with a specific rank, providing indexed access to create chunks for communication operations. It supports slice-based indexing to create Chunk objects.
- buffer_type
The type of buffer (input, output, scratch).
- Type:
BufferType
- class mscclpp.language.rank.Buffer(rank, size)
Bases:
BaseBuffer
A scratch buffer for temporary data storage during communication operations.
Buffer extends BaseBuffer to provide dynamically allocated scratch space for a specific rank. It automatically manages scratch buffer allocation within the GPU’s scratch memory space.
- buffer_type
Always BufferType.scratch for Buffer instances.
- Type:
BufferType
- class mscclpp.language.rank.Rank(rank)
Bases:
object
Represents a single rank (GPU) in an MSCCL++ program.
Rank provides operations that can be performed locally on a single GPU, including copy operations, reduce operations, and synchronization barriers. It manages local buffer operations and coordinates with other ranks through the program context.
- Parameters:
rank (int)
- barrier(tb_list)
Create a synchronization barrier between thread blocks.
Synchronizes execution between multiple thread blocks on this rank. For a single thread block, creates a sync operation. For multiple thread blocks, creates a barrier operation.
- Parameters:
tb_list (List[int]) – List of thread block IDs to synchronize.
- Raises:
RuntimeError – If tb_list is empty.
Example
>>> rank0.barrier(tb_list=[0, 1, 2])
- copy(dst_chunk, src_chunk, tb=None, tb_group=None)
Copy data from source chunk to destination chunk.
Performs a simple local copy operation between two chunks on this rank without any packet format conversion.
- Parameters:
dst_chunk (Chunk) – The destination chunk to copy data to.
src_chunk (Chunk) – The source chunk to copy data from.
tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.
tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.
Example
>>> rank.copy(dst_chunk, src_chunk, tb=0)
- copy_packets(dst_chunk, src_chunk, tb=None, tb_group=None)
Copy data from regular format to packet format.
Packs data from the source chunk and copies it to the destination scratch buffer in packet format.
- Parameters:
dst_chunk (Chunk) – The destination scratch chunk to store packed data.
src_chunk (Chunk) – The source chunk containing data to pack.
tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.
tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.
Example
>>> rank.copy_packet(dst_chunk, src_chunk, tb=0)
- get_input_buffer()
Get the input buffer for this rank.
- Returns:
The input buffer associated with this rank.
- Return type:
Example
>>> input_buf = rank.get_input_buffer()
- get_output_buffer()
Get the output buffer for this rank.
- Returns:
The output buffer associated with this rank.
- Return type:
Example
>>> output_buf = rank.get_output_buffer()
- reduce(src_chunk, other_chunks, tb=None, tb_group=None, dst_chunk=None, reduce_op=ReduceOperationType.sum, packet=False)
Perform a local reduction operation on this rank.
Reduces data from multiple chunks locally on this rank, combining the source chunk with other chunks using the specified reduction operation.
- Parameters:
src_chunk (Chunk) – The primary source chunk to reduce.
other_chunks (List[Chunk]) – Additional chunks to include in the reduction.
tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.
tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.
dst_chunk (Chunk, optional) – The destination chunk for the result. If None, uses src_chunk. Defaults to None.
reduce_op (ReduceOperationType, optional) – The reduction operation to perform. Defaults to ReduceOperationType.sum.
packet (bool, optional) – Whether to operate in packet format. Defaults to False.
- Raises:
RuntimeError – If chunk ranks don’t match this rank, if chunk sizes are inconsistent, or if other_chunks is empty.
Example
>>> rank.reduce(src_chunk, other_chunks, tb=0, dst_chunk)
- unpack_packets(dst_chunk, src_chunk, tb=None, tb_group=None)
Copy data from packet format to regular format.
Unpacks data from packet format in the source scratch buffer and copies it to the destination chunk in regular format.
- Parameters:
dst_chunk (Chunk) – The destination chunk to copy unpacked data to.
src_chunk (Chunk) – The source scratch chunk containing packed data.
tb (int, optional) – The thread block ID that will execute this operation. Defaults to None.
tb_group (ThreadBlockGroup, optional) – The ThreadBlockGroup that will execute this operation. Defaults to None.
Example
>>> rank.unpack_packet(dst_chunk, src_chunk, tb=0)
- class mscclpp.language.rank.Semaphore(rank, initial_value)
Bases:
object
A semaphore for asynchronus synchronization between thread blocks.
Semaphore provides acquire and release operations for synchronization between thread blocks within a rank. Each semaphore has an initial value and supports typical semaphore semantics.
- acquire(tb, data_sync=SyncType.both)
Acquire the semaphore from a thread block.
Blocks the thread block until the semaphore can be acquired (value > 0), then decrements the semaphore value.
- Parameters:
tb (int) – The thread block ID that will acquire the semaphore.
data_sync (SyncType, optional) – Defines the order where threads inside the thread block will be synchronized (equivalent to __syncthreads()) relative to the acquire operation. Defaults to SyncType.both.
Example
>>> sem.acquire(tb=0, data_sync=SyncType.before)
- release(tb, data_sync=SyncType.both)
Release the semaphore from a thread block.
Increments the semaphore value, potentially unblocking other thread blocks waiting to acquire the semaphore.
- Parameters:
tb (int) – The thread block ID that will release the semaphore.
data_sync (SyncType, optional) – Defines the order where threads inside the thread block will be synchronized (equivalent to __syncthreads()) relative to the release operation. Defaults to SyncType.both.
Example
>>> sem.release(tb=0, data_sync=SyncType.after)