Port Channel

Note

This tutorial follows the Memory Channel tutorial.

Build and Run the Example

The code of this tutorial is under examples/tutorials/04-port-channel.

Build the example with make:

$ cd examples/tutorials/04-port-channel
$ make

Run the example with ./bidir_port_channel. If you are in a container, you may need to run with root privileges. You should see output similar to the following:

# ./bidir_port_channel
GPU 0: Preparing for tests ...
GPU 1: Preparing for tests ...
GPU 0: [Bidir PutWithSignal] bytes 1024, elapsed 0.0204875 ms/iter, BW 0.0499818 GB/s
GPU 0: [Bidir PutWithSignal] bytes 1048576, elapsed 0.0250319 ms/iter, BW 41.8896 GB/s
GPU 0: [Bidir PutWithSignal] bytes 134217728, elapsed 0.365497 ms/iter, BW 367.219 GB/s
Succeed!

The example code uses localhost port 50505 by default. If the port is already in use, you can change it by modifying the PORT_NUMBER macro in the code.

Caution

Note that this example is NOT a performance benchmark. The performance numbers are provided to give you an idea of the performance characteristics of PortChannel. Depending on the application scenario and implementation, synchronization can be further optimized to achieve better performance.

Code Overview

The example code implements a bidirectional data transfer between two GPUs on the same machine using a PortChannel. The code is similar to the Memory Channel tutorial; the main difference is that a PortChannel is constructed by a ProxyService instance. We need to “add” the pre-built Semaphore and RegisteredMemory objects to the ProxyService, which returns a SemaphoreId and MemoryIds, respectively:

mscclpp::ProxyService proxyService;
mscclpp::SemaphoreId semaId = proxyService.addSemaphore(sema);
mscclpp::MemoryId localMemId = proxyService.addMemory(localRegMem);
mscclpp::MemoryId remoteMemId = proxyService.addMemory(remoteRegMem);

Using the IDs, we can create a PortChannel associated with the ProxyService:

mscclpp::PortChannel portChan = proxyService.portChannel(semaId, remoteMemId, localMemId);

The procedures for building Semaphore and RegisteredMemory are explained in the Basic Concepts and the Memory Channel tutorials, respectively.

We need to call proxyService.startProxy() before running GPU kernels that use the PortChannel. The ProxyService runs a background host thread that listens for incoming requests from the PortChannel and handles them accordingly. We can call proxyService.stopProxy() to stop the background thread after all GPU operations are done.
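
Putting these pieces together, the host-side flow looks roughly like the sketch below. The kernel name bidirPutKernel() and the copyBytes size mirror the example code, but obtaining the device handle via portChan.deviceHandle(), passing it as a kernel argument, and the exact argument list are assumptions here and may differ from the actual example code.

// Sketch of the host-side flow; see the note above for which parts are assumptions.
auto devHandle = portChan.deviceHandle();               // device-side handle used inside the kernel

proxyService.startProxy();                              // start the background proxy thread first

bidirPutKernel<<<1, 1>>>(devHandle, rank, copyBytes);   // rank is this GPU's index (0 or 1); argument list is illustrative
cudaDeviceSynchronize();                                // wait until all GPU operations are done

proxyService.stopProxy();                               // stop the background proxy thread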

PortChannel

PortChannel is a communication channel that enables data transfer between GPUs using I/O ports, such as the Copy Engine (CE) of a GPU (e.g., cudaMemcpyAsync), InfiniBand queue pairs, or TCP sockets. Compared to MemoryChannel, which copies data using GPU threads, PortChannel offloads data transfer to dedicated hardware or software components. This reduces interference with other parallel GPU operations, and potentially allows for higher throughput. However, PortChannel may introduce additional latency due to the overhead of initiating data transfers.

The device handle of a PortChannel provides the following methods. Since the data transfer is offloaded, each method should be called by a single GPU thread.

  • put(): Initiates an asynchronous one-way data transfer from the local memory to the remote memory.

  • signal(): Asynchronously signals the completion of all previous put()s to the remote side.

  • wait(): Blocks the calling GPU thread until the corresponding signal() is received from the remote side.

  • poll(): Non-blocking version of wait(). Returns immediately with a boolean indicating whether the signal has been received.

  • flush(): Synchronizes the local GPU with the PortChannel, ensuring that all previous operations are completed.

  • Fused methods (e.g., putWithSignal()): combine multiple sequential operations into a single call for efficiency.

The following diagram illustrates how the bidirPutKernel() function in the example code would work when GPU0 is faster than GPU1. The execution order may vary depending on the relative speeds of the GPUs.

sequenceDiagram
    participant GPU0
    participant GPU1

    GPU0->>GPU1: signal()
    GPU1->>GPU0: signal()

    Note over GPU0: wait() returns by signal()

    GPU0->>GPU1: putWithSignal(): copy local data range<br>[0:copyBytes) to remote range [0:copyBytes)

    Note over GPU1: wait() returns by signal()

    GPU1->>GPU0: putWithSignal(): copy local data range<br>[copyBytes:2*copyBytes) to remote range [copyBytes:2*copyBytes)

    Note over GPU0: wait() returns by putWithSignal()
    Note over GPU1: wait() returns by putWithSignal()
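
For concreteness, here is a simplified device-side sketch that follows the sequence above. This is not the actual bidirPutKernel() from the example; the header name, the PortChannelDeviceHandle type name, and the putWithSignal(dstOffset, srcOffset, bytes) parameter order are assumptions, so consult the example source for the exact signatures.

#include <mscclpp/port_channel_device.hpp>  // header name assumed

// Simplified sketch of a kernel following the diagram above (not the actual example code).
__global__ void bidirPutKernelSketch(mscclpp::PortChannelDeviceHandle chan, int rank,
                                     uint64_t copyBytes) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {          // a single GPU thread drives the channel
    uint64_t offset = rank * copyBytes;               // GPU 0 writes [0:copyBytes), GPU 1 writes [copyBytes:2*copyBytes)
    chan.signal();                                    // tell the peer that we are ready to receive
    chan.wait();                                      // wait until the peer is ready as well
    chan.putWithSignal(offset, offset, copyBytes);    // copy our half to the remote GPU and signal
    chan.wait();                                      // wait until the peer's data has arrived
    chan.flush();                                     // ensure our own transfer has completed
  }
}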
    

ProxyService

ProxyService is a host-side service that assists the operation of one or more PortChannels. When a PortChannel's put(), signal(), or flush() methods (or their fused versions) are called on the GPU, the channel constructs a corresponding request and pushes it into a FIFO queue managed by the ProxyService on the host side. The ProxyService runs a background thread that processes these requests and performs the actual data transfers or signaling operations using the appropriate implementation, which depends on the transport type of the Connection associated with the channel.

In most cases, users only need to use a ProxyService instance to create PortChannels and start/stop the proxy thread.
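
For intuition, the proxy thread's job can be pictured roughly as the loop below. This is a conceptual sketch only, not the actual mscclpp implementation; the Request type, its fields, and the fifo, connection, and semaphore objects are hypothetical placeholders for the transport-specific machinery described above.

// Conceptual sketch only -- NOT the actual implementation; all names here are hypothetical.
while (running) {
  Request req = fifo.pop();                 // request pushed by put()/signal()/flush() on the GPU
  switch (req.type) {
    case RequestType::Put:                  // copy data using the Connection's transport
      connection->write(dstMem, req.dstOffset, srcMem, req.srcOffset, req.size);
      break;
    case RequestType::Signal:               // notify the remote side via the semaphore
      semaphore->signal();
      break;
    case RequestType::Flush:                // wait until outstanding transfers have completed
      connection->flush();
      break;
  }
}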

Caution

The device handle methods of PortChannel are thread-safe except when the number of concurrent threads exceeds the FIFO queue size of the ProxyService. The default FIFO queue size is 512, which can be changed by passing a different value to the ProxyService constructor.
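
For example, assuming the FIFO size is the constructor argument mentioned above, a larger queue could be requested as follows (sketch):

// Sketch: request a larger FIFO than the default, assuming the ProxyService
// constructor takes the FIFO size as its argument.
mscclpp::ProxyService proxyService(1024);  // 1024-entry FIFO instead of the default 512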

Note

Advanced users may want to customize the behavior of ProxyService to support custom request types or transport mechanisms, which can be done by subclassing BaseProxyService. See the AllGatherProxyService class for an example.

Summary and Next Steps

In this tutorial, we learned how to use PortChannel for bidirectional data transfer between two GPUs using a ProxyService.