Maps a multi-dimensional local index and thread layout into a reshaped global index.
The idea is that the user specifies the dimensions of the logical index (local index plus thread ids) and the dimensions of the target tensor; the logical index is then mapped to the corresponding position in the multi-dimensional target tensor.
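A minimal sketch of this mapping, assuming a row-major layout; the helper names `linearize`, `delinearize`, and `map_index` are illustrative, not part of the actual API:

```rust
/// Flatten a multi-dimensional logical index (local index + thread ids)
/// into a single linear offset, row-major.
fn linearize(index: &[usize], dims: &[usize]) -> usize {
    index.iter().zip(dims).fold(0, |acc, (i, d)| acc * d + i)
}

/// Re-expand a linear offset into coordinates of the target tensor.
fn delinearize(mut flat: usize, dims: &[usize]) -> Vec<usize> {
    let mut out = vec![0; dims.len()];
    for k in (0..dims.len()).rev() {
        out[k] = flat % dims[k];
        flat /= dims[k];
    }
    out
}

/// Map a logical index, defined over `logical_dims`, to the corresponding
/// multi-dimensional position in a tensor of shape `target_dims`.
fn map_index(logical: &[usize], logical_dims: &[usize], target_dims: &[usize]) -> Vec<usize> {
    delinearize(linearize(logical, logical_dims), target_dims)
}
```

For example, logical index `[1, 2]` over logical dims `[4, 8]` flattens to offset 10, which lands at position `[0, 10]` in a `[2, 16]` target tensor.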
Used to distinguish different memory spaces in GPU programming.
GpuGlobal represents global memory space.
See shared::GpuShared for shared memory space.
When chunking or atomic operations are needed, the GpuGlobal is owned by the
chunk or atomic struct. This ensures that the user cannot access the data
except through chunk or atomic operations.
This mapping strategy is useful when we want to reshape a 1D array into a 2D
array and then distribute elements to threads one at a time, round-robin,
until all elements are consumed. It gives each thread a non-contiguous
(strided) partition.
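The round-robin distribution above can be sketched as follows; the function name `cyclic_partition` and the flattened thread count are illustrative assumptions:

```rust
/// Element `i` of a 1D array goes to thread `i % num_threads`, so each
/// thread owns a strided, non-contiguous slice of indices.
fn cyclic_partition(len: usize, num_threads: usize, thread: usize) -> Vec<usize> {
    (thread..len).step_by(num_threads).collect()
}
```

With 7 elements and 3 threads, thread 0 owns indices `[0, 3, 6]` and thread 1 owns `[1, 4]`: non-contiguous partitions, as described.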
Linear mapping for a 1D array.
N is the number of thread dimensions.
width is the chunking window: the array is divided into chunks of width
elements, dealt out across the threads until all elements are covered.
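A sketch of this chunked mapping, under the assumption that windows of `width` elements are dealt to threads round-robin; `n_threads` stands in for the N thread dimensions flattened into a single count, and `chunked_partition` is a hypothetical name:

```rust
/// The 1D array is split into windows of `width` elements; windows are
/// assigned to threads round-robin until every element is covered.
fn chunked_partition(len: usize, n_threads: usize, width: usize, thread: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut start = thread * width;
    while start < len {
        // The final window may be truncated at the end of the array.
        let end = (start + width).min(len);
        out.extend(start..end);
        start += n_threads * width;
    }
    out
}
```

With 10 elements, 2 threads, and `width = 3`, thread 0 owns `[0, 1, 2, 6, 7, 8]` and thread 1 owns `[3, 4, 5, 9]` (its second window is truncated).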
Exposes the convert function to users.
This trait is sealed to prevent arbitrary implementations: only types that
implement HostToDevPrivateSeal can implement it. This restricts the trait to
conversions known to be safe, guaranteeing a safe host-to-device interface.
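A sketch of the sealed-trait pattern described above. The names `HostToDevPrivateSeal` mirrors the text, while `HostToDev`, `convert`'s signature, and `HostBuffer` are illustrative assumptions:

```rust
mod private {
    /// The seal lives in a private module, so downstream crates
    /// cannot implement it (and thus cannot implement the public trait).
    pub trait HostToDevPrivateSeal {}
}

/// Public trait exposing the conversion; only sealed types may implement it,
/// so no unsound host-to-device conversion can be added from outside.
pub trait HostToDev: private::HostToDevPrivateSeal {
    fn convert(&self) -> Vec<u8>;
}

/// Illustrative host-side type that is allowed to cross to the device.
pub struct HostBuffer(pub Vec<f32>);

impl private::HostToDevPrivateSeal for HostBuffer {}

impl HostToDev for HostBuffer {
    fn convert(&self) -> Vec<u8> {
        // Serialize as little-endian bytes, a stand-in for the real transfer.
        self.0.iter().flat_map(|x| x.to_le_bytes()).collect()
    }
}
```

Because `private::HostToDevPrivateSeal` is unreachable outside this crate, the compiler rejects any external `impl HostToDev for T`, which is exactly the guarantee the doc comment claims.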
This attribute generates a host wrapper around a kernel function, allowing it to be launched from the host.
The kernel function itself is the original function, taking a Config parameter.
The generated host function lives in mod #kname { pub fn launch(…) }.
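A hypothetical sketch of what such an expansion might look like for a kernel named `add_one`; the `Config` fields, the kernel body (a CPU stand-in for the device code), and `launch`'s exact signature are all assumptions, not the macro's real output:

```rust
/// Illustrative launch configuration passed to the kernel.
pub struct Config {
    pub grid: usize,
    pub block: usize,
}

/// The "original function with Config": a CPU stand-in for the device
/// kernel body, where each simulated thread handles one element.
fn add_one_kernel(cfg: &Config, data: &mut [i32]) {
    let threads = cfg.grid * cfg.block;
    for i in 0..data.len().min(threads) {
        data[i] += 1;
    }
}

/// The generated `mod #kname { pub fn launch(…) }` wrapper.
pub mod add_one {
    use super::{add_one_kernel, Config};

    /// Host-side entry point that forwards to the kernel.
    pub fn launch(cfg: Config, data: &mut [i32]) {
        add_one_kernel(&cfg, data);
    }
}
```

The point of the wrapper is that host code only ever calls `add_one::launch(cfg, …)`, keeping the kernel body itself out of the host-facing API.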