control.loop

Looping control flow constructs.


template <typename T>
inline void range_for(T begin, T end, auto step, (T) -> void body)

Execute body for a range of values starting from begin and ending before end, incrementing by step on each iteration.
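
The half-open range described above can be illustrated with a small sequential model in plain C++. This is only a sketch of the documented behaviour, not the library's implementation; range_for_model and visited are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sequential model: call body for begin, begin + step, ...
// and stop before reaching end (end itself is never visited).
template <typename T, typename Step, typename Body>
void range_for_model(T begin, T end, Step step, Body body)
{
    for (T i = begin; i < end; i += step)
    {
        body(i);
    }
}

// Helper for illustration: records every value the body is called with.
std::vector<std::uint32_t> visited(std::uint32_t begin, std::uint32_t end, std::uint32_t step)
{
    std::vector<std::uint32_t> out;
    range_for_model(begin, end, step, [&out](std::uint32_t i) { out.push_back(i); });
    return out;
}
```

Under this model, visited(0, 10, 2) yields 0, 2, 4, 6, 8, and a range whose begin equals end never calls the body.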

inline void while_do(() -> bool condition, () -> void body)

Keep calling body while the condition returns true.

template <typename I>
inline void pipelined_for(auto count, (I) -> void body)

Spawn count threads executing body, which can be a lambda or a function taking one argument specifying the thread index from the range [0, count).

Examples
pipelined_for(x, [](uint4 i)
{
    // ...
});

void foo(uint8 i)
{
}

pipelined_for(256, foo);

Hardware

Each call to pipelined_for inserts a record into a FIFO. The record holds the thread count and the values of captured variables. One finite state machine translates each FIFO record into thread_count calls to the inner function. A separate finite state machine unblocks the calling thread after thread_count threads have completed. Unblocking is achieved with a zero-width FIFO.

void F(uint32 x, uint32 y)
{
    uint32 thread_count = x + 1;

    pipelined_for(thread_count, [y](uint32 tid)
    {
    });
}
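
The record-and-dispatch scheme described above can be mimicked in software. The plain C++ sketch below is a sequential stand-in under stated assumptions: CallRecord, PipelinedForModel, call, and service_one are hypothetical names, the two state machines are collapsed into one loop, and nothing here reflects the actual generated hardware or its timing.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// One FIFO record per pipelined_for call: the thread count plus captured values.
struct CallRecord
{
    std::uint32_t thread_count;
    std::uint32_t y; // captured variable from the example F
};

// Hypothetical software stand-in for the FIFO and the two state machines.
struct PipelinedForModel
{
    std::queue<CallRecord> fifo;
    std::vector<std::uint32_t> issued_tids; // thread indices handed to the body

    // Caller side: enqueue a record describing the call.
    void call(std::uint32_t thread_count, std::uint32_t y)
    {
        fifo.push(CallRecord{thread_count, y});
    }

    // Services one record: issues thread_count body invocations, counts the
    // completions, and returns true when the caller would be unblocked.
    // Returns false if there is no pending record.
    bool service_one()
    {
        if (fifo.empty())
        {
            return false;
        }
        const CallRecord rec = fifo.front();
        fifo.pop();

        std::uint32_t completed = 0;
        for (std::uint32_t tid = 0; tid < rec.thread_count; ++tid)
        {
            issued_tids.push_back(tid); // body(tid) would run here
            ++completed;
        }
        return completed == rec.thread_count;
    }
};
```

In this model, a call with a thread count of 3 produces thread indices 0, 1, and 2 before the caller is released.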

template <typename T, auto N>
inline void pipelined_for_each(T[N] arr, (index_t<N>, T) -> void body)

Spawn a thread for each element of array arr, calling body, which can be a lambda or a function taking two arguments: the thread index and the element value.

Example
uint32[10] a;
pipelined_for_each(a, [](uint4 i, uint32 x)
{
    // ...
});

template <typename I>
inline void pipelined_do((I) -> bool body)

Spawn threads that execute the body closure until it returns false. If body takes an argument, the number of threads used is 2^bitsizeof(I); this thread count should be larger than the depth of the body pipeline. If body takes no argument, the number of threads used is .options::max_threads_limit. The function returns after body has returned false and all in-flight threads have “drained” from the loop. When an instance of pipelined_do is called multiple times, the threads for a later call do not start until all threads for earlier calls have started (although they need not have exited).

Example
pipelined_do([](index_t<32> i)
{
    // ...
    return !done;
});

pipelined_do([]()
{
    // ...
    return !done;
});

template <auto N, typename I, typename T>
inline T[N] pipelined_map(auto count, (I) -> T body)

Spawn count threads executing body, which can be a lambda or a function taking one argument specifying the thread index from the range [0, count). The function returns an array T[N] of values produced by body. count must not be greater than N.

Example
auto a = pipelined_map<10>(x, [](uint4 i) -> uint32
{
    // ...
});

Hardware

The hardware generated for pipelined_map is similar to the hardware generated for pipelined_for. The thread collection finite state machine concatenates the return values from each inner thread into an array and returns that array to the caller.

uint32[4] F(uint32 x, uint32 y)
{
    uint32 thread_count = x + 1;

    uint32[4] result = pipelined_map<4>
        (thread_count,
            [y](uint32 tid) -> uint32
            {
                return y + tid;
            }
        );

    return result;
}
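
The collection step can be pictured with a sequential model in plain C++. This is a sketch only: pipelined_map_model and F_model are hypothetical names, the model requires count to fit in the N-element result, and leaving elements at indices of count or above value-initialized is an assumption of the sketch, not documented behaviour.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sequential model: run body for tid in [0, count) and
// place each return value at position tid of the result array.
template <std::size_t N, typename T, typename Body>
std::array<T, N> pipelined_map_model(std::size_t count, Body body)
{
    assert(count <= N); // the result array must be able to hold every value
    std::array<T, N> result{}; // indices >= count stay value-initialized (assumption)
    for (std::size_t tid = 0; tid < count; ++tid)
    {
        result[tid] = body(static_cast<std::uint32_t>(tid));
    }
    return result;
}

// Software counterpart of the example F above.
std::array<std::uint32_t, 4> F_model(std::uint32_t x, std::uint32_t y)
{
    const std::uint32_t thread_count = x + 1;
    return pipelined_map_model<4, std::uint32_t>(
        thread_count, [y](std::uint32_t tid) { return y + tid; });
}
```

With x = 3 and y = 10, the model fills all four elements with 10, 11, 12, and 13.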

template <typename I, typename T, T default_value = {}>
inline T pipelined_last(auto count, (I) -> T body)

Spawn count threads executing body, which can be a lambda or a function taking one argument specifying the thread index from the range [0, count). The function returns the result produced by the last call to body.

Example
auto a = pipelined_last(x, [](uint4 i)
{
    // ...
});

Hardware

The hardware generated for pipelined_last is similar to the hardware generated for pipelined_for. The thread collection finite state machine ignores the return value from all inner threads except the last one, where tid == (thread_count - 1). This value is returned to the caller. If the thread count is equal to 0, then default_value is returned.

uint32 F(uint32 x, uint32 y)
{
    uint32 thread_count = x + 1;

    uint32 result = pipelined_last
        (thread_count,
            [y](uint32 tid)
            {
                return y + tid;
            }
        );

    return result;
}
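
The selection rule described above, including the zero-count case, can be written as a sequential model in plain C++. This is a sketch of the documented behaviour rather than the library's implementation; pipelined_last_model and last_of are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sequential model: every thread runs body, but only the value
// from tid == count - 1 is kept. With count == 0, default_value is returned.
template <typename T, typename Body>
T pipelined_last_model(std::uint32_t count, Body body, T default_value = T{})
{
    T last = default_value;
    for (std::uint32_t tid = 0; tid < count; ++tid)
    {
        const T value = body(tid);
        if (tid == count - 1)
        {
            last = value;
        }
    }
    return last;
}

// Helper for illustration: same body as the example F, with an explicit count.
std::uint32_t last_of(std::uint32_t count, std::uint32_t y)
{
    return pipelined_last_model<std::uint32_t>(
        count, [y](std::uint32_t tid) { return y + tid; });
}
```

Here last_of(3, 10) gives 12, the value from thread index 2, while last_of(0, 10) gives the default of 0 because no thread runs.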

template <auto N, auto MaxCallerThreads = max_threads_limit>
inline void parallel_for(count_t<N> count, (index_t<N>) -> void body)

Spawn count threads executing body, which can be a lambda or a function taking one argument specifying the thread index from the range [0, count).

Threads execute across N instances of the function body. There are no ordering guarantees among threads spawned by one call to parallel_for. If there are two calls (A and B) to a single parallel_for call site, resulting in threads A0, A1, B0, B1 executing body, then thread A0 will begin executing body ahead of B0 and A1 will begin executing body ahead of B1.

count must be no greater than N.

Examples
parallel_for(x, [](uint4 i)
{
    // ...
});

void foo(uint8 i)
{
}

parallel_for(256, foo);

Hardware

Each call to parallel_for broadcasts captured variables to N FIFOs. Each of these FIFOs is associated with a pipeline which is an instance of body. After executing body, a counter is incremented, which is used to block the calling thread until all calls to body have completed.

void F(uint32 x, uint32 y)
{
    parallel_for(2, [x, y](index_t<2> tid)
    {
    });
}
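
The per-replica FIFOs also explain the ordering guarantee between two calls to the same call site. The plain C++ sketch below models that under stated assumptions: ParallelForModel, Record, call, and run are hypothetical names, each call pushes one record into every replica's FIFO, and a replica whose index is not below count ignores the record. That skipping rule and the sequential draining order are assumptions of the sketch, not a description of the generated hardware.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical model of N body replicas, each fed by its own FIFO.
template <std::size_t N>
struct ParallelForModel
{
    struct Record
    {
        std::string call_name; // label for the call, e.g. "A" or "B"
        std::size_t count;     // number of threads requested by that call
    };

    std::array<std::queue<Record>, N> fifos;          // one FIFO per replica
    std::array<std::vector<std::string>, N> started;  // start order per replica

    // Broadcast one record to every replica's FIFO.
    void call(const std::string& name, std::size_t count)
    {
        assert(count <= N);
        for (auto& fifo : fifos)
        {
            fifo.push(Record{name, count});
        }
    }

    // Drain all FIFOs; returns the total number of body invocations.
    std::size_t run()
    {
        std::size_t completed = 0;
        for (std::size_t tid = 0; tid < N; ++tid)
        {
            while (!fifos[tid].empty())
            {
                const Record rec = fifos[tid].front();
                fifos[tid].pop();
                if (tid < rec.count) // assumption: unused replicas skip the record
                {
                    started[tid].push_back(rec.call_name + std::to_string(tid));
                    ++completed;
                }
            }
        }
        return completed;
    }
};
```

Because each replica consumes its FIFO in order, two calls A and B with a count of 2 start A0 before B0 on replica 0 and A1 before B1 on replica 1, while the model makes no statement about the order of A0 relative to A1.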

Parameters

  • auto N
    

    Number of replicas of body to instantiate.

  • auto MaxCallerThreads = max_threads_limit
    

    Maximum number of threads concurrently executing inside of parallel_for. Caller must ensure this limit is not exceeded.

Arguments

  • count_t<N> count
    

    Number of times that body will be invoked. Must be no greater than N.

  • (index_t<N>) -> void body
    

    Function to invoke.

template <typename T, auto N, auto MaxCallerThreads = max_threads_limit>
inline void parallel_for_each(T[N] arr, (index_t<N>, T) -> void body)

Spawn a thread for each element of array arr, calling body, which can be a lambda or a function taking two arguments: the thread index and the element value.

Threads execute across N instances of the function body. There are no ordering guarantees among threads spawned by one call to parallel_for_each. If there are two calls (A and B) to a single parallel_for_each call site, resulting in threads A0, A1, B0, B1 executing body, then thread A0 will begin executing body ahead of B0 and A1 will begin executing body ahead of B1.

Example
uint32[10] a;
parallel_for_each(a, [](uint4 i, uint32 x)
{
    // ...
});

Parameters

  • typename T
    

    Type of each input array element.

  • auto N
    

    Number of replicas of body to instantiate.

  • auto MaxCallerThreads = max_threads_limit
    

    Maximum number of threads concurrently executing inside of parallel_for_each. Caller must ensure this limit is not exceeded.

Arguments

  • T[N] arr
    

    Input array to be processed (each element is processed by a separate call to body).

  • (index_t<N>, T) -> void body
    

    Function which processes one input array element on each call.

template <auto N, typename T, auto MaxCallerThreads = max_threads_limit>
inline T[N] parallel_map(auto count, (index_t<N>) -> T body)

Spawn count threads executing body, which can be a lambda or a function taking one argument specifying the thread index from the range [0, count). The function returns an array T[N] of values produced by body. count must not be greater than N.

Threads execute across N instances of the function body. There are no ordering guarantees among threads spawned by one call to parallel_map. If there are two calls (A and B) to a single parallel_map call site, resulting in threads A0, A1, B0, B1 executing body, then thread A0 will begin executing body ahead of B0 and A1 will begin executing body ahead of B1.

Example
auto a = parallel_map<4>(x, [](uint2 i) -> uint32
{
    // ...
});

Hardware

Each call to parallel_map broadcasts captured variables to N FIFOs. Each of these FIFOs is associated with a pipeline which is an instance of body. After executing body, the return result is placed into a FIFO. Results from all FIFOs are dequeued and returned.

void F(uint32 x, uint32 y)
{
    parallel_map<2>(2, [x, y](index_t<2> tid) -> uint32
    {
        return x + y;
    });
}

Parameters

  • auto N
    

    Number of replicas of body to instantiate.

  • typename T
    

    Type of each output array element.

  • auto MaxCallerThreads = max_threads_limit
    

    Maximum number of threads concurrently executing inside of parallel_map. Caller must ensure this limit is not exceeded.

Arguments

  • auto count
    

    Number of times that body will be invoked. Must be no greater than N.

  • (index_t<N>) -> T body
    

    Function which returns one array element on each call.