control.loop
Looping control flow constructs.
template <typename T> inline void range_for(T begin, T end, auto step, (T) -> void body)
Execute body for a range of values starting from begin
and ending before end, incrementing by step on
each iteration.
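The iteration order described above can be sketched as a plain C++ software model. The names range_for_model and visited_values are hypothetical and are not part of control.loop, and the sketch assumes a positive step.

```cpp
#include <cassert>
#include <vector>

// Hypothetical software model of range_for (not the library implementation):
// calls body(i) for i = begin, begin + step, ... while i < end.
// Assumes step > 0.
template <typename T, typename Step, typename Body>
void range_for_model(T begin, T end, Step step, Body body)
{
    for (T i = begin; i < end; i += step)
    {
        body(i);
    }
}

// Records every value passed to body so the order can be inspected.
inline std::vector<int> visited_values(int begin, int end, int step)
{
    std::vector<int> out;
    range_for_model(begin, end, step, [&out](int i) { out.push_back(i); });
    return out;
}
```

With begin = 0, end = 10 and step = 3, the model visits 0, 3, 6 and 9; end itself is never visited.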
inline void while_do(() -> bool condition, () -> void body)
Keep calling body while the condition
returns true.
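A minimal C++ sketch of the same behavior, assuming the condition is evaluated before each call to body. while_do_model and count_down_calls are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical software model of while_do (not the library implementation):
// the condition is checked before every call to body.
template <typename Cond, typename Body>
void while_do_model(Cond condition, Body body)
{
    while (condition())
    {
        body();
    }
}

// Counts how many times body runs when counting down from start to zero.
inline int count_down_calls(int start)
{
    int remaining = start;
    int calls = 0;
    while_do_model([&] { return remaining > 0; },
                   [&] { --remaining; ++calls; });
    return calls;
}
```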
template <typename I> inline void pipelined_for(auto count, (I) -> void body)
Spawn count threads executing body, which
can be a lambda or a function taking one argument specifying the thread
index from the range [0, count).
Examples
pipelined_for(x, [](uint4 i)
{
// ...
});
void foo(uint8 i)
{
}
pipelined_for(256, foo);
Hardware
Each call to pipelined_for inserts a record into a FIFO.
This FIFO holds the thread count and the values of captured variables. A
finite state machine translates each FIFO record into
thread_count calls to the inner function. A separate finite
state machine unblocks the calling thread after
thread_count threads have completed. Unblocking is achieved
with a zero-width FIFO.
void F(uint32 x, uint32 y)
{
uint32 thread_count = x + 1;
pipelined_for(thread_count, [y](uint32 tid)
{
});
}
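The FIFO record and the two state machines described above can be approximated by a sequential C++ model. This is only a sketch: ForRecord and PipelinedForModel are hypothetical names, the body is replaced by a stand-in that records captured_y + tid, and real hardware would overlap the inner calls rather than run them one after another.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// One FIFO record: the thread count plus the captured variable (y in F).
struct ForRecord
{
    uint32_t thread_count;
    uint32_t captured_y;
};

// Hypothetical sequential model of the generated hardware.
struct PipelinedForModel
{
    std::queue<ForRecord> fifo;   // written once per pipelined_for call
    std::vector<uint32_t> trace;  // stand-in body output, in issue order
    uint32_t completed = 0;       // completion counter

    // Caller side: enqueue one record, then run both state machines.
    // Returns true when the calling thread would be unblocked.
    bool call(uint32_t thread_count, uint32_t y)
    {
        fifo.push({thread_count, y});
        return run();
    }

    bool run()
    {
        ForRecord rec = fifo.front();
        fifo.pop();
        completed = 0;
        // Dispatch state machine: one inner call per thread index.
        for (uint32_t tid = 0; tid < rec.thread_count; ++tid)
        {
            trace.push_back(rec.captured_y + tid);  // stand-in for body(tid)
            ++completed;                            // completion state machine
        }
        return completed == rec.thread_count;
    }
};
```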
template <typename T, auto N> inline void pipelined_for_each(T[N] arr, (index_t<N>, T) -> void body)
Spawn a thread for each element of the array arr, calling
body, which can be a lambda or a function taking two
arguments, the thread index and the element value.
Example
uint32[10] a;
pipelined_for_each(a, [](uint4 i, uint32 x)
{
// ...
});
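As a rough software analogy, the two-argument body receives both the index and the element. The C++ sketch below is sequential and uses hypothetical names (pipelined_for_each_model, weighted_sum); it does not reflect how threads are actually scheduled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical software model: one body call per array element,
// passing the element index and the element value.
template <typename T, std::size_t N, typename Body>
void pipelined_for_each_model(const std::array<T, N>& arr, Body body)
{
    for (std::size_t i = 0; i < N; ++i)
    {
        body(i, arr[i]);
    }
}

// Example body that uses both arguments: sum of (index + 1) * value.
inline unsigned weighted_sum(const std::array<unsigned, 3>& a)
{
    unsigned sum = 0;
    pipelined_for_each_model(a, [&sum](std::size_t i, unsigned x)
    {
        sum += static_cast<unsigned>(i + 1) * x;
    });
    return sum;
}
```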
template <typename I> inline void pipelined_do((I) -> bool body)
Spawn threads that will execute the body closure until
it returns false. If body has an argument,
then the number of threads used is 2^bitsizeof(I). This
should be larger than the depth of the body pipeline. If
body does not have an argument, then the number of threads
used is .options::max_threads_limit. The function returns after
body returns false and all in-flight threads
“drain” from the loop. When an instance of pipelined_do is
called multiple times, the threads for subsequent calls will not start
until all threads for earlier calls have started (although not
necessarily exited).
Example
pipelined_do([](index_t<32> i)
{
// ...
return !done;
});
pipelined_do([]()
{
// ...
return !done;
});
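The thread-count rule for a body with an index argument is simple arithmetic, sketched here in C++. The helper names are hypothetical, and the pipeline depth is an input you would have to obtain from your own design; the sketch does not model draining.

```cpp
#include <cassert>
#include <cstdint>

// Threads used when body takes an index argument that is index_bits wide:
// 2^index_bits. Assumes index_bits < 64.
constexpr uint64_t pipelined_do_thread_count(unsigned index_bits)
{
    return uint64_t{1} << index_bits;
}

// The text recommends more threads than the depth of the body pipeline.
constexpr bool enough_threads(unsigned index_bits, uint64_t pipeline_depth)
{
    return pipelined_do_thread_count(index_bits) > pipeline_depth;
}
```

For the first example above, index_t<32> is a 5-bit index, so 32 threads are used; that would satisfy the recommendation for a body pipeline of, say, 20 stages, but not for one of 40.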
template <auto N, typename I, typename T> inline T[N] pipelined_map(auto count, (I) -> T body)
Spawn count threads executing body, which
can be a lambda or a function taking one argument specifying the thread
index from the range [0, count). The function returns an
array T[N] of values produced by body.
count must not be greater than N.
Example
auto a = pipelined_map<10>(x, [](uint4 i) -> uint32
{
// ...
});
Hardware
The hardware generated for pipelined_map is similar to
the hardware generated for pipelined_for. The thread
collection finite state machine concatenates the return values from each
inner thread into an array and returns that array to the caller.
uint32[4] F(uint32 x, uint32 y)
{
uint32 thread_count = x + 1;
uint32[4] result = pipelined_map<4>
( thread_count,
[y](uint32 tid) -> uint32
{
return y + tid;
}
);
return result;
}
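The collection step can be pictured with a sequential C++ model, assuming count is at most N so that every result fits in the fixed-size array. pipelined_map_model and map_example are hypothetical names, and leaving the elements at index count and above zero-initialized is an assumption of this sketch, not documented behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical software model: result[tid] = body(tid) for tid in [0, count).
// Elements at index >= count stay value-initialized (an assumption here).
template <std::size_t N, typename T, typename Body>
std::array<T, N> pipelined_map_model(std::size_t count, Body body)
{
    assert(count <= N);
    std::array<T, N> result{};
    for (std::size_t tid = 0; tid < count; ++tid)
    {
        result[tid] = body(tid);
    }
    return result;
}

// Mirrors F above: thread_count = x + 1, body returns y + tid.
inline std::array<unsigned, 4> map_example(unsigned x, unsigned y)
{
    const std::size_t thread_count = x + 1;
    return pipelined_map_model<4, unsigned>(thread_count, [y](std::size_t tid)
    {
        return y + static_cast<unsigned>(tid);
    });
}
```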
template <typename I, typename T, T default_value = {}> inline T pipelined_last(auto count, (I) -> T body)
Spawn count threads executing body, which
can be a lambda or a function taking one argument specifying the thread
index from the range [0, count). The function returns the
result produced by the last call to body.
Example
auto a = pipelined_last(x, [](uint4 i)
{
// ...
});
Hardware
The hardware generated for pipelined_last is similar to
the hardware generated for pipelined_for. The thread
collection finite state machine ignores the return value from all inner
threads except the last one, where
tid == (thread_count - 1). This value is returned to the
caller. If the thread count is equal to 0, then
default_value is returned.
uint32 F(uint32 x, uint32 y)
{
uint32 thread_count = x + 1;
uint32 result = pipelined_last
(thread_count,
[y](uint32 tid)
{
return y + tid;
}
);
return result;
}
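The selection rule, including the zero-thread case, can be sketched in C++ as follows. pipelined_last_model and last_example are hypothetical names, and the model runs the inner calls sequentially.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical software model: every inner call runs, but only the value
// from tid == count - 1 is kept. With count == 0 the default is returned.
template <typename T, typename Body>
T pipelined_last_model(uint32_t count, Body body, T default_value = T{})
{
    T result = default_value;
    for (uint32_t tid = 0; tid < count; ++tid)
    {
        T value = body(tid);
        if (tid == count - 1)
        {
            result = value;
        }
    }
    return result;
}

// Same body as F above: returns y + tid.
inline uint32_t last_example(uint32_t thread_count, uint32_t y)
{
    return pipelined_last_model<uint32_t>(thread_count, [y](uint32_t tid)
    {
        return y + tid;
    });
}
```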
template <auto N, auto MaxCallerThreads = max_threads_limit> inline void parallel_for(count_t<N> count, (index_t<N>) -> void body)
Spawn count threads executing body, which
can be a lambda or a function taking one argument specifying the thread
index from the range [0, count).
Threads execute across N instances of the function
body. There are no ordering guarantees among threads
spawned by one call to parallel_for. If there are two calls
(A and B) to a single
parallel_for call site, resulting in threads
A0, A1, B0, B1
executing body, then thread A0 will begin
executing body ahead of B0 and A1
will begin executing body ahead of B1.
count must not be greater than N.
Examples
parallel_for(x, [](uint4 i)
{
// ...
});
void foo(uint8 i)
{
}
parallel_for(256, foo);
Hardware
Each call to parallel_for broadcasts captured variables
to N FIFOs. Each of these FIFOs is associated with a
pipeline which is an instance of body. After executing
body, a counter is incremented, which is used to block the
calling thread until all calls to body have completed.
void F(uint32 x, uint32 y)
{
parallel_for(2, [x, y](index_t<2> tid)
{
});
}
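The broadcast and completion counter can be approximated with a sequential C++ model. Captures and ParallelForModel are hypothetical names; the stand-in body computes x + y + tid, and the sketch assumes that a replica whose index is not below count simply discards its record, which the text does not state.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Captured variables broadcast to every replica (x and y in F).
struct Captures
{
    uint32_t x;
    uint32_t y;
};

// Hypothetical sequential model with N replicas of body.
template <std::size_t N>
struct ParallelForModel
{
    std::array<std::queue<Captures>, N> fifos;  // one FIFO per replica
    std::array<uint32_t, N> seen{};             // stand-in body output
    std::size_t completed = 0;                  // completion counter

    // Returns true when the calling thread would be unblocked.
    bool call(std::size_t count, Captures c)
    {
        assert(count <= N);
        for (auto& f : fifos)
        {
            f.push(c);  // broadcast the captured variables
        }
        completed = 0;
        for (std::size_t tid = 0; tid < N; ++tid)
        {
            Captures mine = fifos[tid].front();
            fifos[tid].pop();
            if (tid < count)  // assumption: unused replicas drop the record
            {
                seen[tid] = mine.x + mine.y + static_cast<uint32_t>(tid);
                ++completed;
            }
        }
        return completed == count;
    }
};
```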
Parameters
-
auto N
Number of replicas of body to instantiate.
-
auto MaxCallerThreads = max_threads_limit
Maximum number of threads concurrently executing inside of
parallel_for. Caller must ensure this limit is not exceeded.
template <typename T, auto N, auto MaxCallerThreads = max_threads_limit> inline void parallel_for_each(T[N] arr, (index_t<N>, T) -> void body)
Spawn a thread for each element of the array arr, calling
body, which can be a lambda or a function taking two
arguments, the thread index and the element value.
Threads execute across N instances of the function
body. There are no ordering guarantees among threads
spawned by one call to parallel_for_each. If there are two
calls (A and B) to a single
parallel_for_each call site, resulting in threads
A0, A1, B0, B1
executing body, then thread A0 will begin
executing body ahead of B0 and A1
will begin executing body ahead of B1.
Example
uint32[10] a;
parallel_for_each(a, [](uint4 i, uint32 x)
{
// ...
});
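The ordering guarantee between two calls at one call site can be illustrated with a small C++ model. It assumes, by analogy with the parallel_for hardware, that each replica of body is fed by its own FIFO; ReplicaQueues is a hypothetical name and the thread labels are only for inspection.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical model: replica i sees the work from each call in call order,
// so A_i starts before B_i. Nothing is implied about A0 versus A1.
template <std::size_t N>
struct ReplicaQueues
{
    std::array<std::queue<std::string>, N> fifos;

    // One call at the call site enqueues one thread per replica.
    void call(const std::string& call_name)
    {
        for (std::size_t i = 0; i < N; ++i)
        {
            fifos[i].push(call_name + std::to_string(i));
        }
    }

    // Order in which one replica starts its threads (drains its FIFO).
    std::vector<std::string> start_order(std::size_t replica)
    {
        std::vector<std::string> order;
        while (!fifos[replica].empty())
        {
            order.push_back(fifos[replica].front());
            fifos[replica].pop();
        }
        return order;
    }
};
```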
Parameters
-
typename T
Type of each input array element.
-
auto N
Number of replicas of body to instantiate.
-
auto MaxCallerThreads = max_threads_limit
Maximum number of threads concurrently executing inside of
parallel_for_each. Caller must ensure this limit is not exceeded.
Arguments
-
T[N] arr
Input array to be processed (each element is processed by a separate call to body).
-
(index_t<N>, T) -> void body
Function which processes one input array element on each call.
template <auto N, typename T, auto MaxCallerThreads = max_threads_limit> inline T[N] parallel_map(auto count, (index_t<N>) -> T body)
Spawn count threads executing body, which
can be a lambda or a function taking one argument specifying the thread
index from the range [0, count). The function returns an
array T[N] of values produced by body.
count must not be greater than N.
Threads execute across N instances of the function
body. There are no ordering guarantees among threads
spawned by one call to parallel_map. If there are two calls
(A and B) to a single
parallel_map call site, resulting in threads
A0, A1, B0, B1
executing body, then thread A0 will begin
executing body ahead of B0 and A1
will begin executing body ahead of B1.
Example
auto a = parallel_map<4>(x, [](uint2 i) -> uint32
{
// ...
});
Hardware
Each call to parallel_map broadcasts captured variables
to N FIFOs. Each of these FIFOs is associated with a
pipeline which is an instance of body. After executing
body, the return result is placed into a FIFO. Results from
all FIFOs are dequeued and returned.
void F(uint32 x, uint32 y)
{
parallel_map<2>(2, [x, y](index_t<2> tid) -> uint32
{
return x + y;
});
}
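The result path can be sketched as a sequential C++ model in which each replica places its return value in its own result FIFO and the caller dequeues one value per FIFO. parallel_map_model and parallel_map_example are hypothetical names, and leaving entries at index count and above zero-initialized is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Hypothetical software model of the result collection.
template <std::size_t N, typename Body>
std::array<uint32_t, N> parallel_map_model(std::size_t count, Body body)
{
    assert(count <= N);
    std::array<std::queue<uint32_t>, N> result_fifos;
    // Each active replica runs body and enqueues its return value.
    for (std::size_t tid = 0; tid < count; ++tid)
    {
        result_fifos[tid].push(body(tid));
    }
    // The caller dequeues one result per FIFO into the output array.
    std::array<uint32_t, N> out{};
    for (std::size_t tid = 0; tid < N; ++tid)
    {
        if (!result_fifos[tid].empty())
        {
            out[tid] = result_fifos[tid].front();
            result_fifos[tid].pop();
        }
    }
    return out;
}

// Mirrors F above: two replicas, each returning x + y.
inline std::array<uint32_t, 2> parallel_map_example(uint32_t x, uint32_t y)
{
    return parallel_map_model<2>(2, [x, y](std::size_t) { return x + y; });
}
```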
Parameters
-
auto N
Number of replicas of body to instantiate.
-
typename T
Type of each output array element.
-
auto MaxCallerThreads = max_threads_limit
Maximum number of threads concurrently executing inside of
parallel_map. Caller must ensure this limit is not exceeded.
Arguments
-
auto count
Number of times that body will be invoked. Must be no greater than N.
-
(index_t<N>) -> T body
Function which returns one array element on each call.