Performance Guidelines
Optimize for Throughput, Avoid Empty Cycles (M-THROUGHPUT)
You should optimize your library for throughput, and one of your key metrics should be items per CPU cycle.
This does not mean you should neglect latency; after all, you can scale for throughput, but not for latency. However, in most cases you should not pay for latency with the empty cycles that come with single-item processing, contended locks, and frequent task switching.
Ideally, you should
- partition reasonable chunks of work ahead of time,
- let individual threads and tasks deal with their slice of work independently,
- sleep or yield when no work is present,
- design your own APIs for batched operations,
- perform work via batched APIs where available,
- yield within long individual items, or between chunks of batches (see M-YIELD-POINTS),
- exploit CPU caches, temporal and spatial locality.
You should not:
- hot spin to receive individual items faster,
- perform work on individual items if batching is possible,
- do work stealing or similar to balance individual items.
Shared state should only be used if the cost of sharing is less than the cost of re-computation.
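To make this concrete, the hedged sketch below shows one way to partition work ahead of time and let scoped threads handle their slices independently. `Item` and `process` are placeholders for your own type and per-item logic, not part of any real API:

```rust
// Placeholders for your own item type and per-item CPU work.
struct Item;
fn process(_item: &Item) {}

fn process_all(items: &[Item], num_workers: usize) {
    // Partition reasonable chunks of work ahead of time.
    let chunk_size = items.len().div_ceil(num_workers.max(1)).max(1);
    std::thread::scope(|s| {
        for chunk in items.chunks(chunk_size) {
            // Each thread deals with its slice independently: no shared
            // state, no per-item work stealing.
            s.spawn(move || chunk.iter().for_each(process));
        }
    });
}
```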
Identify, Profile, Optimize the Hot Path Early (M-HOTPATH)
You should, early in the development process, identify whether your crate is performance- or COGS-relevant. If it is:
- identify hot paths and create benchmarks around them,
- regularly run a profiler collecting CPU and allocation insights,
- document or communicate the most performance sensitive areas.
For benchmarks we recommend criterion or divan. If possible, benchmarks should not only measure elapsed wall time, but also CPU time used across all threads (this unfortunately requires manual work and is not supported out of the box by the common benchmark utilities).
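For illustration, a minimal criterion benchmark (placed under `benches/`) might look like the sketch below; `parse_batch` and `SAMPLE` are hypothetical stand-ins for your own hot-path function and input:

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};

// Hypothetical hot-path function under test.
fn parse_batch(input: &str) -> usize {
    input.split(',').count()
}

const SAMPLE: &str = "a,b,c,d";

fn bench_parse(c: &mut Criterion) {
    // `black_box` keeps the optimizer from eliding the measured work.
    c.bench_function("parse_batch", |b| {
        b.iter(|| parse_batch(black_box(SAMPLE)))
    });
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);
```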
Profiling Rust on Windows works out of the box with Intel VTune and Superluminal. However, to gain meaningful CPU insights you should enable debug symbols for benchmarks in your Cargo.toml:
```toml
[profile.bench]
debug = 1
```
Documenting the most performance-sensitive areas helps other contributors make better decisions. This can be as simple as sharing screenshots of your latest profiling hot spots.
How much faster? Some of the most common 'language-related' issues we have seen include:
- frequent re-allocations, esp. cloned, growing, or `format!`-assembled strings,
- short-lived allocations instead of bump allocations or similar,
- memory copy overhead that comes from cloning `String`s and collections,
- repeated re-hashing of equal data structures,
- the use of Rust's default hasher where collision resistance wasn't an issue.
Anecdotally, we have seen ~15% benchmark gains on hot paths where only some of these `String` problems were addressed, and it appears that up to 50% could be achieved in highly optimized versions.
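As an illustration of the first item, the sketch below contrasts a per-call `format!` allocation with reusing a caller-provided buffer; the `greeting_*` functions are hypothetical:

```rust
use std::fmt::Write;

// Allocates a fresh String on every call.
fn greeting_slow(user: &str) -> String {
    format!("Hello, {user}!")
}

// Reuses a caller-provided buffer; once the buffer has grown to its
// steady-state capacity, no further allocation happens.
fn greeting_fast(buf: &mut String, user: &str) {
    buf.clear();
    // Writing to a String cannot fail, so the Result can be ignored.
    let _ = write!(buf, "Hello, {user}!");
}
```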
Long-Running Tasks Should Have Yield Points (M-YIELD-POINTS)
If you perform long-running computations, they should contain `yield_now().await` points.
Your future might be executed in a runtime that cannot work around blocking or long-running tasks; even in runtimes that can, such tasks are considered bad design and cause runtime overhead. If your task performs I/O regularly, it will simply use these existing await points to preempt itself:
```rust
async fn process_items(items: &[Item]) {
    // Keep processing items; the runtime can preempt you at every `.await`.
    for i in items {
        read_item(i).await;
    }
}
```
If your task performs long-running CPU operations without intermixed I/O, it should instead cooperatively yield at regular intervals so as not to starve concurrent operations:
```rust
async fn process_items(zip_file: File) {
    let items = zip_file.read().await;
    for i in items {
        decompress(i);
        // Cooperatively hand control back to the runtime between items.
        yield_now().await;
    }
}
```
If the number and duration of your individual operations are unpredictable, you should use APIs such as `has_budget_remaining()` to query your hosting runtime.
Yield how often? In a thread-per-core model, the overhead of task switching must be balanced against the systemic effects of starving unrelated tasks. Under the assumption that a runtime task switch takes hundreds of nanoseconds, plus the overhead of lost CPU caches, continuous execution between yields should be long enough that the switching cost becomes negligible (<1%). Thus, performing 10-100 μs of CPU-bound work between yield points is a good starting point.
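To put that number to work, here is a minimal sketch that time-boxes CPU work between yields. `Item`, `process`, and the 50 μs slice are illustrative assumptions, and `yield_now` is assumed to come from a Tokio-style runtime:

```rust
use std::time::{Duration, Instant};

use tokio::task::yield_now;

// Placeholders for your own item type and per-item CPU work.
struct Item;
fn process(_item: &Item) { /* CPU-bound, no `.await` inside */ }

// ~50 μs of CPU work per slice, per the 10-100 μs guidance above.
const TIME_SLICE: Duration = Duration::from_micros(50);

async fn process_all(items: &[Item]) {
    let mut slice_start = Instant::now();
    for item in items {
        process(item);
        // Reading the clock has a cost of its own; in very hot loops you
        // may want to perform this check only every N items.
        if slice_start.elapsed() >= TIME_SLICE {
            yield_now().await;
            slice_start = Instant::now();
        }
    }
}
```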