james7132
Contributor

async_task::Runnable::schedule performs an "extra" Waker clone and drop whenever the task's schedule function captures variables, in order to avoid deallocation during scheduling. We can avoid this by directly scheduling the Runnable with the references that already exist. This PR splits the schedule functions out into their own static functions and invokes them directly rather than going through Runnable::schedule. Benchmarking results:

group                                               direct-schedule                         master
-----                                               ---------------                         ------
executor::create                                    1.00    879.8±9.33ns        ? ?/sec     1.01   884.3±10.50ns        ? ?/sec
multi_thread/executor::channels                     1.00     42.6±1.13ms        ? ?/sec     1.02     43.3±1.69ms        ? ?/sec
multi_thread/executor::spawn_batch                  1.00     28.8±7.70µs        ? ?/sec     1.14     32.7±4.87µs        ? ?/sec
multi_thread/executor::spawn_many_local             1.00     13.9±0.92ms        ? ?/sec     1.10     15.2±1.45ms        ? ?/sec
multi_thread/executor::spawn_one                    1.00   938.5±37.02ns        ? ?/sec     1.12  1046.5±115.88ns        ? ?/sec
multi_thread/executor::spawn_recursively            1.00    137.0±1.26ms        ? ?/sec     1.03    140.5±1.37ms        ? ?/sec
multi_thread/executor::web_server                   1.00     59.7±2.47ms        ? ?/sec     1.05     62.4±3.92ms        ? ?/sec
multi_thread/executor::yield_now                    1.00     20.5±0.19ms        ? ?/sec     1.00     20.5±0.74ms        ? ?/sec
multi_thread/static_executor::channels              1.00     42.5±1.42ms        ? ?/sec     1.05     44.5±4.32ms        ? ?/sec
multi_thread/static_executor::spawn_many_local      1.00      2.9±0.29ms        ? ?/sec     1.08      3.2±0.20ms        ? ?/sec
multi_thread/static_executor::spawn_one             1.00  1079.3±81.93ns        ? ?/sec     1.05  1138.7±154.83ns        ? ?/sec
multi_thread/static_executor::spawn_recursively     1.01     38.4±0.34ms        ? ?/sec     1.00     38.0±1.08ms        ? ?/sec
multi_thread/static_executor::web_server            1.00     59.4±1.01ms        ? ?/sec     1.05     62.3±2.73ms        ? ?/sec
multi_thread/static_executor::yield_now             1.00     20.4±0.43ms        ? ?/sec     1.01     20.5±0.56ms        ? ?/sec
single_thread/executor::channels                    1.00     14.7±0.35ms        ? ?/sec     1.23     18.0±0.84ms        ? ?/sec
single_thread/executor::spawn_batch                 1.00     18.3±8.77µs        ? ?/sec     1.05    19.3±11.16µs        ? ?/sec
single_thread/executor::spawn_many_local            1.00      4.6±0.49ms        ? ?/sec     1.08      4.9±0.46ms        ? ?/sec
single_thread/executor::spawn_one                   1.03  1468.8±163.70ns        ? ?/sec    1.00  1426.0±64.37ns        ? ?/sec
single_thread/executor::spawn_recursively           1.00     21.8±1.14ms        ? ?/sec     1.13     24.6±1.61ms        ? ?/sec
single_thread/executor::web_server                  1.00     22.5±0.31ms        ? ?/sec     1.10     24.8±3.46ms        ? ?/sec
single_thread/executor::yield_now                   1.00      4.0±0.04ms        ? ?/sec     1.09      4.4±0.23ms        ? ?/sec
single_thread/static_executor::channels             1.00     17.1±0.96ms        ? ?/sec     1.11     19.0±1.92ms        ? ?/sec
single_thread/static_executor::spawn_many_local     1.00  1887.9±89.21µs        ? ?/sec     1.22      2.3±0.09ms        ? ?/sec
single_thread/static_executor::spawn_one            1.00  962.6±510.98ns        ? ?/sec     1.13  1091.9±653.42ns        ? ?/sec
single_thread/static_executor::spawn_recursively    1.00     17.8±1.12ms        ? ?/sec     1.19     21.1±2.51ms        ? ?/sec
single_thread/static_executor::web_server           1.00     22.8±0.61ms        ? ?/sec     1.06     24.2±1.93ms        ? ?/sec
single_thread/static_executor::yield_now            1.00      4.1±0.26ms        ? ?/sec     1.04      4.3±0.17ms        ? ?/sec

This seems to have a strong impact on the executor types that already have low spawning overhead (i.e. the StaticExecutors). Rerunning the benchmarks does show some noise, but the results generally favor direct scheduling, with at least a 5-10% perf gain on most of these benchmarks.


Commentary: I really dislike the fact that this is necessary to optimize the executors here. Runnable::schedule otherwise doesn't really have a reason to exist as an API if it's just going to be less efficient than directly scheduling the task. I suspect this is a combination of a lack of inlining due to dynamic dispatch (not sure why Rust does not devirtualize this call) and the overhead from cloning the waker and dropping it to keep the task alive during scheduling. I'm not sure whether it's advisable to update the guidance for async-task based on this.
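
To make the suspected cost concrete, here is a toy model using only the standard library. It is a sketch, not async_task's real implementation: ToyTask, ToyRunnable, push_to_queue and run_demo are hypothetical names, and an Arc clone stands in for the Waker clone that keeps the task alive. The indirect path goes through a `&dyn Fn` and pays one keep-alive clone and drop per schedule; the direct path calls the split-out function with a reference that already exists.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical task header (NOT async_task's real layout).
struct ToyTask {
    id: usize,
    /// How many keep-alive clones were made while scheduling this task.
    guard_clones: AtomicUsize,
}

/// Hypothetical stand-in for a runnable handle.
struct ToyRunnable {
    task: Arc<ToyTask>,
}

type Queue = RefCell<VecDeque<ToyRunnable>>;

/// The schedule logic split out into a plain function, so callers that
/// already hold a reference to the queue can call it directly.
fn push_to_queue(queue: &Queue, runnable: ToyRunnable) {
    queue.borrow_mut().push_back(runnable);
}

impl ToyRunnable {
    fn new(id: usize) -> Self {
        let task = ToyTask { id, guard_clones: AtomicUsize::new(0) };
        ToyRunnable { task: Arc::new(task) }
    }

    /// Models the indirect path: dynamic dispatch plus a keep-alive clone
    /// that is held across the call and dropped afterwards.
    fn schedule(self, schedule_fn: &dyn Fn(ToyRunnable)) {
        let guard = Arc::clone(&self.task); // stand-in for the Waker clone
        guard.guard_clones.fetch_add(1, Ordering::Relaxed);
        schedule_fn(self);
        drop(guard); // stand-in for the Waker drop
    }
}

/// Schedules tasks 0..3 indirectly and 3..6 directly, then returns
/// `(task id, keep-alive clones)` for each queued task in queue order.
fn run_demo() -> Vec<(usize, usize)> {
    let queue: Queue = RefCell::new(VecDeque::new());

    // Indirect: the closure captures `queue`, as an executor's schedule
    // closure would capture its state.
    let schedule_fn = |r: ToyRunnable| push_to_queue(&queue, r);
    for id in 0..3 {
        ToyRunnable::new(id).schedule(&schedule_fn);
    }

    // Direct: no closure, no dynamic dispatch, no keep-alive clone.
    for id in 3..6 {
        push_to_queue(&queue, ToyRunnable::new(id));
    }

    let rows: Vec<(usize, usize)> = queue
        .borrow()
        .iter()
        .map(|r| (r.task.id, r.task.guard_clones.load(Ordering::Relaxed)))
        .collect();
    rows
}
```

In this model, run_demo() reports one keep-alive clone for each of the three indirectly scheduled tasks and none for the three scheduled directly. It only counts the extra reference-count traffic; it says nothing about how large the real cost is, which is what the benchmarks above measure.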

@james7132
Contributor Author

james7132 commented Aug 28, 2025

Having read through the async_task source, this seems to be more widespread, since it happens each time the task is rescheduled too. Some possible solutions:

  1. Put the pinned state in the metadata, though that changes the public-facing API, since it's visible as the generic type parameter for Task.
  2. Change async_task to use a more lightweight approach to delaying deallocation.
  3. Create a variant of spawn_unchecked that requires the schedule function to never drop the provided Runnable.
