
Conversation

@juj
Collaborator

@juj juj commented Jun 26, 2025

This keeps track of the results of the previous test run; on subsequent runs, failing tests are run first, then skipped tests, and last, successful tests in slowest-first order. This improves the parallelism throughput of the suite.

Add support for --failfast in the multithreaded test suite, to stop suite runs quickly at the first test failure.

Together, these two flags --failfast and --failing-and-slow-first can help achieve < 10 second test suite runs on a CI when the suite is failing.
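The fail-fast mechanism in a multithreaded runner can be sketched roughly like this: a minimal illustration using a shared event that workers check before starting each test, not the PR's actual implementation (test names and the stand-in pass/fail logic are made up for the example):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared flag: once any test fails, remaining tests are skipped.
failfast_event = threading.Event()

def run_test(name):
    if failfast_event.is_set():
        return (name, 'skipped')
    passed = name != 'test_bad'   # stand-in for actually running the test
    if not passed:
        failfast_event.set()      # signal the other workers to stop
    return (name, 'passed' if passed else 'failed')

tests = ['test_bad'] + [f'test_{i}' for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_test, tests))
```

Tests already in flight when the event is set still finish, so a fail-fast run typically reports between one failure and one per worker thread.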

Example core0 runtime with test/runner core0 on a 16-core/32-thread system:

Total core time: 2818.016s. Wallclock time: 118.083s. Parallelization: 23.86x.

Same suite runtime with test/runner --failing-and-slow-first core0:

Total core time: 2940.180s. Wallclock time: 94.027s. Parallelization: 31.27x.

This yields better throughput and a 20.37% reduction in test suite wall time.

juj added 6 commits June 26, 2025 19:33
…t runner. This keeps track of results of previous test run, and on subsequent runs, failing tests are run first, then skipped tests, and last, successful tests in slowest-first order. Add support for --failfast in the multithreaded test suite. This improves parallelism throughput of the suite, and helps stop at test failures quickly.
Collaborator

@sbc100 sbc100 left a comment


IIUC this is what I currently use --failfast --continue for. The downside of --failfast --continue of course is that it doesn't work for parallel testing (so I also add -j1).

Collaborator

@sbc100 sbc100 left a comment


Actually maybe I misunderstood. I use --failfast --continue when implementing new features and wanting to fix each test failure as I run into it.

How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

@juj
Collaborator Author

juj commented Aug 13, 2025

How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

It doesn't work on the current CircleCI bots, which always start from a clean slate and run all suites from a single command invocation, but it does help if a developer runs test suites locally, and on the ad hoc CI I am running in http://clbri.com:8010/ .

For example, here is one such run:

[screenshot of a CI test suite run]

where all the failing suites fail within a few seconds, rather than taking an arbitrarily long time to fail.

Passing suites also run faster, since the shortest tests run last, keeping core utilization at 100% throughout the suite run. It is a self-calibrating alternative to naming slow tests test_zzz_ (which is detrimental to test speed).

@juj
Collaborator Author

juj commented Aug 15, 2025

It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

@sbc100
Collaborator

sbc100 commented Aug 15, 2025

It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

It seems like there are couple different things intertwined here.

Perhaps we can tease some of it apart and try to simplify.

The first thing here is making --fail-fast work in the parallel runner. That seems like a great idea and maybe we can land that separately.

Regarding running slow tests first, how about just making that the default? For the initial run we could check in a copy of the test times and update it every few months. Then we wouldn't need a new flag.

Perhaps --failing-first could be a new flag, but again it might make sense to just make it the default?

@sbc100
Collaborator

sbc100 commented Aug 15, 2025

Why is this change important for your CI?

def addError(self, test, err):
    print(test, '... ERROR', file=sys.stderr)
    self.buffered_result = BufferedTestError(test, err)
    self.test_result = 'errored'
Collaborator


Is this needed? Isn't the existing buffered_result object enough?

Collaborator Author


Python doesn't have an API to ask for the result of a test object, and it would have required some kind of an awkward isinstance() jungle to convert the test to a string, so I opted for writing simple-looking code instead.

Collaborator


I thought I posted this already, but can you try if failfast_event and (result.errors or result.failures): instead of adding this new member?

Collaborator Author

@juj juj Aug 20, 2025


The test_result field is not only used on this line; it is serialized to the JSON file as well. I did not find a cleaner way to get those strings into the JSON file otherwise.

try:
    previous_test_run_results = json.load(open('out/__previous_test_run_results.json'))
except FileNotFoundError:
    previous_test_run_results = {}
Collaborator


Do we need to duplicate this code for handling __previous_test_run_results between here and runner.py?

Perhaps a shared sorting function that they can both call?

Collaborator Author


This code is not creating a sorter, but I see the shared block amounts to:

def load_previous_test_run_results():
    try:
      return json.load(open('out/__previous_test_run_results.json'))
    except FileNotFoundError:
      return {}

I could refactor that to e.g. common.py, though that is not necessarily a large win.

Collaborator


I guess I don't quite understand what is going on here then. I would have thought the new --failing-and-slow-first flag would apply equally to the non-parallel and parallel test runners, and so it would only need to be handled in a single place (where we decide on the test ordering). I need to take a deeper look at what is really going on here.

Collaborator Author


Merged the code.

Collaborator Author


The new flag only applies to the parallel test runner. If we wanted to extend this to the non-parallel runner, that's fine, though that could be worked on as a later PR as well.

@juj
Collaborator Author

juj commented Aug 15, 2025

Why is this change important for your CI?

It allows me to run all the test suites in a more reasonable time when iterating on different failures. Otherwise the failing suites take 30-50x longer to surface their failures, making it time-consuming to see what is still failing.

@juj
Collaborator Author

juj commented Aug 18, 2025

The first thing here is making --fail-fast work in the parallel runner. That seems like a great idea and maybe we can land that separately.

Splitting the parallel --fail-fast support out of this PR would definitely optimize for a prettier git history. But I don't think there is a benefit to rewriting the PR in that manner: it would take time, and it is unlikely that it would improve bisectability.

Reading the code in this PR, I would have to delete maybe 70% of it to isolate just the --fail-fast part, and landing that separately would create merge conflicts that I would then have to resolve against this PR. I can see that I would be at it for at least a few more days, with reviews ping-ponging asynchronously across our time zones (unless I push work into the night to get more interactive code review feedback within the same time zone). So it does not seem like a particularly productive refactor, and this is "cold" code for people who don't opt into it.

Regarding running slow tests first, how about just making that the default?

Making --failing-and-slow-first the default behavior in this PR, on the other hand, is a functional change that does have a chance of impacting other people. So it feels to me that flipping --failing-and-slow-first to be the default would benefit from being a separate PR, so that when anyone bisects, they can clearly see whether their workflow broke when the new feature became the default, versus whether the new feature (that they didn't opt into) accidentally broke something when running without the flag (say, if this code that is supposed to be cold for them hypothetically broke --continue or --failfast or something else).

I mean, changing this PR to make --failing-and-slow-first the default behavior, and adding a --no-failing-and-slow-first (--alphabetical-order?) flag to restore the old behavior, is something I can do if you'd like, though it seems more likely to disturb CircleCI, Google's waterfall, or other developers.

@juj juj force-pushed the failing_and_slow_first branch from 0028ba7 to 8f63c9d on August 18, 2025 09:30
@juj
Collaborator Author

juj commented Aug 19, 2025

Ping - any more thoughts on this PR?

Collaborator

@sbc100 sbc100 left a comment


I am supportive of landing this, but there seems to be a lot of new code and complexity.

I've not yet understood what the new create_test_run_sorter is doing, for example.

I'd also like to know a little bit more about your use case, so that once this lands we can refactor without breaking it.

It sounds like you have a bunch of test runners that are failing, and you want them to fail fast? I.e. you don't want to get a full list of failures in the case of a failure? These runners run continuously and save their previous results, then base the next run order on those previous results?

Presumably the failing-first part is only useful for suites that are not passing? This seems odd, since then they would be red the whole time. Once the suite goes green, the overall time to run the suite will return to normal?

However, the slow-first part could be universally useful, even for suites that are not failing? Is that right?

What is the advantage of the fail-fast part for CI? Normally we want to get a full list of failures from the CI, don't we?

Also, just to be clear: if we both fail-fast and run-failing-first, the number of failing tests that run "first" will always be exactly one, since only one test could have failed on the previous run? Also, if we fail-fast we by definition only have partial information for a run, right? So the failing-and-slow-first thing works better when the previous run ran every test?



TEST_ROOT = path_from_root('test')
LAST_TEST = path_from_root('out/last_test.txt')
PREVIOUS_TEST_RUN_RESULTS_FILE = path_from_root('out/previous_test_run_results.json')
Collaborator


What happens when we use this same file to write results from different test suites? I guess the names of the specific tests won't match, so results from a different test suite will be mostly ignored?

Collaborator Author

@juj juj Aug 20, 2025


The JSON file has multiple entries for different test suites (core1.test_atexit vs core2.test_atexit) but then also a global entry for test_atexit without distinction of the test suite.

The former accumulates results from the specific suite, and the latter across all suites.

This way, if a test fails in one suite, its fail frequency gets bumped in other suites' runs as well.
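As an illustration, the layout described might look something like the following; the field names and values here are assumptions for the example, not the file's actual schema:

```python
import json

# Illustrative shape of the results file: per-suite entries
# (core1.test_atexit, core2.test_atexit) plus a suite-agnostic entry
# (test_atexit) that any suite's failure contributes to.
results = {
    'core1.test_atexit': {'result': 'failed', 'duration': 12.3},
    'core2.test_atexit': {'result': 'passed', 'duration': 11.8},
    # Global entry, shared across suites:
    'test_atexit': {'fail_frequency': 0.5, 'duration': 12.0},
}
serialized = json.dumps(results, indent=2)
```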

@juj
Collaborator Author

juj commented Aug 20, 2025

I've not yet understood what the new create_test_run_sorter is doing, for example.

The function create_test_run_sorter defines a Python sort() comparator, which establishes the best order to run the tests in. Added a comment to that function to explain it better.
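In spirit, such a comparator factory might look like the following sketch, adapted with functools.cmp_to_key for use with sorted(); the field names and ordering rules here are illustrative, not the real create_test_run_sorter:

```python
from functools import cmp_to_key

def create_sorter(previous_results):
    """Return a comparator that runs historically failing tests first,
    and slower tests before faster ones within each group."""
    def compare(a, b):
        ia = previous_results.get(a, {})
        ib = previous_results.get(b, {})
        # Higher historical fail frequency sorts earlier.
        fa, fb = ia.get('fail_frequency', 0), ib.get('fail_frequency', 0)
        if fa != fb:
            return -1 if fa > fb else 1
        # Tie-break: longer previous duration sorts earlier.
        da, db = ia.get('duration', 0), ib.get('duration', 0)
        if da != db:
            return -1 if da > db else 1
        return 0
    return compare

prev = {'test_a': {'fail_frequency': 0.8, 'duration': 1},
        'test_b': {'fail_frequency': 0.0, 'duration': 50},
        'test_c': {'fail_frequency': 0.0, 'duration': 5}}
order = sorted(prev, key=cmp_to_key(create_sorter(prev)))
```

Here the failing test_a runs first, then the passing tests slowest-first.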

I'd also like to know a little bit more about your use case

The motivations are:

  • to improve parallelization of the test runner. I run the tests on 88-thread and 256-thread systems, and before this PR, CI parallelization was poor since the tests run in random order, and the test_zzz_ mechanism even places slow tests last. Sorting tests to run slowest-first greatly improves parallelization.
  • to quickly see locally what fails and iterate on the failures, I want a fail-fast mode that aborts on the first failure. This way I can see what to work on next.
  • to run CI on low-end systems, like Windows on ARM and Linux on ARM, I run the CI in fail-fast mode as well, since that improves turnaround time on seeing failures.

It sounds like you have a bunch of test runners that are failing, and you want them to fail fast?

When I am iterating, I want my local test runs to fail fast. When I am running on the CI, I want the fast CIs to cover everything, and the slower CIs to fail-fast.

These runnings run continuously and save their previous results, then base the next run order on those previous results?

That's right. The JSON file accumulates failure and runtime information from previous runs using exponential moving average smoothing.
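A minimal sketch of that kind of smoothing; the smoothing factor ALPHA here is an assumed value, not the one used in the PR:

```python
# Exponential moving average: blend each new observation into the
# stored statistic, so recent runs weigh more but history persists.
ALPHA = 0.3

def ema(previous, observed, alpha=ALPHA):
    return alpha * observed + (1 - alpha) * previous

# Duration smoothing across three successive runs:
duration = 10.0
for observed in (12.0, 11.0, 30.0):
    duration = ema(duration, observed)

# Failure frequency: a pass contributes 0.0, a failure 1.0.
fail_frequency = ema(0.0, 1.0)  # one failure after a clean history
```

This is also why a partial fail-fast run does not erase earlier information: it only nudges the existing averages.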

Once the suite goes green the overall time to run the suite will return to normal?

For the fail-fast CIs, yes. Ordering slowest-first greatly improves all parallel suite runtimes, even when fail-fast is not in effect.

if we both fail-fast and then run-failing-first, the number of failing tests that run "first" will always be exactly one?

In general it will be anything between 1 and the number of parallel threads, depending on how many failing tests there are in the suite, and on the timing of the other threads.

if we fail-fast we by definition only have partial information for a run, right? So the failing-and-slow-first thing works better when the previous run ran every test?

The JSON file accumulates information with weighting from all previous runs, not just from the most recent run. So even if the most recent run was only partial, the JSON file will still have all the info from earlier runs in it.

@juj
Collaborator Author

juj commented Aug 21, 2025

Ping, any more thoughts? I'd really love to land this kind of feature, since it would improve my iteration on the CI.

@kripken would you have thoughts on this PR?

@sbc100
Collaborator

sbc100 commented Aug 21, 2025

CC @brendandahl too who is also working on the parallel test runner right now (trying to get it working in the browser).

@kripken
Member

kripken commented Aug 22, 2025

Personally I wouldn't use this option, but I'm not opposed to it either.

@juj
Collaborator Author

juj commented Aug 27, 2025

Any more thoughts here? This feature would really be critical for my fast CI run times.

If there is any more motivation needed here, here is a visualized run of the bigendian suite without --failing-and-slow-first:

https://clbri.com/dump/emscripten/toolchain_profiler.results_20250826_1246.html

more than 700 seconds (almost 12 minutes)

and the same suite run with --failing-and-slow-first:

https://clbri.com/dump/emscripten/toolchain_profiler.results_20250826_1351.html

about 450 seconds (~7.5 minutes).

When iterating over the remaining flaky tests in the suite, not having this is a time sink. :(

I have the next feature in the pipeline that is blocked on this PR landing first, and I would like to get to work on landing that as well.

@sbc100
Collaborator

sbc100 commented Aug 27, 2025

I still feel like this change mixes more concepts than I would like, but if you are OK with having me/us iterate on this post-landing, then LGTM, and we can follow up in future PRs.

I would love to see the "failfast working in the parallel runner" part split out into its own PR, since it seems fairly separate, but I won't force it.

@sbc100
Collaborator

sbc100 commented Aug 27, 2025

I would love to get this added to all our test runners ASAP too, so we can reap the benefits.

I would also like this to work on bots that run on a clean checkout each time (like the GitHub CI bots), which means I think we should check in a set of timings for each test suite. That should also allow us to remove the @is_slow_test decorator, since that information will then be available via the checked-in JSON file.

@juj
Collaborator Author

juj commented Aug 27, 2025

Thanks, that sounds good. Definitely not holding the code set in stone.

@juj juj merged commit 4803d1b into emscripten-core:main Aug 27, 2025
30 checks passed