
Conversation

@juj
Collaborator

@juj juj commented Jun 26, 2025

This keeps track of the results of the previous test run; on subsequent runs, failing tests are run first, then skipped tests, and last, successful tests in slowest-first order. This improves the parallelism throughput of the suite.

Add support for --failfast in the multithreaded test suite, to stop suite runs quickly at the first test failure.

Together, these two flags --failfast and --failing-and-slow-first can help achieve < 10 second test suite runs on a CI when the suite is failing.
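The fail-fast mechanism in a multithreaded runner can be sketched roughly like this: a minimal illustration using a shared event that workers check before starting each test, not the PR's actual implementation (test names and the stand-in pass/fail logic are made up for the example):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Shared flag: once any test fails, remaining tests are skipped.
failfast_event = threading.Event()

def run_test(name):
    if failfast_event.is_set():
        return (name, 'skipped')
    passed = name != 'test_bad'   # stand-in for actually running the test
    if not passed:
        failfast_event.set()      # signal the other workers to stop
    return (name, 'passed' if passed else 'failed')

tests = ['test_bad'] + [f'test_{i}' for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_test, tests))
```

Tests already in flight when the event is set still finish, so a fail-fast run typically reports between one failure and one per worker thread.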

Example core0 runtime with test/runner core0 on a 16-core/32-thread system:

Total core time: 2818.016s. Wallclock time: 118.083s. Parallelization: 23.86x.

Same suite runtime with test/runner --failing-and-slow-first core0:

Total core time: 2940.180s. Wallclock time: 94.027s. Parallelization: 31.27x.

This yields better throughput and a 20.37% reduction in test suite wall time.

juj added 6 commits June 26, 2025 19:33
…t runner. This keeps track of results of previous test run, and on subsequent runs, failing tests are run first, then skipped tests, and last, successful tests in slowest-first order. Add support for --failfast in the multithreaded test suite. This improves parallelism throughput of the suite, and helps stop at test failures quickly.
Collaborator

@sbc100 sbc100 left a comment


IIUC this is what I currently use --failfast --continue for. The downside of --failfast --continue of course is that it doesn't work for parallel testing (so I also add -j1).

Collaborator

@sbc100 sbc100 left a comment


Actually maybe I misunderstood. I use --failfast --continue when implementing new features and wanting to fix each test failure as I run into it.

How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

@juj
Collaborator Author

juj commented Aug 13, 2025

How does this improve CI times on the bots? It seems like it would not affect the first run, but only subsequent runs, which the bots don't do, do they?

It doesn't work on the current CircleCI bots, which always start from a clean slate and run all suites from a single command invocation, but it does help if a developer runs test suites locally, and on the ad hoc CI I am running in http://clbri.com:8010/ .

For example, here is one such run:

[screenshot of a CI test suite run]

where all the failing suites fail within a few seconds, rather than taking an arbitrarily long time to fail.

Passing suites also run faster, since the shortest tests run last, keeping core utilization at 100% throughout the suite run. It is a self-calibrating alternative to naming slow tests test_zzz_ (which is detrimental to test speed).

@juj
Collaborator Author

juj commented Aug 15, 2025

It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

@sbc100
Collaborator

sbc100 commented Aug 15, 2025

It would be great to get this landed, since this would enable my CI to run against the upstream tree more easily.

It seems like there are couple different things intertwined here.

Perhaps we can tease some of it apart and try to simplify.

The first thing here is making --fail-fast work in the parallel runner. That seems like a great idea and maybe we can land that separately.

Regarding running slow tests first, how about just making that the default? For the initial run we could check in a copy of the test times and update it every few months. Then we wouldn't need a new flag.

Perhaps --failing-first could be a new flag, but again it might make sense to just make it the default?

@sbc100
Collaborator

sbc100 commented Aug 15, 2025

Why is this change important for your CI?

def addError(self, test, err):
    print(test, '... ERROR', file=sys.stderr)
    self.buffered_result = BufferedTestError(test, err)
    self.test_result = 'errored'
Collaborator


Is this needed? Isn't the existing buffered_result object enough?

Collaborator Author


Python doesn't have an API to ask for the result of a test object, and it would have required some kind of an awkward isinstance() jungle to convert the test to a string, so I opted for writing simple-looking code instead.

Collaborator


I thought I posted this already, but can you try if failfast_event and (result.errors or result.failures): instead of adding this new member?

Collaborator Author

@juj juj Aug 20, 2025


The test_result field is not only used on this line; it is serialized to the JSON file as well. I did not find a cleaner way to get those strings into the JSON file otherwise.

try:
    previous_test_run_results = json.load(open('out/__previous_test_run_results.json'))
except FileNotFoundError:
    previous_test_run_results = {}
Collaborator


Do we need to duplicate this code for handling __previous_test_run_results between here and runner.py?

Perhaps a shared sorting function that they can both call?

Collaborator Author


This code is not creating a sorter, but I see the shared block amounts to:

def load_previous_test_run_results():
    try:
      return json.load(open('out/__previous_test_run_results.json'))
    except FileNotFoundError:
      return {}

I could refactor that to e.g. common.py, though that is not necessarily a large win.

Collaborator


I guess I don't quite understand what is going on here then. I would have thought the new --failing-and-slow-first flag would apply equally to the non-parallel and parallel test runners, and so it would only need to be handled in a single place (where we decide on the test ordering). I need to take a deeper look at what is really going on here.

Collaborator Author


Merged the code.

Collaborator Author


The new flag only applies to the parallel test runner. If we wanted to extend this to the non-parallel runner, that's fine, though that could be worked on as a later PR as well.

@juj
Collaborator Author

juj commented Aug 15, 2025

Why is this change important for your CI?

It allows me to run all the test suites in a more reasonable time when iterating on different failures. Otherwise the failing suites take 30-50x longer to surface their failures, making it time-consuming to see what is still failing.

@juj
Collaborator Author

juj commented Aug 18, 2025

The first thing here is making --fail-fast work in the parallel runner. That seems like a great idea and maybe we can land that separately.

Splitting the parallel --fail-fast support out of this PR would definitely optimize for a prettier git history. But I don't think there is a benefit to rewriting the PR in that manner: it would take time, and it is unlikely that it would improve bisectability.

Reading the code in this PR, I would have to delete maybe 70% of it to isolate just the --fail-fast part, and landing that separately would create merge conflicts that I would then have to resolve against this PR. I can see that I would be at it for at least a few more days, with reviews ping-ponging asynchronously across our time zones (unless I push work into the night to get more interactive code review feedback within the same time zone). So it does not seem like a particularly productive refactor, and this is "cold" code for people who don't opt into it.

Regarding running slow tests first, how about just making that the default?

Making --failing-and-slow-first the default behavior in this PR, on the other hand, is a functional change that does have a chance of impacting other people. So it feels to me that flipping --failing-and-slow-first to be the default would benefit from being a separate PR, so that when anyone bisects, they can clearly see whether their workflow broke when the new feature became the default, versus whether the new feature (that they didn't opt into) accidentally broke something when running without the flag (say, if this code that is supposed to be cold for them hypothetically broke --continue or --failfast or something else).

I mean, changing this PR to make --failing-and-slow-first the default behavior, and adding a --no-failing-and-slow-first (--alphabetical-order?) flag to restore the old behavior, is something I can do if you'd like, though it seems more likely to disturb CircleCI, Google's waterfall, or other developers.

@juj juj force-pushed the failing_and_slow_first branch from 0028ba7 to 8f63c9d on August 18, 2025 09:30
@juj
Collaborator Author

juj commented Aug 19, 2025

Ping - any more thoughts on this PR?

Collaborator

@sbc100 sbc100 left a comment


I am supportive of landing this, but there seems to be a lot of new code and complexity.

I've not yet understood what the new create_test_run_sorter is doing, for example.

I'd also like to know a little bit more about your use case, so that once this lands we can refactor without breaking it.

It sounds like you have a bunch of test runners that are failing, and you want them to fail fast? I.e. you don't want to get a full list of failures in the case of a failure? These runners run continuously and save their previous results, then base the next run order on those previous results?

Presumably the failing-first part is only useful for suites that are not passing? This seems odd, since then they would be red the whole time. Once the suite goes green, the overall time to run the suite will return to normal?

However, the slow-first part could be universally useful, even for suites that are not failing? Is that right?

What is the advantage of the fail-fast part for CI? Normally we want to get a full list of failures from the CI, don't we?

Also, just to be clear: if we both fail-fast and run-failing-first, the number of failing tests that run "first" will always be exactly one, since only one test could have failed on the previous run? Also, if we fail-fast we by definition only have partial information for a run, right? So the failing-and-slow-first thing works better when the previous run ran every test?



TEST_ROOT = path_from_root('test')
LAST_TEST = path_from_root('out/last_test.txt')
PREVIOUS_TEST_RUN_RESULTS_FILE = path_from_root('out/previous_test_run_results.json')
Collaborator


What happens when we use this same file to write results from different test suites? I guess the names of the specific tests won't match, so results from a different test suite will be mostly ignored?

Collaborator Author

@juj juj Aug 20, 2025


The JSON file has multiple entries for different test suites (core1.test_atexit vs core2.test_atexit) but then also a global entry for test_atexit without distinction of the test suite.

The former accumulates results from the specific suite, and the latter across all suites.

This way, if a test fails in one suite, its fail frequency gets bumped in other suites' runs as well.
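As an illustration, the layout described might look something like the following; the field names and values here are assumptions for the example, not the file's actual schema:

```python
import json

# Illustrative shape of the results file: per-suite entries
# (core1.test_atexit, core2.test_atexit) plus a suite-agnostic entry
# (test_atexit) that any suite's failure contributes to.
results = {
    'core1.test_atexit': {'result': 'failed', 'duration': 12.3},
    'core2.test_atexit': {'result': 'passed', 'duration': 11.8},
    # Global entry, shared across suites:
    'test_atexit': {'fail_frequency': 0.5, 'duration': 12.0},
}
serialized = json.dumps(results, indent=2)
```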

@juj
Collaborator Author

juj commented Aug 20, 2025

I've not yet understood what the new create_test_run_sorter is doing, for example.

The function create_test_run_sorter defines a Python sort() comparator, which establishes the best order to run the tests in. Added a comment to that function to explain it better.
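In spirit, such a comparator factory might look like the following sketch, adapted with functools.cmp_to_key for use with sorted(); the field names and ordering rules here are illustrative, not the real create_test_run_sorter:

```python
from functools import cmp_to_key

def create_sorter(previous_results):
    """Return a comparator that runs historically failing tests first,
    and slower tests before faster ones within each group."""
    def compare(a, b):
        ia = previous_results.get(a, {})
        ib = previous_results.get(b, {})
        # Higher historical fail frequency sorts earlier.
        fa, fb = ia.get('fail_frequency', 0), ib.get('fail_frequency', 0)
        if fa != fb:
            return -1 if fa > fb else 1
        # Tie-break: longer previous duration sorts earlier.
        da, db = ia.get('duration', 0), ib.get('duration', 0)
        if da != db:
            return -1 if da > db else 1
        return 0
    return compare

prev = {'test_a': {'fail_frequency': 0.8, 'duration': 1},
        'test_b': {'fail_frequency': 0.0, 'duration': 50},
        'test_c': {'fail_frequency': 0.0, 'duration': 5}}
order = sorted(prev, key=cmp_to_key(create_sorter(prev)))
```

Here the failing test_a runs first, then the passing tests slowest-first.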

I'd also like to know a little bit more about your use case

The motivations are:

  • to improve parallelization of the test runner. I run the tests on 88-thread and 256-thread systems, and before this PR, CI parallelization was poor since the tests run in random order, and the test_zzz_ mechanism even places slow tests last. Sorting tests to run slowest-first greatly improves parallelization.
  • to quickly see locally what fails and iterate on the failures, I want a fail-fast mode that aborts on the first failure. This way I can see what to work on next.
  • to run CI on low-end systems, like Windows on ARM and Linux on ARM, I run the CI in fail-fast mode as well, since that improves turnaround time on seeing failures.

It sounds like you have a bunch of test runners that are failing, and you want them to fail fast?

When I am iterating, I want my local test runs to fail fast. When I am running on the CI, I want the fast CIs to cover everything, and the slower CIs to fail-fast.

These runnings run continuously and save their previous results, then base the next run order on those previous results?

That's right. The JSON file accumulates failure and runtime information from previous runs using exponential moving average smoothing.
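A minimal sketch of that kind of smoothing; the smoothing factor ALPHA here is an assumed value, not the one used in the PR:

```python
# Exponential moving average: blend each new observation into the
# stored statistic, so recent runs weigh more but history persists.
ALPHA = 0.3

def ema(previous, observed, alpha=ALPHA):
    return alpha * observed + (1 - alpha) * previous

# Duration smoothing across three successive runs:
duration = 10.0
for observed in (12.0, 11.0, 30.0):
    duration = ema(duration, observed)

# Failure frequency: a pass contributes 0.0, a failure 1.0.
fail_frequency = ema(0.0, 1.0)  # one failure after a clean history
```

This is also why a partial fail-fast run does not erase earlier information: it only nudges the existing averages.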

Once the suite goes green the overall time to run the suite will return to normal?

For the fail-fast CIs, yes. Ordering slowest-first greatly improves all parallel suite runtimes, even when fail-fast is not in effect.

if we both fail-fast and then run-failing-first, the number of failing tests that run "first" will always be exactly one?

In general it will be anything between 1 and the number of parallel threads, depending on how many failing tests there are in the suite, and on the timing of the other threads.

if we fail-fast we by definition only have partial information for a run, right? So the failing-and-slow-first thing works better when the previous run ran every test?

The JSON file accumulates information with weighting from all previous runs, not just from the most recent run. So even if the most recent run was only partial, the JSON file will still have all the info from earlier runs in it.

@juj
Collaborator Author

juj commented Aug 21, 2025

Ping, any more thoughts? I'd really love to land this kind of feature, since it would improve my iteration on the CI.

@kripken would you have thoughts on this PR?

@sbc100
Collaborator

sbc100 commented Aug 21, 2025

CC @brendandahl too who is also working on the parallel test runner right now (trying to get it working in the browser).

@kripken
Member

kripken commented Aug 22, 2025

Personally I wouldn't use this option, but I'm not opposed to it either.

@juj
Collaborator Author

juj commented Aug 27, 2025

Any more thoughts here? This feature would really be critical for my fast CI run times.

If there is any more motivation needed here, here is a visualized run of the bigendian suite without --failing-and-slow-first:

https://clbri.com/dump/emscripten/toolchain_profiler.results_20250826_1246.html

more than 700 seconds (almost 12 minutes)

and the same suite run with --failing-and-slow-first:

https://clbri.com/dump/emscripten/toolchain_profiler.results_20250826_1351.html

about 450 seconds (~7.5 minutes).

When iterating over the remaining flaky tests in the suite, not having this is a time sink. :(

I have the next feature in the pipeline that is blocked on this PR landing first, and I would like to get to work on landing that as well.

@sbc100
Collaborator

sbc100 commented Aug 27, 2025

I still feel like this change mixes more concepts than I would like, but if you are OK with having me/us iterate on this post-landing, then LGTM, and we can follow up in future PRs.

I would love to see the "failfast working in the parallel runner" part split out into its own PR, since it seems fairly separate, but I won't force it.

@sbc100
Collaborator

sbc100 commented Aug 27, 2025

I would love to get this added to all our test runners ASAP too, so we can reap the benefits.

I would also like this to work on bots that run on a clean checkout each time (like the GitHub CI bots), which means I think we should check in a set of timings for each test suite. That should also allow us to remove the @is_slow_test decorator, since that information will then be available via the checked-in JSON file.

@juj
Collaborator Author

juj commented Aug 27, 2025

Thanks, that sounds good. Definitely not holding the code set in stone.

@juj juj merged commit 4803d1b into emscripten-core:main Aug 27, 2025
30 checks passed