@Ronitsabhaya75 Ronitsabhaya75 commented Aug 18, 2025

Summary

This PR fixes the root cause of the test-wasi-pthread flakiness by adding retry logic for resource exhaustion scenarios, rather than just increasing timeouts.

Problem

As identified by @joyeecheung, the failures were not timeouts. pthread_create was returning an error due to resource constraints in CI environments, which tripped this assertion:
Assertion failed: r == 0 (c/pthread.c: main: 17)

Solution

  1. C Code: Added retry logic with exponential backoff for pthread_create failures (EAGAIN/ENOMEM)
  2. JavaScript: Added retry logic for Node.js Worker creation failures
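A minimal sketch of what retrying pthread_create with exponential backoff could look like. This is not the code from this PR: the names `create_thread_with_retry` and `mark_done`, the 1 ms starting delay, and the doubling factor are all assumptions made for illustration. Note that pthread_create returns the error number directly rather than setting errno.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Example thread body (hypothetical): sets the int it is given to 1. */
static void *mark_done(void *arg) {
  *(int *)arg = 1;
  return NULL;
}

/* Hypothetical helper: retry pthread_create on EAGAIN/ENOMEM, doubling
 * the sleep between attempts. Returns 0 on success, otherwise the last
 * error number reported by pthread_create. */
static int create_thread_with_retry(pthread_t *thread,
                                    void *(*start)(void *), void *arg,
                                    int max_attempts) {
  assert(max_attempts > 0);
  long delay_ms = 1; /* assumed initial backoff */
  int r = 0;
  for (int attempt = 0; attempt < max_attempts; attempt++) {
    r = pthread_create(thread, NULL, start, arg);
    if (r == 0) return 0;                      /* success */
    if (r != EAGAIN && r != ENOMEM) return r;  /* not a resource error */
    if (attempt + 1 < max_attempts) {
      struct timespec ts = {
        .tv_sec = delay_ms / 1000,
        .tv_nsec = (delay_ms % 1000) * 1000000L,
      };
      nanosleep(&ts, NULL);
      delay_ms *= 2; /* exponential backoff */
    }
  }
  return r; /* still failing after max_attempts tries */
}
```

A caller would then assert on the helper's return value instead of on a single pthread_create call, for example `create_thread_with_retry(&t, mark_done, &flag, 5)`.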

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/wasi

@nodejs-github-bot added the needs-ci (PRs that need a full CI run) and test (Issues and PRs related to the tests) labels on Aug 18, 2025
@joyeecheung
Member

joyeecheung commented Aug 18, 2025

The test was marked as flaky because it frequently timed out in CI environments

FWIW the failure was unrelated to timeouts, it was caused by pthread_create returning an error:

[process 1130254]: --- stderr ---
  (node:1130254) ExperimentalWarning: WASI is an experimental feature and might change at any time
  (Use `node --trace-warnings ...` to show where the warning was created)
  fd_write(2, 69600, 2, 69596)
  Assertion failed: r == 0 (c/pthread.c: main: 17)
  wasm://wasm/00020abe:1


  RuntimeError: unreachable
      at wasm://wasm/00020abe:wasm-function[26]:0x892
      at wasm://wasm/00020abe:wasm-function[27]:0x8da
      at wasm://wasm/00020abe:wasm-function[13]:0x46a
      at wasm://wasm/00020abe:wasm-function[11]:0x328
      at WASI.start (node:wasi:138:7)
      at /home/iojs/build/workspace/node-test-commit-arm/test/fixtures/wasi-preview-1.js:191:20

@Ronitsabhaya75
Author

Ronitsabhaya75 commented Aug 18, 2025

I thought it was a timeout, my bad.

@Ronitsabhaya75
Author

Ronitsabhaya75 commented Aug 18, 2025

@joyeecheung should I change the 5-second timeout back to 1 second?
The reason is that all tests are passing.

@joyeecheung
Member

Did you mean that you want to just remove the flaky status in the PR? I don't think it's been proven yet that the test is no longer flaky, so if you just remove it, it can fail the CI again. One passing CI run does not mean the failure will never reproduce. That's the nature of flakes: they only occasionally fail unrelated PRs, but frequently enough to make the CI unusable, because people keep seeing failures unrelated to their changes and have to run the CI many times to get a passing one.

@Ronitsabhaya75
Author

I understand what you mean. I'll remove that change and just keep the change from a 1-second wait to a 5-second wait, if it helps.

@Ronitsabhaya75
Author

@joyeecheung I think this could fix the flaky WASI test. I added retry logic.

],
workerData,
});
let worker;
Member

@joyeecheung joyeecheung Aug 19, 2025

The changes in this file don't seem very relevant to me. ERR_WORKER_INIT_FAILED is emitted when a Node.js worker thread can't be initialized, mostly when the V8 heap cannot be initialized, but the flake happens when a native pthread can't be created. The test is structured to load the WASM module that spawns the pthread, and it's that module that's crashing, not the driver worker. Those are two different initializations, and we are not seeing worker thread initialization failures in the CI, only pthread creation failures, so there's no need to change anything in how the driver worker is spawned.

@@ -106,7 +128,7 @@ assert.strictEqual(wasiPreview1.wasiImport,
throw new Error(e);
});

-const r = Atomics.wait(result, 0, 0, 1000);
+const r = Atomics.wait(result, 0, 0, 5000);
Member

I don't think this is needed - again the failures don't seem related to the driver worker.

break; // Success
}

// If it's a resource issue (EAGAIN/ENOMEM), retry with a small delay
Member

Have you tried to reproduce locally to see exactly what error this is?
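One way to see exactly which error comes back is to print the return value before asserting on it, since the bare `r == 0` assertion hides the code. A rough sketch, assuming a hypothetical wrapper named `create_thread_verbose` and an example thread body `echo_arg` (neither is part of the test fixture):

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Example thread body (hypothetical): returns its argument unchanged. */
static void *echo_arg(void *arg) {
  return arg;
}

/* Hypothetical diagnostic wrapper: calls pthread_create and, on failure,
 * prints the numeric error and its description to stderr.
 * pthread_create returns the error number directly (it does not set errno). */
static int create_thread_verbose(pthread_t *thread,
                                 void *(*start)(void *), void *arg) {
  assert(thread != NULL);
  int r = pthread_create(thread, NULL, start, arg);
  if (r != 0) {
    fprintf(stderr, "pthread_create failed: %d (%s)\n", r, strerror(r));
  }
  return r;
}
```

With output like this in the CI log, it would be clear whether the failure is EAGAIN, ENOMEM, or something else entirely, which in turn decides whether a retry makes sense.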

@Ronitsabhaya75
Author

@joyeecheung I was able to reproduce the same error, tested it again, and tried to fix it. I think you can review it now.

@Ronitsabhaya75
Author

@joyeecheung can you please review this PR now, please?

@joyeecheung
Member

Please refrain from continuous pinging. Looking into flakes is something that I only do in my spare time.

@Ronitsabhaya75
Author

I'm sorry, I didn't know. My bad 🙏

@joyeecheung
Member

joyeecheung commented Sep 4, 2025

FWIW I don't think I've seen this in the CI (https://ci.nodejs.org/job/node-test-commit-arm-debug/ looks mostly green with only other failures every now and then) so that might've already been fixed by the split.

@Ronitsabhaya75
Author

Gotcha, I can close this PR and see if there is anything else I can work on.

Thank you @joyeecheung for letting me know
