Rewrote parTraverseN and parTraverseN_ for better performance #4451
base: series/3.6.x
Conversation
Pros and cons on performance, though I think it's possible to do better here. It's a little bit slower than the previous implementation in the happy path, but it's several orders of magnitude faster in the error path, so I'll call that a win.

[benchmark results: before vs. after]
So I haven't golfed the failure down yet, but it really looks like we're hitting a bug in Scala.js, probably stemming from the "null safe" test. @durban, you may be amused. I think we could just remove the null safe test now since we're not using an …
Well, "amused" is one word for it :-) So it's not a bug in Scala.js, as in, it behaves as documented: dereferencing `null` is undefined behavior.
Well, that's fun. I actually thought we had some special checking for when the …
That we do check. It's this line: https://github.com/typelevel/cats-effect/blob/series/3.x/core/shared/src/main/scala/cats/effect/IO.scala#L2024 (and …)
Ahhhhhhh, that makes sense. Okay, by that token, I think it's fair to say that a lot of our combinators just aren't null-safe.
It's annoying, because in Scala they are null safe. (The test passed before; it just failed on JS.) We'd have to do something like this (everywhere) to make it work on JS:

```scala
def combinator(fa: F[A], ...) = {
  if (fa eq null) throw new NullPointerException
  ...
}
```

Which is (1) annoying, (2) very redundant, except on Scala.js, and (3) apparently has performance problems in Scala.js (or maybe that's only with the linker setting?). I don't propose we do this. There is a Scala.js linker setting which fixes the problem. On the JVM and in Scala Native it works by default.
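To make the platform difference concrete, a hypothetical illustration (not from the PR; `fa` and `fb` are made up): on the JVM and Scala Native, invoking a method on a `null` receiver throws `NullPointerException` eagerly at the call site, whereas under Scala.js's default linker semantics a `null` dereference is undefined behavior.

```scala
import cats.effect.IO

val fa: IO[Int] = null

// JVM / Scala Native: NullPointerException is thrown right here, eagerly,
// which is why combinators look "null safe" on those platforms.
// Scala.js (default linker semantics): dereferencing null is undefined
// behavior, so no NPE is guaranteed, and a null-safety test can fail.
val fb: IO[Int] = fa.flatMap(IO.pure)
```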
```scala
   */
  def parTraverseN[T[_]: Traverse, A, B](n: Int)(ta: T[A])(f: A => F[B]): F[T[B]] = {
    require(n >= 1, s"Concurrency limit should be at least 1, was: $n")
```
Scaladoc needs an update above.
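For orientation, a hypothetical call site (my sketch, not from the PR; I believe the extension syntax comes from `cats.effect.implicits._` in cats-effect 3, but the exact import path is from memory and may differ):

```scala
import cats.effect.IO
import cats.effect.implicits._ // assumed to provide the parTraverseN syntax

def fetch(url: String): IO[Int] =
  IO(url.length) // stand-in for real work

// run `fetch` over the list with at most 4 effects in flight at once
val sizes: IO[List[Int]] =
  List("a", "bb", "ccc").parTraverseN(4)(fetch)
```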
```scala
   */
  def parTraverseN_[T[_]: Foldable, A, B](n: Int)(ta: T[A])(f: A => F[B]): F[Unit] = {
    require(n >= 1, s"Concurrency limit should be at least 1, was: $n")
```
Scaladoc above.
```scala
case None =>
  F.uncancelable { poll =>
    F.deferred[Outcome[F, E, B]] flatMap { result =>
      val action = poll(sem.acquire) >> f(a)
```
Is this intentionally `>>` and not `*>`? So that evaluating the pure `f` is also restricted by the semaphore? (In my opinion it doesn't need to be, but it's okay that it is.)
It's intentional. I should probably comment it as such. I think most users probably believe that even the pure part of the function is parallelized (and subject to the semaphore).
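To spell out the distinction for readers following along (an illustrative sketch, not code from the PR; `f` and `gate` are made up here): on `IO`, `>>` takes its right-hand side by name, while `*>` is strict in its argument.

```scala
import cats.effect.IO

// Hypothetical `f`: evaluating the *pure* part (everything before the
// returned IO actually runs) has an observable effect, for demonstration.
def f(a: Int): IO[Int] = {
  println(s"pure part of f($a) evaluated")
  IO.pure(a + 1)
}

val gate: IO[Unit] = IO.println("permit acquired") // stand-in for poll(sem.acquire)

val byName = gate >> f(1) // f(1) is evaluated only after `gate` completes
val strict = gate *> f(1) // f(1) is evaluated right now, before any permit is held
```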
```scala
result
  .get
  .flatMap(_.embed(F.canceled *> F.never))
  .onCancel(fiber.cancel)
```
When is this `onCancel` necessary? Wouldn't the `guaranteeCase` below cancel everything in `supervision` anyway?
I think it's not necessary. I've been building this up a bit incrementally so there's some overlapping logic I need to deduplicate due to the number of corner cases this function has.
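For readers unfamiliar with `Outcome#embed` in the snippet above, this is roughly its semantics, paraphrased (my sketch of the kernel's behavior, not the PR's code):

```scala
import cats.effect.Outcome
import cats.effect.kernel.MonadCancel

// Fold the child fiber's outcome back into F: surface its result, rethrow
// its error, or, if it canceled itself, run the fallback. In the diff above
// the fallback is F.canceled *> F.never, i.e. the caller self-cancels too,
// with F.never as the (unreachable) continuation.
def embedLike[F[_], E, B](oc: Outcome[F, E, B], onCancel: F[B])(
    implicit F: MonadCancel[F, E]): F[B] =
  oc match {
    case Outcome.Succeeded(fb) => fb
    case Outcome.Errored(e) => F.raiseError(e)
    case Outcome.Canceled() => onCancel
  }
```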
```scala
// we block until it's all done by acquiring all the permits
F.race(preempt.get *> cancelAll, sem.acquire.replicateA_(n)) *>
  // if we hit an error or self-cancelation in any effect, resurface it here
  // note that we can't lose errors here because of the permits: we know the fibers are done
```
I think there may be a race here:

- The very last task fails with an error, and releases its permit (`sem.release` above in `wrapped`).
- Acquiring all the permits here wins the `F.race` (just above).
- Just below we `preempt.tryGet`, read `None`, and complete with `F.unit`.
- The task completes `preempt` with the error (above in `wrapped`).
- (But no one will see that any more.)
I think this is a good point. I do this in both implementations. My thinking was that it increases parallelism somewhat (releasing the permit asap), but it does generate this race condition. I'll fix it in both.
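A minimal sketch of the reordering fix, with hypothetical stand-ins for the diff's names (`wrapped`, `sem`, `preempt`; this is not the PR's actual code): the permit release must happen strictly after the error is recorded, so that winning the reacquisition race implies every error is already observable.

```scala
import cats.effect.{Deferred, IO, Outcome}
import cats.effect.std.Semaphore

def wrapped(task: IO[Unit], sem: Semaphore[IO], preempt: Deferred[IO, Throwable]): IO[Unit] =
  task.guaranteeCase {
    // complete `preempt` first, release the permit second; a reader that
    // subsequently reacquires all the permits is then guaranteed to see
    // the error via preempt.tryGet
    case Outcome.Errored(e) => preempt.complete(e).void *> sem.release
    case _ => sem.release
  }
```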
(Just some context about the …)
Could this fix hit 3.6.4?
There are a couple of failing tests related to early termination that I'm still trying to track down. I'm trying to find the spare time needed to push on it. Help definitely welcome! Otherwise I'll probably get to it within the next few weeks. Sorry :(
I see that the last CI run is green; do you mean that you want to reintroduce the tests removed in this commit? 599b790
This shifts to a fully bespoke implementation of `parTraverseN` and such. There are a few things left to clean up, such as a few more tests and running some comparative benchmarks, but early results are very promising. In particular, the failure case from #4434 appears to be around two to three orders of magnitude faster with this implementation (which makes sense, since it handles early abort correctly). Kudos to @SystemFw for the core idea which makes this possible.

One of the things I'm doing here is giving up entirely on universal fairness and focusing merely on in-batch fairness. A simpler way of saying this is that we are hardened against head-of-line blocking, both for actions and for cancelation.
Fixes #4434
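For readers trying to follow the design, here is a deliberately simplified sketch of the overall shape as I read the discussion above: permits bound concurrency, the first error completes a `preempt` deferred strictly before its permit is released, and the caller races the error signal against reacquiring all the permits. All names are invented; cancelation safety (`uncancelable`/`poll`), self-cancelation handling, early abort during submission, and fairness are elided. This is not the PR's implementation.

```scala
import cats.effect.{Deferred, IO}
import cats.effect.std.Semaphore
import cats.syntax.all._

def parTraverseNSketch[A, B](n: Int)(as: List[A])(f: A => IO[B]): IO[List[B]] = {
  require(n >= 1, s"Concurrency limit should be at least 1, was: $n")
  for {
    sem <- Semaphore[IO](n.toLong)
    preempt <- Deferred[IO, Throwable] // completed by the first failure
    results <- as.traverse(_ => Deferred[IO, B]) // one result slot per element
    fibers <- as.zip(results).traverse { case (a, res) =>
      // acquire in the submitting fiber, so at most n tasks run at once
      sem.acquire >>
        f(a)
          .flatMap(res.complete(_).void)
          .handleErrorWith(e => preempt.complete(e).void) // record the error first...
          .guarantee(sem.release) // ...and only then release the permit
          .start
    }
    _ <- IO.race(preempt.get, sem.acquireN(n.toLong)).flatMap {
      case Left(e) =>
        // early abort: an error surfaced, cancel everything still running
        fibers.traverse_(_.cancel) *> IO.raiseError(e)
      case Right(_) =>
        // all permits reacquired: every complete-before-release has already
        // happened, so any recorded error is guaranteed to be visible here
        preempt.tryGet.flatMap {
          case Some(e) => IO.raiseError(e)
          case None => IO.unit
        }
    }
    bs <- results.traverse(_.get)
  } yield bs
}
```

The trailing `tryGet` is the fix from the race discussed above: because each task completes `preempt` strictly before releasing its permit, winning the reacquisition arm of the race still cannot lose an error.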