parker89 commented on Dec 9, 2025

Currently DirectByteBufferDeallocator has one queue per thread (via ThreadLocal). There are two issues with the current approach:

With platform threads: We use JBoss's EnhancedQueueExecutor under the hood to share platform threads. If a platform thread uses the DirectByteBufferDeallocator and enqueues some QueuedByteBuffers, that thread might then be reused for an unrelated, non-Undertow task, and the buffers will remain in its ThreadLocal queue indefinitely, until the thread next uses the DirectByteBufferDeallocator.

With virtual threads: If a virtual thread enqueues some QueuedByteBuffers and then does other work (e.g. blocks on an I/O operation), the ByteBuffers in the queue may be held indefinitely, until the thread uses the DirectByteBufferDeallocator again. And when the thread completes, any remaining QueuedByteBuffers may have to wait for a GC cycle to find the ThreadLocal unreachable and run the Cleaners for the DirectByteBuffers still in the queue.
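Both issues come from the same property: a queue held in a ThreadLocal can only be reached by its owning thread. A minimal sketch of that property, using plain strings in place of QueuedByteBuffers and a hypothetical `PerThreadQueueDemo` class that is not part of Undertow:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not Undertow code.
final class PerThreadQueueDemo {
    // One queue per thread, mirroring the ThreadLocal approach described above.
    static final ThreadLocal<Queue<String>> QUEUE = ThreadLocal.withInitial(ArrayDeque::new);

    /** Enqueues on one thread, then reports how many items a second thread can see. */
    static int itemsVisibleToOtherThread() {
        AtomicInteger seen = new AtomicInteger(-1);
        Thread producer = new Thread(() -> QUEUE.get().add("buffer-1"));
        Thread other = new Thread(() -> seen.set(QUEUE.get().size()));
        try {
            producer.start();
            producer.join();   // the producer is done; its item stays in its own queue
            other.start();
            other.join();      // the second thread gets a fresh, empty queue
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

The second thread sees an empty queue, so nothing it does can release what the first thread left behind.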

The virtual thread issue is the most important motivator for this PR.

The change adapts code from Guava's Striped class.
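For readers unfamiliar with striping, here is a rough sketch of the general idea, not the code in this PR. The class name `StripedBufferQueues`, choosing a stripe from the current thread's hash code, and the use of ConcurrentLinkedQueue are all assumptions made for illustration. The stripe count is rounded up to a power of two so an index can be taken with a bit mask, and the hash is bit-spread first, in the style Guava uses.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of striped queues; names and stripe selection are assumptions.
final class StripedBufferQueues {
    private final List<Queue<ByteBuffer>> stripes = new ArrayList<>();
    private final int mask;

    StripedBufferQueues(int minStripes) {
        // Round up to a power of two so a stripe can be picked with a bit mask.
        int size = 1;
        while (size < minStripes) {
            size <<= 1;
        }
        for (int i = 0; i < size; i++) {
            stripes.add(new ConcurrentLinkedQueue<>());
        }
        mask = size - 1;
    }

    /** Sizes the stripes by processor count rather than by thread count. */
    static StripedBufferQueues perProcessor() {
        return new StripedBufferQueues(Runtime.getRuntime().availableProcessors());
    }

    int stripeCount() {
        return stripes.size();
    }

    /** Spreads the hash bits, then masks down to a valid stripe index. */
    int indexFor(int hash) {
        hash ^= (hash >>> 20) ^ (hash >>> 12);
        hash ^= (hash >>> 7) ^ (hash >>> 4);
        return hash & mask;
    }

    /** Assumption: the stripe is chosen from the calling thread's identity hash. */
    Queue<ByteBuffer> forCurrentThread() {
        return stripes.get(indexFor(Thread.currentThread().hashCode()));
    }
}
```

Because the queues belong to the shared structure rather than to a thread, any thread that maps to a stripe can drain it, and the number of queues stays bounded no matter how many virtual threads are created.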

A potential downside of this change: a task (task2) that deallocates a small number of buffers using DirectByteBufferDeallocator.free() might share a queue stripe with another task (task1) that has just enqueued a large number of buffers, and so has to deallocate task1's buffers before it can proceed. In practice, this issue already exists under a cached thread pool model (e.g. task1 runs to completion on thread1, and then task2 runs on thread1).

It was unclear to me whether the benchmarks are still in use: they haven't been updated for 6 years and didn't work when I tried to run them on the main branch (I had to update the JMH version and add the JMH annotation processor). I ran them regardless.

I ran the benchmarks with:
java -jar benchmarks/target/undertow-benchmarks.jar SimpleBenchmarks

Results are mostly unchanged (note: lower numbers are better); every before/after difference is well within the reported error margins. I ran on a fresh 8-core / 16 GB coder instance.

| Benchmark                  | Listener Type | Before (us/op)          | After (us/op)           |
|----------------------------|---------------|-------------------------|-------------------------|
| benchmarkBlockingEmptyGet  | HTTP          | 514.685 ± 177.656       | 518.024 ± 780.275       |
| benchmarkBlockingEmptyGet  | HTTPS         | 595.211 ± 183.514       | 625.623 ± 757.807       |
| benchmarkBlockingEmptyPost | HTTP          | 511.489 ± 479.258       | 523.355 ± 272.388       |
| benchmarkBlockingEmptyPost | HTTPS         | 597.913 ± 573.488       | 617.732 ± 557.635       |
| benchmarkBlockingLargeGet  | HTTP          | 26388.968 ± 27471.126   | 27707.832 ± 20734.965   |
| benchmarkBlockingLargeGet  | HTTPS         | 27581.427 ± 34525.577   | 26756.817 ± 34031.476   |
| benchmarkBlockingLargePost | HTTP          | 7210.632 ± 2138.297     | 7141.661 ± 527.309      |
| benchmarkBlockingLargePost | HTTPS         | 10041.778 ± 6778.218    | 10398.981 ± 9007.884    |

parker89 changed the title from "stripe deallocator" to "Stripe the queues in DirectByteBufferDeallocator by processor count instead of by thread" on Dec 9, 2025