Stripe the queues in DirectByteBufferDeallocator by processor count instead of by thread #1851
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Currently
DirectByteBufferDeallocatorhas one queue per thread (via ThreadLocal). There are two issues with the current approach:With platform threads: We use JBoss's
EnhancedQueueExecutorunder the hood to share platform threads. If a platform thread uses theDirectByteBufferDeallocatorand enqueues someQueuedByteBuffers, that platform thread might be repurposed for another non-undertow task, and the buffers will remain in theThreadLocalqueue indefinitely until the thread again uses theDirectByteBufferDeallocator.With virtual threads: If a virtual thread puts some
QueuedByteBuffers into the queue, and then does other stuff (eg being blocked on some IO operation), theByteBuffers in the queue will be held potentially indefinitely until the thread uses theDirectByteBufferDeallocatoragain. And when the thread completes, anyQueuedByteBuffers will potentially have to wait for a GC cycle to mark theThreadLocalas unreachable and run the Cleaners for anyDirectByteBuffers still in the queue.The virtual thread issue is the most important motivator for this PR.
This code adapts some code from Guava's Striped class.
A potential downside of this change is a task2 that deallocates a small number of buffers using
DirectByteBufferDeallocator.free()might share a queue stripe with another task1 that had previously just enqueued a large number of buffers, and thus will have to deallocate the larger number of buffers before it can proceed. In practice, this issue exists already, assuming a cached thread pool model (eg task1 runs/completes on thread1 and then task2 runs on thread1).It was unclear to me if the benchmarks are still in use since they haven't been updated for 6 years and didn't work for me when I attempted to run them on the main branch (I had to update the JMH version and add the JMH annotation processor), but I ran them regardless.
I ran the benchmarks with:
java -jar benchmarks/target/undertow-benchmarks.jar SimpleBenchmarksResults are mostly unchanged (note: lower numbers are better). I ran on a fresh 8 core / 16 GB coder instance