
Thread filter optim #238


Open: wants to merge 16 commits into main

Conversation

@r1viollet (Collaborator) commented Jul 7, 2025

What does this PR do?:

  • Reserve padded slots
  • Introduce register / unregister operations to retrieve slots
  • Manage a free list
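The mechanics described above (padded slots, register/unregister, and a free list) could be sketched roughly as follows. All names here (PaddedSlotFilter, kMaxSlots, isActive) are illustrative, not this PR's actual API, and ABA handling on the free list is omitted for brevity:

```cpp
#include <atomic>

// Rough sketch: cache-line-padded slots, register/unregister to
// acquire/release a slot, and a Treiber-style free list.
// Hypothetical names; ABA is ignored in this sketch.
class PaddedSlotFilter {
  static constexpr int kCacheLine = 64;
  static constexpr int kMaxSlots = 1024;

  struct alignas(kCacheLine) Slot {
    std::atomic<int> value{-1};      // -1 marks the slot as unused
    std::atomic<int> next_free{-1};  // free-list link (slot index)
  };

  Slot _slots[kMaxSlots];
  std::atomic<int> _free_head{-1};   // top of the free list, -1 if empty
  std::atomic<int> _next_index{0};   // bump allocator for fresh slots

public:
  // Acquire a slot for this thread: pop the free list, else bump-allocate.
  int registerThread(int tid) {
    int idx = _free_head.load(std::memory_order_acquire);
    while (idx != -1) {
      int next = _slots[idx].next_free.load(std::memory_order_relaxed);
      if (_free_head.compare_exchange_weak(idx, next,
                                           std::memory_order_acq_rel)) {
        _slots[idx].value.store(tid, std::memory_order_release);
        return idx;
      }
      // CAS failure reloaded idx with the current head; retry.
    }
    idx = _next_index.fetch_add(1, std::memory_order_relaxed);
    if (idx >= kMaxSlots) return -1;  // out of capacity
    _slots[idx].value.store(tid, std::memory_order_release);
    return idx;
  }

  // Release a slot: clear it, then push it back onto the free list.
  void unregisterThread(int idx) {
    _slots[idx].value.store(-1, std::memory_order_release);
    int head = _free_head.load(std::memory_order_relaxed);
    do {
      _slots[idx].next_free.store(head, std::memory_order_relaxed);
    } while (!_free_head.compare_exchange_weak(head, idx,
                                               std::memory_order_release));
  }

  bool isActive(int idx) const {
    return _slots[idx].value.load(std::memory_order_acquire) != -1;
  }
};
```

The per-slot padding keeps each slot on its own cache line, so many threads updating their own slots concurrently do not false-share, which is what should help throughput under many context updates.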

Motivation:

Improve throughput of applications that run on many threads with many context updates.

Additional Notes:

How to test the change?:

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

Unsure? Have a question? Request a review!

github-actions bot commented Jul 7, 2025

🔧 Report generated by pr-comment-cppcheck

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (306)

github-actions bot commented Jul 7, 2025

🔧 Report generated by pr-comment-scanbuild

@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch 3 times, most recently from e5bce28 to 0918008 Compare July 7, 2025 11:45
@r1viollet (Collaborator, Author):

I have reasonable performance on most runs:

Benchmark                                                       (command)  (skipResults)  (workload)  Mode  Cnt    Score   Error  Units
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true           0  avgt         0.039          us/op
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true           7  avgt         0.041          us/op
ThreadFilterBenchmark.threadFilterStress01  cpu=100us,wall=100us,filter=1           true       70000  avgt       111.094          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true           0  avgt         0.132          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true           7  avgt         0.139          us/op
ThreadFilterBenchmark.threadFilterStress02  cpu=100us,wall=100us,filter=1           true       70000  avgt       108.666          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true           0  avgt         0.258          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true           7  avgt         0.278          us/op
ThreadFilterBenchmark.threadFilterStress04  cpu=100us,wall=100us,filter=1           true       70000  avgt       118.940          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true           0  avgt         0.624          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true           7  avgt         0.646          us/op
ThreadFilterBenchmark.threadFilterStress08  cpu=100us,wall=100us,filter=1           true       70000  avgt       160.170          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true           0  avgt         1.780          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true           7  avgt         2.288          us/op
ThreadFilterBenchmark.threadFilterStress16  cpu=100us,wall=100us,filter=1           true       70000  avgt       221.987          us/op

I'm not sure why some runs still blow up for higher numbers of threads.

@r1viollet r1viollet mentioned this pull request Jul 7, 2025
@jbachorik jbachorik force-pushed the r1viollet/thread_filter_squash branch 2 times, most recently from e0ac246 to 2421ba9 Compare July 10, 2025 12:48
@r1viollet (Collaborator, Author) commented Jul 10, 2025

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (305)

@jbachorik jbachorik force-pushed the r1viollet/thread_filter_squash branch from 2421ba9 to 50a8d5f Compare July 21, 2025 14:57
@jbachorik (Collaborator):

I ran a comparison of native memory usage with different thread filter implementations - the data is in the notebook.

TL;DR: there is no observable increase in native memory usage (the UNDEFINED category). Still, it would be useful to have an extra counter for ThreadIDTable utilization.

jbachorik and others added 3 commits July 24, 2025 21:43
If the TLS cleanup fires before the JVMTI hook, we want to
ensure that we don't crash while retrieving the ProfiledThread
- Add a check on validity of ProfiledThread
- Start the profiler to ensure we have valid thread objects
- Add asserts around missing thread object
- Remove print (replacing with an assert)
@r1viollet (Collaborator, Author) commented Aug 21, 2025

CppCheck Report

Errors (2)

Warnings (8)

Style Violations (305)

@r1viollet r1viollet marked this pull request as ready for review August 21, 2025 07:53
@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch from 6171739 to e30d88f Compare August 21, 2025 14:05
- Fix removal of self in timerloop init
it was not using a slotID but a thread ID

- Add assertion to find other potential issues
@r1viollet r1viollet force-pushed the r1viollet/thread_filter_squash branch from e30d88f to e78a6b2 Compare August 21, 2025 14:08
@@ -16,6 +16,7 @@

#include <assert.h>

#include "arch_dd.h"
Collaborator (Author):

Having to include this header just for unlikely is not ideal.

Java_com_datadoghq_profiler_JavaProfiler_filterThreadRemove0(JNIEnv *env,
jobject unused) {
ProfiledThread *current = ProfiledThread::current();
if (unlikely(current == nullptr)) {
@zhengyu123 (Contributor) Aug 21, 2025:
I think assert(current != nullptr) should be sufficient; otherwise, we have a bigger problem.

return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Contributor:

Is this case actually possible? Or should we just assert?

int tid = ProfiledThread::currentTid();
if (tid < 0) {
ProfiledThread *current = ProfiledThread::current();
if (unlikely(current == nullptr)) {
Contributor:

Same as above

return;
}
int tid = current->tid();
if (unlikely(tid < 0)) {
Contributor:

Same as above

@@ -479,6 +469,10 @@ public Map<String, Long> getDebugCounters() {
private static native boolean init0();
private native void stop0() throws IllegalStateException;
private native String execute0(String command) throws IllegalArgumentException, IllegalStateException, IOException;

private native void filterThreadAdd0();
Contributor:

It looks like filterThreadAdd0() == filterThread0(true) and filterThreadRemove0() == filterThread0(false). Please remove the duplication.

void collect(std::vector<int> &v);
private:
// Optimized slot structure with padding to avoid false sharing
struct alignas(64) Slot {
@zhengyu123 (Contributor) Aug 21, 2025:

We have a definition of DEFAULT_CACHE_LINE_SIZE in dd_arch.h. I would suggest the following code for readability and portability:

  struct alignas(DEFAULT_CACHE_LINE_SIZE) Slot {
      std::atomic<int> value{-1};
      char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(value)];
  };
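As a sanity check, the suggested layout can be verified at compile time. This assumes DEFAULT_CACHE_LINE_SIZE is 64, as on common x86-64/aarch64 targets; in the project the constant comes from the arch header rather than being defined inline:

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines; the project's arch header defines
// the real DEFAULT_CACHE_LINE_SIZE.
constexpr std::size_t DEFAULT_CACHE_LINE_SIZE = 64;

struct alignas(DEFAULT_CACHE_LINE_SIZE) Slot {
  std::atomic<int> value{-1};
  char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(std::atomic<int>)];
};

// Each Slot occupies exactly one cache line, so adjacent slots in an
// array cannot false-share.
static_assert(sizeof(Slot) == DEFAULT_CACHE_LINE_SIZE, "one line per slot");
static_assert(alignof(Slot) == DEFAULT_CACHE_LINE_SIZE, "line-aligned");
```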

std::atomic<SlotID> _next_index{0};
std::unique_ptr<FreeListNode[]> _free_list;

struct alignas(64) ShardHead { std::atomic<int> head{-1}; };
Contributor:

Use DEFAULT_CACHE_LINE_SIZE for readability and portability.

}
// Try to install it atomically
ChunkStorage* expected = nullptr;
if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_acq_rel)) {
Contributor:

memory_order_release should be sufficient.
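For context, the release/acquire pairing being discussed in these threads could look like this minimal sketch (g_chunk and get_or_create_chunk are illustrative names, not the PR's code). The writer fully constructs the chunk, then publishes it with a release CAS; readers must load the pointer with acquire so the chunk's contents are visible:

```cpp
#include <atomic>

struct ChunkStorage { int slots[64]; };

std::atomic<ChunkStorage*> g_chunk{nullptr};

ChunkStorage* get_or_create_chunk() {
  // Acquire load pairs with the release CAS below.
  ChunkStorage* c = g_chunk.load(std::memory_order_acquire);
  if (c) return c;
  ChunkStorage* fresh = new ChunkStorage{};   // construct first...
  ChunkStorage* expected = nullptr;
  // ...then publish. Release suffices on success (we never read through
  // a pointer we just installed); acquire on failure so 'expected' - the
  // chunk another thread installed - is safe to use.
  if (g_chunk.compare_exchange_strong(expected, fresh,
                                      std::memory_order_release,
                                      std::memory_order_acquire)) {
    return fresh;
  }
  delete fresh;   // lost the race; use the winner's chunk
  return expected;
}
```

The same reasoning drives the other comments below: acquire on the loads that dereference the chunk pointer, release (not acq_rel) on the stores that publish it, and relaxed where the value is only a flag with no dependent data.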


ThreadFilter::SlotID ThreadFilter::registerThread() {
// If disabled, block new registrations
if (!_enabled.load(std::memory_order_acquire)) {
Contributor:

I don't see what memory ordering _enabled provides here. Could you explain which release store this acquire is meant to pair with?

return;
}

ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
@zhengyu123 (Contributor) Aug 21, 2025:

I think you need memory_order_acquire ordering here to match the release store.

@zhengyu123 (Contributor) left a review comment:

I did a partial second-round review; I think there are many inconsistencies in the memory ordering.

_num_chunks.store(0, std::memory_order_release);
// Detach and delete chunks
for (int i = 0; i < kMaxChunks; ++i) {
ChunkStorage* chunk = _chunks[i].exchange(nullptr, std::memory_order_acq_rel);
Contributor:

memory_order_acquire instead of memory_order_acq_rel

int slot_idx = slot_id & kChunkMask;

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Contributor:

Need memory_order_acquire ordering

// Fast path: assume valid slot_id from registerThread()
ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
if (likely(chunk != nullptr)) {
return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
Contributor:

memory_order_relaxed should be sufficient.
