Thread filter optim #238
Conversation
🔧 Report generated by pr-comment-scanbuild
Force-pushed from e5bce28 to 0918008
I have reasonable performance on most runs. I'm not sure why some runs still blow up for higher numbers of threads.
Force-pushed from e0ac246 to 2421ba9
- Reserve padded slots
- Introduce a register / unregister to retrieve slots
- Manage a free list
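For readers unfamiliar with the approach described in the commit above, here is a minimal sketch of the register / unregister idea: a flat table of cache-line-padded slots handed out through a free list. The names (SlotTable, PaddedSlot), the capacity, and the mutex-protected free list are illustrative assumptions; the actual ThreadFilter in this PR uses chunked storage and atomics rather than a mutex.

#include <atomic>
#include <cassert>
#include <mutex>
#include <vector>

constexpr int kCacheLine = 64;    // illustrative; the project uses DEFAULT_CACHE_LINE_SIZE
constexpr int kCapacity  = 1024;  // illustrative fixed capacity

struct alignas(kCacheLine) PaddedSlot {
  std::atomic<int> value{-1};     // -1 means "no thread registered here"
};

class SlotTable {
public:
  SlotTable() {
    _free.reserve(kCapacity);
    for (int i = kCapacity - 1; i >= 0; --i) _free.push_back(i);
  }

  // Hand out a slot index; the thread caches it and afterwards touches only
  // its own cache-line-padded slot, avoiding false sharing with other threads.
  int registerThread(int tid) {
    std::lock_guard<std::mutex> lock(_free_lock);
    if (_free.empty()) return -1;
    int slot = _free.back();
    _free.pop_back();
    _slots[slot].value.store(tid, std::memory_order_release);
    return slot;
  }

  // Return the slot to the free list when the thread goes away.
  void unregisterThread(int slot) {
    assert(slot >= 0 && slot < kCapacity);
    _slots[slot].value.store(-1, std::memory_order_release);
    std::lock_guard<std::mutex> lock(_free_lock);
    _free.push_back(slot);
  }

private:
  PaddedSlot _slots[kCapacity];
  std::mutex _free_lock;
  std::vector<int> _free;         // indices of unused slots
};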
Force-pushed from 2421ba9 to 50a8d5f
I ran a comparison of native memory usage with different thread filter implementations; the data is in the notebook. TL;DR: there is no observable increase in native memory usage.
…t/thread_filter_squash
If the TLS cleanup fires before the JVMTI hook, we want to ensure that we don't crash while retrieving the ProfiledThread:
- Add a check on the validity of ProfiledThread
- Start the profiler to ensure we have valid thread objects
- Add asserts around missing thread object
- Remove print (replacing with an assert)
…t/thread_filter_squash
- Adjust ordering of free list init
- Honour the enable / disable flag
- Remove bound checks
Force-pushed from 6171739 to e30d88f
- Fix removal of self in timerloop init: it was not using a slotID but a thread ID
- Add assertion to find other potential issues
Force-pushed from e30d88f to e78a6b2
@@ -16,6 +16,7 @@
#include <assert.h>
#include "arch_dd.h"
having to include this for unlikely is not perfect
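For context, unlikely is usually a thin wrapper around the compiler's branch-prediction hint. The contents of arch_dd.h are not shown in this thread, so the following is only the common definition pattern, not necessarily the project's actual one:

// Common pattern for likely/unlikely (assumption; arch_dd.h may differ).
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif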
Java_com_datadoghq_profiler_JavaProfiler_filterThreadRemove0(JNIEnv *env,
                                                             jobject unused) {
  ProfiledThread *current = ProfiledThread::current();
  if (unlikely(current == nullptr)) {
I think assert(current != nullptr) should be sufficient; otherwise, we have a bigger problem.
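One way to read that suggestion, keeping a guarded return for release builds where assert compiles to nothing (an illustration, not the actual patch):

  ProfiledThread *current = ProfiledThread::current();
  assert(current != nullptr);   // documents the invariant in debug builds
  if (current == nullptr) {     // defensive bail-out in release builds
    return;
  }
  int tid = current->tid();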
    return;
  }
  int tid = current->tid();
  if (unlikely(tid < 0)) {
Is it possible? Or should we just assert?
  int tid = ProfiledThread::currentTid();
  if (tid < 0) {
    ProfiledThread *current = ProfiledThread::current();
    if (unlikely(current == nullptr)) {
Same as above
      return;
    }
    int tid = current->tid();
    if (unlikely(tid < 0)) {
Same as above
@@ -479,6 +469,10 @@ public Map<String, Long> getDebugCounters() {
  private static native boolean init0();
  private native void stop0() throws IllegalStateException;
  private native String execute0(String command) throws IllegalArgumentException, IllegalStateException, IOException;

  private native void filterThreadAdd0();
It looks like filterThreadAdd0 == filterThread0(true) and filterThreadRemove0 == filterThread0(false). Please remove the duplication.
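Assuming filterThread0 takes a boolean, as the comment implies, the two wrappers could collapse into a single native entry point along these lines. The JNI signature below is an assumption for illustration, not the actual code:

// Hypothetical consolidated JNI entry point (requires <jni.h>); the body
// only indicates intent.
extern "C" JNIEXPORT void JNICALL
Java_com_datadoghq_profiler_JavaProfiler_filterThread0(JNIEnv *env,
                                                       jobject unused,
                                                       jboolean add) {
  ProfiledThread *current = ProfiledThread::current();
  if (unlikely(current == nullptr)) {
    return;
  }
  // add == JNI_TRUE registers this thread with the filter,
  // add == JNI_FALSE removes it.
}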
  void collect(std::vector<int> &v);
private:
  // Optimized slot structure with padding to avoid false sharing
  struct alignas(64) Slot {
We have a definition of DEFAULT_CACHE_LINE_SIZE in dd_arch.h. I would suggest the following code for readability and portability.
struct alignas(DEFAULT_CACHE_LINE_SIZE) Slot {
  std::atomic<int> value{-1};
  char padding[DEFAULT_CACHE_LINE_SIZE - sizeof(value)];
};
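As a follow-up thought (my addition, not part of the review comment), a static_assert can pin the intent down; note that alignas alone already rounds sizeof(Slot) up to the alignment, so the explicit padding member is mostly self-documenting:

static_assert(sizeof(Slot) == DEFAULT_CACHE_LINE_SIZE,
              "Slot must occupy exactly one cache line");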
  std::atomic<SlotID> _next_index{0};
  std::unique_ptr<FreeListNode[]> _free_list;

  struct alignas(64) ShardHead { std::atomic<int> head{-1}; };
Use DEFAULT_CACHE_LINE_SIZE for readability and portability.
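That is, mirroring the Slot suggestion above, presumably something like:

struct alignas(DEFAULT_CACHE_LINE_SIZE) ShardHead {
  std::atomic<int> head{-1};
};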
  }
  // Try to install it atomically
  ChunkStorage* expected = nullptr;
  if (_chunks[chunk_idx].compare_exchange_strong(expected, new_chunk, std::memory_order_acq_rel)) {
memory_order_release should be sufficient.
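To illustrate the pairing being discussed here and in the load comments further down, a simplified sketch (names shortened, not the actual ThreadFilter code): the writer publishes a fully constructed chunk with a release CAS, and readers observe it with an acquire load.

#include <atomic>

struct ChunkStorage { /* slots ... */ };
std::atomic<ChunkStorage*> chunk_ptr{nullptr};

// Writer: release on success is enough; the writer does not need to observe
// anything written by readers, so acq_rel adds nothing here.
void publish(ChunkStorage* new_chunk) {
  ChunkStorage* expected = nullptr;
  if (!chunk_ptr.compare_exchange_strong(expected, new_chunk,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
    delete new_chunk;  // another thread installed a chunk first
  }
}

// Reader: the acquire load pairs with the release store above, so the
// chunk's contents are visible once the pointer is seen as non-null.
ChunkStorage* current_chunk() {
  return chunk_ptr.load(std::memory_order_acquire);
}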
ThreadFilter::SlotID ThreadFilter::registerThread() {
  // If disabled, block new registrations
  if (!_enabled.load(std::memory_order_acquire)) {
I don't see what memory ordering _enabled is providing. Could you explain what it releases?
    return;
  }

  ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
I think you need memory_order_acquire ordering here to match the release store.
I did a partial second-round review; I think there are many inconsistencies in the memory ordering.
  _num_chunks.store(0, std::memory_order_release);
  // Detach and delete chunks
  for (int i = 0; i < kMaxChunks; ++i) {
    ChunkStorage* chunk = _chunks[i].exchange(nullptr, std::memory_order_acq_rel);
memory_order_acquire instead of memory_order_acq_rel.
  int slot_idx = slot_id & kChunkMask;

  // Fast path: assume valid slot_id from registerThread()
  ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
Need memory_order_acquire ordering.
  // Fast path: assume valid slot_id from registerThread()
  ChunkStorage* chunk = _chunks[chunk_idx].load(std::memory_order_relaxed);
  if (likely(chunk != nullptr)) {
    return chunk->slots[slot_idx].value.load(std::memory_order_acquire) != -1;
memory_order_relaxed should be sufficient.
What does this PR do?:
Motivation:
Improve throughput of applications that run on many threads with many context updates.
Additional Notes:
How to test the change?:
For Datadog employees:
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
Unsure? Have a question? Request a review!