reload: fix race between watchdog start and setting async cancellation #11012
Walkthrough
Introduces a cooperative shutdown for the hot-reload watchdog in src/flb_reload.c by adding a should_stop flag, switching the watchdog thread to a timed loop with 100 ms checks, removing pthread cancellation, and updating startup/cleanup to initialize the flag, signal stop, join the thread, and log accordingly.
Sequence Diagram(s)
sequenceDiagram
    autonumber
    participant Main as Main Process
    participant WD as Watchdog Thread
    Note over Main: Startup
    Main->>WD: create thread (should_stop=0, timeout_ms)
    WD->>WD: loop every 100ms<br/>check elapsed_ms and should_stop
    alt should_stop set
        WD-->>Main: return early (graceful exit)
    else timeout reached
        WD-->>Main: log error and abort path
    end
    Note over Main: Cleanup
    Main->>WD: set should_stop=1
    Main->>WD: pthread_join
    WD-->>Main: exit joined (logged)
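The flow in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the struct layout, the helper names (sleep_ms, watchdog_start, watchdog_stop) and the timed_out field are hypothetical, the sleep uses POSIX nanosleep, and the timeout path only records a flag where the real watchdog logs an error and aborts.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical context: field names follow the walkthrough, not the patch. */
struct watchdog_ctx {
    pthread_t tid;
    int timeout_ms;          /* give up after this many milliseconds */
    atomic_int should_stop;  /* set to 1 by the owner to request exit */
    atomic_int timed_out;    /* set to 1 by the thread on timeout (sketch only) */
};

/* Sleep for the given number of milliseconds (POSIX only). */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Thread body: poll the stop flag every 100 ms until the timeout elapses. */
static void *watchdog_thread(void *arg)
{
    struct watchdog_ctx *ctx = arg;
    int elapsed_ms = 0;

    while (elapsed_ms < ctx->timeout_ms) {
        if (atomic_load(&ctx->should_stop)) {
            return NULL;     /* graceful exit requested by the owner */
        }
        sleep_ms(100);
        elapsed_ms += 100;
    }

    /* Timeout path: the sketch records the event instead of aborting. */
    atomic_store(&ctx->timed_out, 1);
    return NULL;
}

/* Initialize the flags before the thread exists, then create the thread. */
static int watchdog_start(struct watchdog_ctx *ctx, int timeout_ms)
{
    assert(ctx != NULL);
    ctx->timeout_ms = timeout_ms;
    atomic_init(&ctx->should_stop, 0);
    atomic_init(&ctx->timed_out, 0);
    return pthread_create(&ctx->tid, NULL, watchdog_thread, ctx);
}

/* Signal stop, wait for the thread, and report whether it had timed out. */
static int watchdog_stop(struct watchdog_ctx *ctx)
{
    atomic_store(&ctx->should_stop, 1);
    pthread_join(ctx->tid, NULL);
    return atomic_load(&ctx->timed_out);
}
```

Because the thread only ever exits by returning from its own loop, pthread_join in watchdog_stop waits at most about one polling interval after the flag is set, and no cancellation state has to be configured before the owner can safely stop the thread.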
Signed-off-by: Bradley Laney <[email protected]>
1d50a18 to bc40ad2
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/flb_reload.c (1)
386-386: Prefer atomic operations for the thread-safe flag.
The volatile qualifier keeps the compiler from optimizing away reads and writes, but it guarantees neither atomicity nor memory ordering between threads on weakly ordered architectures. For proper cross-platform thread synchronization, consider using C11 _Atomic or platform-specific atomic operations (e.g., InterlockedCompareExchange on Windows, __sync_* builtins on GCC/Clang).
Example using C11 atomics:
    #include <stdatomic.h>
Then in the struct:
    - volatile int should_stop;
    + atomic_int should_stop;
And update accesses:
    // In cleanup:
    atomic_store(&watchdog_ctx->should_stop, 1);
    // In thread:
    if (atomic_load(&watchdog_ctx->should_stop)) {
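The platform-specific options named in this comment could be hidden behind a small wrapper along these lines. This is only a sketch: stop_flag_t and the stop_flag_* helpers are hypothetical names that do not come from Fluent Bit, and the choice of C11 atomics first, then Windows Interlocked functions, then GCC/Clang __sync builtins is an assumption about which toolchains are in play.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
/* C11 atomics: sequentially consistent load and store. */
#include <stdatomic.h>
typedef atomic_int stop_flag_t;

static void stop_flag_set(stop_flag_t *f)
{
    atomic_store(f, 1);
}

static int stop_flag_is_set(stop_flag_t *f)
{
    return atomic_load(f) != 0;
}

#elif defined(_WIN32)
/* Windows: Interlocked functions act as full memory barriers. */
#include <windows.h>
typedef volatile LONG stop_flag_t;

static void stop_flag_set(stop_flag_t *f)
{
    InterlockedExchange(f, 1);
}

static int stop_flag_is_set(stop_flag_t *f)
{
    /* Compare-exchange of 0 with 0 leaves the value alone and returns it. */
    return InterlockedCompareExchange(f, 0, 0) != 0;
}

#else
/* Older GCC/Clang: __sync builtins act as full memory barriers. */
typedef volatile int stop_flag_t;

static void stop_flag_set(stop_flag_t *f)
{
    (void) __sync_fetch_and_or(f, 1);
}

static int stop_flag_is_set(stop_flag_t *f)
{
    /* Adding 0 reads the current value atomically. */
    return __sync_fetch_and_add(f, 0) != 0;
}
#endif

/* Clear the flag; call this before the watchdog thread is created. */
static void stop_flag_init(stop_flag_t *f)
{
    assert(f != NULL);
    *f = 0;
}
```

With a wrapper like this, the cleanup path would call stop_flag_set and the thread loop would call stop_flag_is_set, so the struct field and both call sites stay identical on every platform.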
🔇 Additional comments (3)
src/flb_reload.c (3)
391-405: LGTM! Cooperative shutdown eliminates the cancellation race.
The new timed-loop approach with periodic checks is much more reliable and easier to reason about than asynchronous pthread cancellation. This properly addresses the race condition described in the PR where pthread_cancel might be called before pthread_setcanceltype completes.
429-429: LGTM!
Correct initialization of the flag before thread creation.
448-453: LGTM! Clean cooperative shutdown.
The cleanup sequence properly signals the thread to stop and waits for graceful exit. This is significantly more reliable than pthread_cancel, especially on Windows where the original issue was observed.
We saw an issue on Windows where, intermittently, when invalid configs were reloaded, the pthread_join call would not complete even though pthread_cancel had been called. The theory is that pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL); and pthread_cancel(watchdog_ctx->tid); raced. This change moves away from PTHREAD_CANCEL_ASYNCHRONOUS entirely in favor of something easier to reason about on the different platforms targeted.