@stoksc stoksc commented Oct 11, 2025

We saw an intermittent issue on Windows where, when an invalid config was reloaded, the pthread_join call would not complete even though pthread_cancel had been called. The theory is that pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL); and pthread_cancel(watchdog_ctx->tid); raced. This change moves away from PTHREAD_CANCEL_ASYNCHRONOUS entirely in favor of something easier to reason about on the different platforms targeted.

Summary by CodeRabbit

  • New Features

    • None
  • Bug Fixes

    • Improves stability during hot reload by enabling graceful shutdown, reducing chances of hangs or abrupt terminations.
    • Provides clearer shutdown behavior and logging when reload times out.
  • Refactor

    • Transitioned the hot-reload watchdog to a cooperative stop mechanism for smoother exits.
  • Chores

    • Initialized and cleaned up watchdog state to ensure predictable startup and shutdown behavior.


coderabbitai bot commented Oct 11, 2025

Walkthrough

Introduces a cooperative shutdown for the hot-reload watchdog in src/flb_reload.c by adding a should_stop flag, switching the watchdog thread to a timed loop with 100 ms checks, removing pthread cancellation, and updating startup/cleanup to initialize, signal stop, join the thread, and log accordingly.
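
As a rough illustration of the cooperative stop described in this walkthrough, the following is a minimal sketch and not the code in src/flb_reload.c. The struct watchdog_sketch, its fields, and the helpers watchdog_loop, watchdog_start, and watchdog_stop are hypothetical names chosen for the example. It assumes a POSIX environment (nanosleep) and stores the flag as an atomic_int, whereas the PR as reviewed declared it as volatile int.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical context: field names follow the walkthrough, not the real struct. */
struct watchdog_sketch {
    pthread_t tid;
    atomic_int should_stop;  /* set to 1 by the owner to request an early exit */
    int timeout_ms;          /* how long the watchdog waits before giving up */
    int timed_out;           /* written by the thread, read only after join */
};

/* Thread body: poll every 100 ms until asked to stop or the timeout elapses. */
static void *watchdog_loop(void *arg)
{
    struct watchdog_sketch *ctx = arg;
    const struct timespec tick = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    int elapsed_ms = 0;

    assert(ctx != NULL);
    while (elapsed_ms < ctx->timeout_ms) {
        if (atomic_load(&ctx->should_stop)) {
            return NULL;     /* cooperative exit requested by the owner */
        }
        nanosleep(&tick, NULL);
        elapsed_ms += 100;
    }
    ctx->timed_out = 1;      /* the real code would log an error and abort here */
    return NULL;
}

/* Initialize the flag before the thread exists, then create the thread. */
static int watchdog_start(struct watchdog_sketch *ctx, int timeout_ms)
{
    atomic_init(&ctx->should_stop, 0);
    ctx->timeout_ms = timeout_ms;
    ctx->timed_out = 0;
    return pthread_create(&ctx->tid, NULL, watchdog_loop, ctx);
}

/* Signal the thread and wait for it; no pthread_cancel is involved. */
static int watchdog_stop(struct watchdog_sketch *ctx)
{
    atomic_store(&ctx->should_stop, 1);
    return pthread_join(ctx->tid, NULL);
}
```

With this shape, the worst-case delay between watchdog_stop being called and pthread_join returning is roughly one polling interval (about 100 ms), which is the trade-off for not depending on cancellation semantics that differ between platforms.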

Changes

Cohort / File(s) | Summary
Hot-reload watchdog shutdown refactor (src/flb_reload.c) | Replace pthread cancellation with cooperative stop via should_stop; implement timed loop with 100 ms polling up to timeout_ms; initialize should_stop on start; on cleanup set should_stop, join thread, and adjust logging on early exit or timeout abort.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Main as Main Process
  participant WD as Watchdog Thread

  Note over Main: Startup
  Main->>WD: create thread (should_stop=0, timeout_ms)
  WD->>WD: loop every 100ms<br/>check elapsed_ms and should_stop

  alt should_stop set
    WD-->>Main: return early (graceful exit)
  else timeout reached
    WD-->>Main: log error and abort path
  end

  Note over Main: Cleanup
  Main->>WD: set should_stop=1
  Main->>WD: pthread_join
  WD-->>Main: exit joined (logged)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

backport to v4.0.x

Suggested reviewers

  • edsiper
  • koleini
  • fujimotos

Poem

A rabbit taps the watchdog’s clock—tick, tick!
No sudden yank, no canceling trick.
“Should we stop?” it sweetly asks,
Then hops away when done with tasks.
Ears up, loop tight, 100ms beat—
Graceful exits make reviews neat. 🐇⏱️

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title succinctly identifies the main change (addressing a race condition between starting the reload watchdog and using asynchronous cancellation) and references the relevant component ("reload"), so it accurately captures the core intent of the pull request.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/flb_reload.c (1)

386-386: Prefer atomic operations for thread-safe flag.

The volatile qualifier provides visibility but doesn't guarantee memory ordering on weakly-ordered architectures. For proper cross-platform thread synchronization, consider using C11 _Atomic or platform-specific atomic operations (e.g., InterlockedCompareExchange on Windows, __sync_* builtins on GCC/Clang).

Example using C11 atomics:

#include <stdatomic.h>

Then in the struct:

-    volatile int should_stop;
+    atomic_int should_stop;

And update accesses:

// In cleanup:
atomic_store(&watchdog_ctx->should_stop, 1);

// In thread:
if (atomic_load(&watchdog_ctx->should_stop)) {
    /* leave the watchdog loop early */
}
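
To make the memory-ordering point in this nitpick more concrete, here is a minimal sketch under stated assumptions. It is not taken from the PR: struct handoff_sketch, reader, and run_handoff are hypothetical names. It shows a plain int being published through a C11 atomic flag with release/acquire ordering, which is the guarantee that a volatile int does not provide. For a watchdog that only reads the flag itself, the default atomic_load and atomic_store calls suggested above would likely be enough.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical example type: a plain payload published through an atomic flag. */
struct handoff_sketch {
    int payload;       /* ordinary data, written before the flag is set */
    atomic_int ready;  /* release-stored by the writer, acquire-loaded by the reader */
    int observed;      /* what the reader saw; read by the writer only after join */
};

static void *reader(void *arg)
{
    struct handoff_sketch *h = arg;

    /* The acquire load pairs with the release store in run_handoff(). */
    while (!atomic_load_explicit(&h->ready, memory_order_acquire)) {
        /* busy-wait for brevity; real code would sleep between checks */
    }
    /* Once ready is seen as 1, the earlier write to payload is visible too. */
    h->observed = h->payload;
    return NULL;
}

static int run_handoff(int value)
{
    struct handoff_sketch h;
    pthread_t tid;
    int rc;

    h.payload = 0;
    h.observed = -1;
    atomic_init(&h.ready, 0);

    rc = pthread_create(&tid, NULL, reader, &h);
    assert(rc == 0);

    h.payload = value;                                         /* plain write */
    atomic_store_explicit(&h.ready, 1, memory_order_release);  /* publish it */

    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return h.observed;
}
```

Calling run_handoff(42) hands the value 42 to the reader thread and returns what the reader observed. With a volatile flag in place of the atomic, the C standard would treat the concurrent accesses as a data race, so the same visibility could not be relied on across compilers and architectures.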
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0def4e6 and bc40ad2.

📒 Files selected for processing (1)
  • src/flb_reload.c (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_reload.c (1)
include/fluent-bit/flb_mem.h (1)
  • flb_free (126-128)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: PR - fuzzing test
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-without-cxx (3.31.6)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
🔇 Additional comments (3)
src/flb_reload.c (3)

391-405: LGTM! Cooperative shutdown eliminates the cancellation race.

The new timed-loop approach with periodic checks is much more reliable and easier to reason about than asynchronous pthread cancellation. This properly addresses the race condition described in the PR where pthread_cancel might be called before pthread_setcanceltype completes.


429-429: LGTM!

Correct initialization of the flag before thread creation.


448-453: LGTM! Clean cooperative shutdown.

The cleanup sequence properly signals the thread to stop and waits for graceful exit. This is significantly more reliable than pthread_cancel, especially on Windows where the original issue was observed.
