Skip to content

Conversation

@skyegalaxy
Copy link
Member

@skyegalaxy skyegalaxy commented Sep 11, 2025

Description

This PR fixes a lock order inversion bug detected by Thread Sanitizer between GraphListener::shutdown_mutex_ and Context::on_shutdown_callbacks_mutex_, by storing and instantiating the GraphListener inside of the Context instead of NodeGraph. With this change, we avoid locking the context weak pointer and inverting the order of lock acquisition on startup and shutdown.

Usage of is_shutdown_->load() was also replaced with the existing is_shutdown() method in GraphListener for consistency.

Fixes # (2946)

Is this user-facing behavior change?

no

Did you use Generative AI?

no

Additional Information

This code appears to have remained unchanged for a long time, and I was able to reproduce this tsan error on iron, jazzy and rolling with this example. Tsan identified a lock order inversion as well as one potential data race in other parts of the code. With this change, the lock order inversion warning is gone, no additional tsan warnings were introduced, and all unit tests passed locally when these changes were built against both jazzy and rolling.

@skyegalaxy skyegalaxy changed the title Factor out shutdown hook registration from start_if_not_started in GraphListener Factor out shutdown hook registration from start_if_not_started in GraphListener Sep 11, 2025
@skyegalaxy
Copy link
Member Author

Seems that the one failing test rclcpp.TestExecutor.spin_all_fail_wait_set_clear is known to be flaky, from examining previous PRs.

@skyegalaxy skyegalaxy marked this pull request as ready for review September 11, 2025 22:54
@jmachowinski
Copy link
Collaborator

Why the new interface, would it not be sufficient to move the lock down in start_if_not_started() ?

@skyegalaxy
Copy link
Member Author

Why the new interface, would it not be sufficient to move the lock down in start_if_not_started() ?

My original thinking was that the behavior I factored out seemed like it could be its own functionality and that start_if_not_started felt like it was doing a bit too much compared to its name, but at the end of the day I really just wanna fix the tsan issue and not make this a drastic interface change.

If others agree that the check for is_shutdown_ ought to be okay outside of shutdown_lock_ because it's an atomic_bool, I can just move the lock down

@skyegalaxy skyegalaxy changed the title Factor out shutdown hook registration from start_if_not_started in GraphListener Move shutdown hook registration outside of shutdown_mutex_ in GraphListener::start_if_not_started Sep 12, 2025
@jmachowinski
Copy link
Collaborator

jmachowinski commented Sep 15, 2025

I guess the shutdown handler needs to be registered after the start. If not, it could be called (In the super race from hell) before the listener thread is started and you'll end up in a weird state.
Hm looking at this more, I guess we now have a race going on in this special case after the patch. Moving the registration down creates another race (lost shutdown), so we need a better solution...

@skyegalaxy
Copy link
Member Author

skyegalaxy commented Sep 19, 2025

ok, after speaking with @wjwwood at today's client library WG meeting, I think we've got a solid idea for how to move forward. Rather than having the GraphListener consume a pointer to the context that it then adds a shutdown callback to while holding onto the shutdown mutex, we proposed refactoring the code so that the GraphListener is created within the Context. The Context would lock to prevent multiple GraphListener instantiations, create the GraphListener, register its shutdown hooks, and release the lock. Whereas the GraphListener shutdown callback would acquire its shutdown lock to prevent parallel shutdowns.

@skyegalaxy skyegalaxy force-pushed the graph-listener-lock-order-fixes branch from 06b4b29 to 0440c31 Compare October 2, 2025 23:34
@skyegalaxy skyegalaxy changed the title Move shutdown hook registration outside of shutdown_mutex_ in GraphListener::start_if_not_started Store graph listener inside the context instead of the node graph Oct 2, 2025
@skyegalaxy
Copy link
Member Author

Along with the known flaky test, the only other failing test in CI appears to be another extremely rare flake. from my clean rolling workspace:

smedeiros@1395239c7f9e:~/SourceCode/rolling-ws-2$ for run in {1..5000}; do ./build/rclcpp/test/rclcpp/test_executors_busy_waiting; done | grep FAILED
[  FAILED  ] TestBusyWaiting/EventsExecutor.test_spin, where TypeParam = rclcpp::experimental::executors::EventsExecutor (10017 ms)
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestBusyWaiting/EventsExecutor.test_spin, where TypeParam = rclcpp::experimental::executors::EventsExecutor
 1 FAILED TEST
smedeiros@1395239c7f9e:~/SourceCode/rolling-ws-2$ 

@skyegalaxy
Copy link
Member Author

Pulls: #2952
Gist: https://gist.githubusercontent.com/skyegalaxy/68fa478b26b26da11996d4a251a95454/raw/f4814cd9de16ce06f1fbfc30d34b63bde90ffa6c/ros2.repos
BUILD args: --packages-above-and-dependencies rclcpp
TEST args: --packages-above rclcpp
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/17194

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Signed-off-by: Skyler Medeiros <[email protected]>
@skyegalaxy
Copy link
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@ahcorde
Copy link
Contributor

ahcorde commented Oct 6, 2025

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@ahcorde ahcorde requested a review from fujitatomoya October 8, 2025 10:27
@fujitatomoya
Copy link
Collaborator

@skyegalaxy thanks for fixing this up. lgtm with just a nit, which is just comment block. i do not think we need to re-run the CI after the comment fix.

@skyegalaxy
Copy link
Member Author

skyegalaxy commented Oct 9, 2025

@skyegalaxy thanks for fixing this up. lgtm with just a nit, which is just comment block. i do not think we need to re-run the CI after the comment fix.

Done! 0cbaecf
(looks like the original DCO failure was because I used github's "apply suggestion" feature and it signed off with my personal email instead of my work email!)

@skyegalaxy skyegalaxy force-pushed the graph-listener-lock-order-fixes branch from 7e70733 to 0cbaecf Compare October 9, 2025 01:03
@fujitatomoya fujitatomoya merged commit 715a983 into ros2:rolling Oct 9, 2025
2 of 3 checks passed
@skyegalaxy
Copy link
Member Author

@fujitatomoya thanks for reviewing and merging my PR! Since we caught this in Jazzy, Is this worth backporting to older distros as well?

@fujitatomoya
Copy link
Collaborator

@Mergifyio backport kilted jazzy

@mergify
Copy link
Contributor

mergify bot commented Oct 9, 2025

backport kilted jazzy

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Oct 9, 2025
)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
(cherry picked from commit 715a983)

# Conflicts:
#	rclcpp/src/rclcpp/context.cpp
mergify bot pushed a commit that referenced this pull request Oct 9, 2025
)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
(cherry picked from commit 715a983)

# Conflicts:
#	rclcpp/src/rclcpp/context.cpp
skyegalaxy added a commit to irobot-ros/rclcpp that referenced this pull request Oct 9, 2025
…s2#2952)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
skyegalaxy added a commit that referenced this pull request Oct 9, 2025
)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
skyegalaxy added a commit that referenced this pull request Oct 9, 2025
)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
skyegalaxy added a commit that referenced this pull request Oct 9, 2025
)

* factor shutdown hook registration into its own method

Signed-off-by: Skyler Medeiros <[email protected]>

* run ament_uncrustify

Signed-off-by: Skyler Medeiros <[email protected]>

* consistent use of is_shutdown in graph-listener

Signed-off-by: Skyler Medeiros <[email protected]>

* this doesn't actually need to be a separate function

Signed-off-by: Skyler Medeiros <[email protected]>

* whitespace

Signed-off-by: Skyler Medeiros <[email protected]>

* fix typo

Signed-off-by: Skyler Medeiros <[email protected]>

* move graph listener into the context

Signed-off-by: Skyler Medeiros <[email protected]>

* make cpplint happy

Signed-off-by: Skyler Medeiros <[email protected]>

* Update rclcpp/src/rclcpp/context.cpp

Signed-off-by: Skyler Medeiros <[email protected]>

---------

Signed-off-by: Skyler Medeiros <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

lock order inversion between GraphListener and Context in action clients

4 participants