Conversation

dmpots commented Aug 28, 2025

This PR refactors the amdgpu plugin initialization logic to make it more flexible so that we can choose when to initialize the debug library separately from attaching the process and creating the connection.

The change mostly shuffles around and cleans up existing code, but we now also explicitly track the state of the amd debug library (e.g. initialized, attached, runtime-initialized) so that we can use that state to guide decisions about when to create the debug connection.
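As a minimal sketch of what that explicit state tracking could look like (the enum name, enumerators, and comments below are illustrative assumptions, not the plugin's actual names):

// Illustrative sketch only: explicit states for the amd debug library that the
// plugin can consult when deciding whether it is safe to create the connection.
enum class AmdDebugLibState {
  Uninitialized,       // debug library not yet initialized
  Initialized,         // library initialized, process not yet attached
  Attached,            // native process attached to the debug library
  RuntimeInitialized,  // rocm runtime has been initialized in the inferior
};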

We also fixed the connection logic after the changes in f1343a4. That commit modified the timing of the NativeProcessIsStopping callback so that it is now triggered on the first stop that occurs when the native process is launched. This caused a problem because the GPUActions returned with that first stop-reply packet are ignored. The gdb-remote client sends a secondary $? packet to get the stop-reply packet again, but then we would try to call initRocm again because m_connected was false.

To avoid these problems, we now delay sending the connection until after the initial stop that occurs when the native process first launches. I also played with delaying the connection even further, to when the rocm runtime is initialized. That works, but it makes it awkward to use the debugger to set gpu breakpoints. The runtime is initialized on demand, so there is not always a good place to set a cpu breakpoint where we can halt the process and create the gpu breakpoints. If we change the debugger to propagate breakpoints from the cpu to the gpu, then this will not be an issue because we can set the breakpoints before the gpu target is created.
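As a rough illustration of that delay (building on the AmdDebugLibState sketch above), the stop handling could skip the very first stop and only fill in a connection request on a later stop. The class, member, and function names here are hypothetical and do not reflect the plugin's real API:

#include <optional>
#include <string>

// Hypothetical connection request attached to a stop-reply; not the real
// GPUActions type.
struct GpuConnectRequest {
  std::string connect_url;
};

class AmdGpuStopGate {
public:
  explicit AmdGpuStopGate(AmdDebugLibState lib_state) : m_lib_state(lib_state) {}

  // Called for every native process stop. The very first stop after launch is
  // skipped because the GPUActions returned with that stop-reply are ignored
  // by the client today.
  std::optional<GpuConnectRequest> OnNativeProcessStopping() {
    if (!m_seen_initial_stop) {
      m_seen_initial_stop = true;
      return std::nullopt;
    }
    if (!m_connected && m_lib_state >= AmdDebugLibState::Attached) {
      m_connected = true;
      return GpuConnectRequest{"connect://localhost:0"};
    }
    return std::nullopt;
  }

private:
  AmdDebugLibState m_lib_state;
  bool m_seen_initial_stop = false;
  bool m_connected = false;
};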

dmpots commented Aug 28, 2025

All tests pass after cherry-picking #42 and running

DOTEST_LD_EXTRAS=-Wl,--dynamic-linker=</path/to/fixed/ld.so> ./bin/llvm-lit ../../lldb/test/API/gpu/amd/ -v

jeffreytan81 commented Aug 30, 2025

The refactoring part seems fine. However, can you explain in the summary how f1343a4 breaks the AMD plugin workflow? That will probably help me understand how the changes fix it.

jeffreytan81 commented Aug 30, 2025

Also, does this PR change any user-visible behavior? We used to create the AMD GPU target/connection during the first native stop. After delaying the connection, is this only fixing a race condition with no user-visible change (still creating the GPU target during the first stop), or are we now creating the target later, after the first native stop?

dmpots commented Aug 30, 2025

@jeffreytan81

However, can you explain in the summary how f1343a4 breaks the AMD plugin workflow? That will probably help me understand how the changes fix it.

Updated the summary to explain how the AMD plugin workflow was broken:

We also fixed the connection logic after the changes in f1343a4. That commit modified the timing of the NativeProcessIsStopping callback so that it is now triggered on the first stop that occurs when the native process is launched. This caused a problem because the GPUActions returned with that first stop-reply packet are ignored. The gdb-remote client sends a secondary $? packet to get the stop-reply packet again, but then we would try to call initRocm again because m_connected was false.

I had a different draft commit (#32) that handled the very first stop-reply packet with GPUActions, but after discussing with @clayborg we went with this approach, which allows delaying the connection until a later time.

Also, does this PR change any user-visible behavior? We used to create the AMD GPU target/connection during the first native stop. After delaying the connection, is this only fixing a race condition with no user-visible change (still creating the GPU target during the first stop), or are we now creating the target later, after the first native stop?

There should be no user-visible changes here. We are still creating the connection on the first stop after the initial stop-reply from the launch sequence. The refactor does make it easier to move that connection time around though.

walter-erquinigo left a comment

I think I got a couple of good ideas from this patch


Status error = InitializeAmdDbgApi();
if (error.Fail()) {
  logAndReportFatalError("{} Failed to initialize debug library: {}",

TIL that you don't need to specify the index of the argument
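As a side note, here is a small standalone illustration of the two placeholder styles, assuming the logging helper forwards to an llvm::formatv-style formatter (the unindexed form also assumes an LLVM recent enough to support automatic index assignment):

#include "llvm/Support/FormatVariadic.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // Classic formatv style with explicit argument indices.
  llvm::outs() << llvm::formatv("{0} Failed to initialize debug library: {1}\n",
                                "[amdgpu]", "library not found");
  // Unindexed fields are filled in argument order, as in the reviewed call.
  llvm::outs() << llvm::formatv("{} Failed to initialize debug library: {}\n",
                                "[amdgpu]", "library not found");
  return 0;
}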

jeffreytan81 commented Sep 3, 2025

@dmpots, thanks for improving the summary.

However, I am not convinced that the following statement captures the root cause:

This caused a problem because the GPUActions returned with that first stop-reply packet are ignored.

For example, I can see that, after the first stop (during the first CPU connection), ProcessGDBRemote::DoConnectRemote calls SetThreadStopInfo, which internally handles the GPU actions and starts the GPU connection via a socket:

lldb_private::process_gdb_remote::ProcessGDBRemote::HandleGPUActions(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:923)
lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(StringExtractor&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:2734)
lldb_private::process_gdb_remote::ProcessGDBRemote::DoConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:612)
lldb_private::Process::ConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3276)
lldb_private::Platform::DoConnectProcess(llvm::StringRef, llvm::StringRef, lldb_private::Debugger&, lldb_private::Stream*, lldb_private::Target*, lldb_private::Status&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Platform.cpp:1926)
...

What really happens is that the lldb client double-handles the GPU actions later, during the second SetThreadStopInfo:

lldb_private::TargetList::CreateTarget(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::shared_ptr<lldb_private::Target>&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/TargetList.cpp:51)
lldb_private::process_gdb_remote::ProcessGDBRemote::HandleConnectionRequest(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:1041)
lldb_private::process_gdb_remote::ProcessGDBRemote::HandleGPUActions(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:941)
lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(StringExtractor&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:2734)
lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop() (lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop():40)
lldb_private::Process::ShouldBroadcastEvent(lldb_private::Event*) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3766)
lldb_private::Process::HandlePrivateEvent(std::shared_ptr<lldb_private::Event>&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:4014)
lldb_private::Process::ConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3289)
...

This second GPU socket connection should not happen and will obviously fail, which shows up as the following error in the lldb client:

error: HandleGPUActions failed. Error: failed to get reply to handshake packet within timeout of 0.0 seconds

And the second HandleGPUActions() incorrectly tries to create another dummy target, which explains the three-target mystery I have been observing:

(lldb) target list
Current targets:
  target #0: /data/users/jeffreytan/fbsource/buck-out/v2/gen/fbcode/a0927a84de4fee0f/scripts/xdwang/amd/bit_extract/__shared_memory_test__/shared_memory_test ( arch=x86_64-unknown-linux-gnu, platform=host, pid=2318752, state=stopped )
  target #1: <none> ( arch=amdgcn-amd-amdhsa--gfx942, platform=host, pid=1, state=stopped )
* target #2: <none> ( platform=host, state=unloaded )

Overall, this PR's approach of delaying the connection creation until after the initial connection will prevent this weird workflow and code path. However, I am concerned that it may not fix the root cause or scale well. For example, how does a future plugin know that the first stop can't be used and that it has to delay to the second stop?

I think we should probably fix the behavior on the lldb client side so that it does not call SetThreadStopInfo twice during a single stop. Adding an m_last_stop_packet.reset() call inside DoConnectRemote after the first SetThreadStopInfo seems to fix the issue, but I am not sure that is the best fix. We should meet and discuss a better fix with @clayborg.
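A rough standalone model of that proposed client-side fix, with types and names that only mimic the real ProcessGDBRemote members (an assumption-laden sketch, not the actual implementation):

#include <optional>
#include <string>

// Simplified model of clearing the cached stop packet after DoConnectRemote
// has consumed it, so RefreshStateAfterStop cannot process (and double-handle)
// the same stop-reply again.
struct StopPacket {
  std::string payload;
};

struct GdbRemoteClientModel {
  std::optional<StopPacket> m_last_stop_packet;
  int gpu_actions_handled = 0;

  void SetThreadStopInfo(const StopPacket &packet) {
    // In the real client this parses the stop-reply and handles any GPUActions
    // attached to it.
    (void)packet;
    ++gpu_actions_handled;
  }

  void DoConnectRemote() {
    if (m_last_stop_packet) {
      SetThreadStopInfo(*m_last_stop_packet);
      // Proposed fix: drop the packet so the later RefreshStateAfterStop
      // cannot handle the same GPUActions a second time.
      m_last_stop_packet.reset();
    }
  }

  void RefreshStateAfterStop() {
    if (m_last_stop_packet)
      SetThreadStopInfo(*m_last_stop_packet);
  }
};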

dmpots commented Sep 3, 2025

@jeffreytan81

However, I am not convinced that the following statement captures the root cause:

This caused a problem because the GPUActions returned with that first stop-reply packet are ignored.

Agree that this is a problem. We may also be running into different issues depending on how we are launching (lldb launching the executable vs. attaching to a separately launched lldb server).

I'll set up a meeting to discuss further.

dmpots commented Sep 9, 2025

I'll set up a meeting to discuss further.

We discussed offline. We will merge this PR to get back to a working state and separately fix the issue where GPUActions are ignored in the first stop-reply packet.

dmpots merged commit 4aff835 into clayborg:llvm-server-plugins on Sep 9, 2025
6 checks passed