Conversation

dmpots commented Aug 28, 2025

This PR refactors the amdgpu plugin initialization logic to make it more flexible so that we can choose when to initialize the debug library separately from attaching the process and creating the connection.

The change mostly shuffles around and cleans up existing code, but we now also explicitly track the state of the amd debug library (e.g. initialized, attached, runtime-initialized) so that we can use that state to guide decisions about when to create the debug connection.
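As a minimal sketch of what that explicit state tracking could look like (the enum name, enumerators, and comments below are illustrative assumptions, not the plugin's actual names):

// Illustrative sketch only: explicit states for the amd debug library that the
// plugin can consult when deciding whether it is safe to create the connection.
enum class AmdDebugLibState {
  Uninitialized,       // debug library not yet initialized
  Initialized,         // library initialized, process not yet attached
  Attached,            // native process attached to the debug library
  RuntimeInitialized,  // rocm runtime has been initialized in the inferior
};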

We also fixed the connection logic after the changes in f1343a4. That commit modified the timing of the NativeProcessIsStopping callback so that it is now triggered on the first stop that occurs when the native process is launched. This caused a problem because the GPUActions returned with that first stop-reply packet are ignored. The gdb-remote client sends a secondary $? packet to get the stop-reply packet again, but then we would try to call initRocm again because m_connected was false.

To avoid these problems, we now delay sending the connection until after the initial stop that occurs when the native process first launches. I also played with delaying the connection even further, to when the rocm runtime is initialized. That works, but it makes it awkward to use the debugger to set gpu breakpoints. The runtime is initialized on demand, so there is not always a good place to set a cpu breakpoint where we can halt the process and create the gpu breakpoints. If we change the debugger to propagate breakpoints from the cpu to the gpu, then this will not be an issue because we can set the breakpoints before the gpu target is created.
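As a rough illustration of that delay (building on the AmdDebugLibState sketch above), the stop handling could skip the very first stop and only fill in a connection request on a later stop. The class, member, and function names here are hypothetical and do not reflect the plugin's real API:

#include <optional>
#include <string>

// Hypothetical connection request attached to a stop-reply; not the real
// GPUActions type.
struct GpuConnectRequest {
  std::string connect_url;
};

class AmdGpuStopGate {
public:
  explicit AmdGpuStopGate(AmdDebugLibState lib_state) : m_lib_state(lib_state) {}

  // Called for every native process stop. The very first stop after launch is
  // skipped because the GPUActions returned with that stop-reply are ignored
  // by the client today.
  std::optional<GpuConnectRequest> OnNativeProcessStopping() {
    if (!m_seen_initial_stop) {
      m_seen_initial_stop = true;
      return std::nullopt;
    }
    if (!m_connected && m_lib_state >= AmdDebugLibState::Attached) {
      m_connected = true;
      return GpuConnectRequest{"connect://localhost:0"};
    }
    return std::nullopt;
  }

private:
  AmdDebugLibState m_lib_state;
  bool m_seen_initial_stop = false;
  bool m_connected = false;
};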

dmpots commented Aug 28, 2025

All tests pass after cherry-picking #42 and running

DOTEST_LD_EXTRAS=-Wl,--dynamic-linker=</path/to/fixed/ld.so> ./bin/llvm-lit ../../lldb/test/API/gpu/amd/ -v

jeffreytan81 commented Aug 30, 2025

The refactoring part seems fine. However, can you explain in the summary how f1343a4 breaks the AMD plugin workflow? That will probably help me understand how the changes fix it.

jeffreytan81 commented Aug 30, 2025

Also, does this PR change any user-visible behavior? We used to create the AMD GPU target/connection during the first native stop. After delaying the connection, is this only fixing a race condition with no user-visible change (still creating the GPU target during the first stop), or are we now creating the target later, after the first native stop?

dmpots commented Aug 30, 2025

@jeffreytan81

However, can you explain in the summary how f1343a4 breaks the AMD plugin workflow? That will probably help me understand how the changes fix it.

Updated the summary to explain how the AMD plugin workflow was broken:

We also fixed the connection logic after the changes in f1343a4. That commit modified the timing of the NativeProcessIsStopping callback so that it is now triggered on the first stop that occurs when the native process is launched. This caused a problem because the GPUActions returned with that first stop-reply packet are ignored. The gdb-remote client sends a secondary $? packet to get the stop-reply packet again, but then we would try to call initRocm again because m_connected was false.

I had a different draft commit (#32) that handled the very first stop-reply packet with GPUActions, but after discussing with @clayborg we went with this approach, which allows delaying the connection until a later time.

Also, does this PR change any user-visible behavior? We used to create the AMD GPU target/connection during the first native stop. After delaying the connection, is this only fixing a race condition with no user-visible change (still creating the GPU target during the first stop), or are we now creating the target later, after the first native stop?

There should be no user-visible changes here. We are still creating the connection on the first stop after the initial stop-reply from the launch sequence. The refactor does make it easier to move that connection time around though.

walter-erquinigo left a comment

I think I got a couple of good ideas from this patch


Status error = InitializeAmdDbgApi();
if (error.Fail()) {
  logAndReportFatalError("{} Failed to initialize debug library: {}",

TIL that you don't need to specify the index of the argument
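As a side note, here is a small standalone illustration of the two placeholder styles, assuming the logging helper forwards to an llvm::formatv-style formatter (the unindexed form also assumes an LLVM recent enough to support automatic index assignment):

#include "llvm/Support/FormatVariadic.h"
#include "llvm/Support/raw_ostream.h"

int main() {
  // Classic formatv style with explicit argument indices.
  llvm::outs() << llvm::formatv("{0} Failed to initialize debug library: {1}\n",
                                "[amdgpu]", "library not found");
  // Unindexed fields are filled in argument order, as in the reviewed call.
  llvm::outs() << llvm::formatv("{} Failed to initialize debug library: {}\n",
                                "[amdgpu]", "library not found");
  return 0;
}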

jeffreytan81 commented Sep 3, 2025

@dmpots, thanks for improving the summary.

However, I am not convinced that the following statement captures the root cause:

This caused a problem because the GPUActions returned with that first stop-reply packet are ignored.

For example, I can see that, after the first stop (during the first CPU connection), ProcessGDBRemote::DoConnectRemote calls SetThreadStopInfo, which internally handles the GPU actions and starts the GPU connection via a socket:

lldb_private::process_gdb_remote::ProcessGDBRemote::HandleGPUActions(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:923)
lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(StringExtractor&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:2734)
lldb_private::process_gdb_remote::ProcessGDBRemote::DoConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:612)
lldb_private::Process::ConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3276)
lldb_private::Platform::DoConnectProcess(llvm::StringRef, llvm::StringRef, lldb_private::Debugger&, lldb_private::Stream*, lldb_private::Target*, lldb_private::Status&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Platform.cpp:1926)
...

What really happens is that the lldb client double-handles the GPU actions later, during the second SetThreadStopInfo:

lldb_private::TargetList::CreateTarget(lldb_private::Debugger&, llvm::StringRef, llvm::StringRef, lldb_private::LoadDependentFiles, lldb_private::OptionGroupPlatform const*, std::shared_ptr<lldb_private::Target>&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/TargetList.cpp:51)
lldb_private::process_gdb_remote::ProcessGDBRemote::HandleConnectionRequest(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:1041)
lldb_private::process_gdb_remote::ProcessGDBRemote::HandleGPUActions(lldb_private::GPUActions const&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:941)
lldb_private::process_gdb_remote::ProcessGDBRemote::SetThreadStopInfo(StringExtractor&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp:2734)
lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop() (lldb_private::process_gdb_remote::ProcessGDBRemote::RefreshStateAfterStop():40)
lldb_private::Process::ShouldBroadcastEvent(lldb_private::Event*) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3766)
lldb_private::Process::HandlePrivateEvent(std::shared_ptr<lldb_private::Event>&) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:4014)
lldb_private::Process::ConnectRemote(llvm::StringRef) (/home/jeffreytan/llvm/llvm-project/lldb/source/Target/Process.cpp:3289)
...

This second GPU socket connection should not happen and will obviously fail, which shows up as the following error in the lldb client:

error: HandleGPUActions failed. Error: failed to get reply to handshake packet within timeout of 0.0 seconds

And the second HandleGPUActions() incorrectly tries to create another dummy target, which explains the three-target mystery I have been observing:

(lldb) target list
Current targets:
  target #0: /data/users/jeffreytan/fbsource/buck-out/v2/gen/fbcode/a0927a84de4fee0f/scripts/xdwang/amd/bit_extract/__shared_memory_test__/shared_memory_test ( arch=x86_64-unknown-linux-gnu, platform=host, pid=2318752, state=stopped )
  target #1: <none> ( arch=amdgcn-amd-amdhsa--gfx942, platform=host, pid=1, state=stopped )
* target #2: <none> ( platform=host, state=unloaded )

Overall, this PR's approach of delaying the connection creation until after the initial connection will prevent this weird workflow and code path. However, I am concerned that it may not fix the root cause or scale well. For example, how does a future plugin know that the first stop can't be used and that it has to delay to the second stop?

I think we should probably fix the behavior on the lldb client side so that it does not call SetThreadStopInfo twice during a single stop. Adding an m_last_stop_packet.reset() call inside DoConnectRemote after the first SetThreadStopInfo seems to fix the issue, but I am not sure that is the best fix. We should meet and discuss a better fix with @clayborg.
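A rough standalone model of that proposed client-side fix, with types and names that only mimic the real ProcessGDBRemote members (an assumption-laden sketch, not the actual implementation):

#include <optional>
#include <string>

// Simplified model of clearing the cached stop packet after DoConnectRemote
// has consumed it, so RefreshStateAfterStop cannot process (and double-handle)
// the same stop-reply again.
struct StopPacket {
  std::string payload;
};

struct GdbRemoteClientModel {
  std::optional<StopPacket> m_last_stop_packet;
  int gpu_actions_handled = 0;

  void SetThreadStopInfo(const StopPacket &packet) {
    // In the real client this parses the stop-reply and handles any GPUActions
    // attached to it.
    (void)packet;
    ++gpu_actions_handled;
  }

  void DoConnectRemote() {
    if (m_last_stop_packet) {
      SetThreadStopInfo(*m_last_stop_packet);
      // Proposed fix: drop the packet so the later RefreshStateAfterStop
      // cannot handle the same GPUActions a second time.
      m_last_stop_packet.reset();
    }
  }

  void RefreshStateAfterStop() {
    if (m_last_stop_packet)
      SetThreadStopInfo(*m_last_stop_packet);
  }
};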

dmpots commented Sep 3, 2025

@jeffreytan81

However, I am not convinced that the following statement captures the root cause:

This caused a problem because the GPUActions returned with that first stop-reply packet are ignored.

Agree that this is a problem. We may also be running into different issues depending on how we are launching (lldb launching the executable vs. attaching to a separately launched lldb server).

I'll set up a meeting to discuss further.

dmpots commented Sep 9, 2025

I'll set up a meeting to discuss further.

We discussed offline. We will merge this PR to get back to a working state and separately fix the issue where GPUActions are ignored in the first stop-reply packet.

dmpots merged commit 4aff835 into clayborg:llvm-server-plugins on Sep 9, 2025
6 checks passed