[GPU][AMD] Refactor initialization code and delay connection #41
This PR refactors the amdgpu plugin initialization logic to make it more flexible so that we can choose when to initialize the debug library separately from attaching the process and creating the connection. The change mostly shuffles around and cleans up existing code, but we now also explicitly track the state of the amd debug library (e.g. initialized, attached, runtime-initialized) so that we can use that state to guide decisions about when to create the debug connection.

We also fixed the connection logic after the changes in f1343a4. It now delays the connection until after the initial stop that occurs for the native process when it first ptrace attaches. The initial stop does not handle the GPUActions, so we need to wait for a later stop.

I played with delaying the connection even further, to when the rocm runtime is initialized. That works, but it makes it awkward to use the debugger to set gpu breakpoints. The runtime is initialized on demand, so there is not always a good place to set a cpu breakpoint where we can halt the process and create the gpu breakpoints. If we change the debugger to propagate breakpoints from the cpu to the gpu then this will not be an issue, because we can set the breakpoints before the gpu target is created.
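As a rough illustration of the state tracking described above, here is a minimal sketch; the type and function names (`AmdDebugLibState`, `ReadyToCreateGpuConnection`) are hypothetical and only mirror the states named in the summary, not the plugin's actual API:

```cpp
// Hypothetical sketch: an explicit state for the amd debug library, mirroring
// the states mentioned in the summary (initialized, attached,
// runtime-initialized). Names are illustrative only.
enum class AmdDebugLibState {
  Uninitialized,      // debug library not yet initialized
  Initialized,        // debug library initialized
  Attached,           // attached to the native process
  RuntimeInitialized, // rocm runtime has been initialized
};

// Decide whether it is safe to create the GPU debug connection yet.
inline bool ReadyToCreateGpuConnection(AmdDebugLibState state,
                                       bool already_connected) {
  // Scoped enums support relational comparison on their underlying values.
  return !already_connected && state >= AmdDebugLibState::Attached;
}
```

Tracking an explicit state like this makes it straightforward to move the connection point later (for example, to `RuntimeInitialized`) without restructuring the initialization code.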
All tests are passing by cherry-picking #42 and running
The refactoring part seems fine. However, can you explain in the summary how f1343a4 breaks the AMD plugin workflow?
Also, is this PR changing any user-visible behavior? We used to create the AMD GPU target/connection during the first native stop. After delaying the connection, is this only fixing a race condition with no user-visible change (still creating the GPU target during the first stop), or are we now creating the target later, after the first native stop?
Updated the summary to explain how the AMD plugin workflow was broken:
I had a different draft commit (#32) that handled the very first stop-reply packet with GPUActions, but after discussing with @clayborg we went with this approach that allows delaying the connection until a later time.
There should be no user-visible changes here. We are still creating the connection on the first stop after the initial stop-reply from the launch sequence. The refactor does make it easier to move that connection time around, though.
I think I got a couple of good ideas from this patch
Status error = InitializeAmdDbgApi();
if (error.Fail()) {
  logAndReportFatalError("{} Failed to initialize debug library: {}",
TIL that you don't need to specify the index of the argument
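For context, the empty `{}` replacement fields in the snippet above use automatic argument indexing. Below is a minimal sketch using C++20 `std::format`; the plugin's `logAndReportFatalError` helper is assumed to forward to a format/formatv-style formatter, which may differ in detail:

```cpp
#include <format>
#include <iostream>
#include <string>

int main() {
  // Empty fields consume the arguments in order...
  std::string automatic =
      std::format("{} Failed to initialize debug library: {}", "[amdgpu]", "timeout");
  // ...which is equivalent to writing the indices out explicitly.
  std::string indexed =
      std::format("{0} Failed to initialize debug library: {1}", "[amdgpu]", "timeout");
  std::cout << (automatic == indexed) << "\n"; // prints 1
}
```

Note that a given format string must stick to one style; mixing automatic and explicit indexing in the same string is not allowed.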
@dmpots, thanks for improving the summary. However, I am convinced that the root cause is the following statement:

For example, I can see that, after the first stop (during the first CPU connection),

What really happens is that the lldb client double-handles the GPU actions later, during the second

This second socket GPU connection should not happen and will obviously fail, which shows up as the following error in the lldb client:

And the second HandleGPUActions() is incorrectly trying to create another dummy target, which explains the three-targets mystery I am observing:

Overall, this PR's approach of delaying the connection creation beyond the first connection will prevent this weird workflow and code path. However, I am concerned that it may not fix the root cause and may not scale. For example, how does a future plugin know that the first stop can't be used and that it has to delay to the second stop? I think we should probably fix the behavior on the lldb client side so that it won't call
Agree that this is a problem. We may also be running into different issues depending on how we are launching (lldb launching the executable vs. attaching to a separately launched lldb server). I'll set up a meeting to discuss further.
We discussed offline. We will merge this PR to get back to a working state and separately fix the issue where GPUActions are ignored in the first stop-reply packet.
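As a rough sketch of the client-side fix discussed above (not actual lldb code; the names and the idea of keying on a stop ID are assumptions), the client could make GPU-action handling idempotent per stop, so that re-fetching the same stop reply via a second `$?` does not trigger another GPU connection or dummy target:

```cpp
#include <cstdint>
#include <unordered_set>

// Hypothetical helper: remember which stops already had their GPUActions
// handled so a repeated stop-reply for the same stop is processed only once.
class GpuActionDeduper {
public:
  // Returns true exactly once per stop_id; later calls for the same stop
  // return false and the caller should skip its GPU-action handling.
  bool ShouldHandle(uint64_t stop_id) {
    return m_handled.insert(stop_id).second;
  }

private:
  std::unordered_set<uint64_t> m_handled;
};
```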
This PR refactors the amdgpu plugin initialization logic to make it more flexible so that we can choose when to initialize the debug library separately from attaching the process and creating the connection.
The change mostly shuffles around and cleans up existing code, but we now also explicitly track the state of the amd debug library (e.g. initialized, attached, runtime-initialized) so that we can use that to guide decisions around when to generate the debug connection.
We also fixed the connection logic after the changes in f1343a4. That commit modified the timing of the `NativeProcessIsStopping` callback such that the callback is now triggered on the first stop that occurs when the native process is launched. This caused a problem because the GPUActions returned with that first stop-reply packet are ignored. The gdb-remote client sends a secondary `$?` packet to get the stop-reply packet again, but then we would try to call `initRocm` again because `m_connected` was false.

To avoid these problems we now delay creating the connection until after the initial stop that occurs for the native process when it first launches. I also played with delaying the connection even further, to when the rocm runtime is initialized. That works, but it makes it awkward to use the debugger to set gpu breakpoints. The runtime is initialized on demand, so there is not always a good place to set a cpu breakpoint where we can halt the process and create the gpu breakpoints. If we change the debugger to propagate breakpoints from the cpu to the gpu then this will not be an issue, because we can set the breakpoints before the gpu target is created.
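A minimal sketch of the delayed-connection behavior described above; the class, callback, and member names are hypothetical and only illustrate the idea of skipping the initial launch stop before creating the GPU connection:

```cpp
#include <cstdint>

// Hypothetical sketch: ignore the first native stop (the launch/ptrace-attach
// stop whose stop-reply GPUActions are ignored) and create the GPU connection
// on a later stop instead. Names are illustrative, not the plugin's API.
class DelayedGpuConnection {
public:
  // Invoked whenever the native process stops.
  void OnNativeProcessStopping() {
    ++m_stop_count;
    if (m_stop_count < 2 || m_connected)
      return; // still in the initial launch stop, or already connected
    m_connected = TryCreateGpuConnection();
  }

private:
  bool TryCreateGpuConnection() {
    // Placeholder: open the GPU connection and report GPUActions here.
    return true;
  }

  uint32_t m_stop_count = 0;
  bool m_connected = false;
};
```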