-
Notifications
You must be signed in to change notification settings - Fork 78
mcp/stream: fix prevent potential goroutine leak on blocked incoming channel #123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the CL! Can you add a test for this?
0f708af
to
c71f318
Compare
mcp/streamable_test.go
Outdated
|
||
// Now simulate calling Close() and blocking the goroutine | ||
close(conn.done) | ||
time.Sleep(50 * time.Millisecond) // Allow goroutine to exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a sync.WaitGroup instead of a sleep to remove flakiness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔, I agree that sync.WaitGroup
is a great tool for waiting on goroutines, but in this case, since scanEvents
is launched within handleSSE
and doesn't expose a separate wait condition, sync.WaitGroup
wouldn't directly help in waiting for scanEvents
to exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, we could wrap the call to handleSSE:
go func() {
conn.handleSSE(resp)
wg.Done()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋, sorry for my delayed response.
I’ve implemented the proposed approach using a sync.WaitGroup
around handleSSE
, and replaced the time.Sleep
with wg.Wait()
.
However, the test fails intermittently — although the leak has been correctly fixed. My understanding is that the scanEvents
goroutine, which is launched inside handleSSE
, might not observe the closed done
channel and exit before handleSSE
returns and wg.Wait()
completes due to scheduling variance. So the test catches it as a false positive leak, even though the fix is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rebase with upstream and upload the test?
I added the waitgroup and I am not able to reproduce the failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also update the PR to mention this CL is to add a leak test.
It seems Rob beat you to the fix! #130
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick response!
I actually noticed two days ago that the leak issue had been fixed — great to see that resolved.
I've rebased the branch and pushed the test update using sync.WaitGroup
as suggested. On my local runs, the test sometimes passes, but occasionally fails due to a detected goroutine leak, which I believe is caused by scheduling variance.
This test currently relies on timing in two places to simulate the blocked goroutine scenario, and I personally feel this makes it a bit less elegant and deterministic. Given the potential for flakiness in CI, I'm also happy to close this PR if you'd prefer not to include this regression test.

There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why it's failing on your machine, it seems to work on mine after running it 10000 times 🤔 :
:~/w/go-sdk/mcp$ go test -count=10000 -race -run TestStreamableClientConnSSEGoroutineLeak
PASS
ok github.com/modelcontextprotocol/go-sdk/mcp 12.092s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋, sorry for the delay and thank you so much for your patience and continued feedback!
To help validate the issue further, I added a dedicated step to my GitHub Actions workflow in my personal fork using the following command:
- name: Test
run: go test -v ./...
- name: TestGoroutineLeak
run: go test -count=1000 -v -race -run TestStreamableClientConnSSEGoroutineLeak ./mcp
This directly targets the streamable_test.go
file in the mcp
package, where the leak test is defined. I noticed a failure in this setup using the same environment as upstream CI.
Here's the link to the failing run: link
When you have a moment, you could take a look and confirm whether it aligns with what you’d expect?
mcp/streamable_test.go
Outdated
|
||
// Now simulate calling Close() and blocking the goroutine | ||
close(conn.done) | ||
time.Sleep(50 * time.Millisecond) // Allow goroutine to exit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, we could wrap the call to handleSSE:
go func() {
conn.handleSSE(resp)
wg.Done()
}
Change:
Adds a regression test to ensure the
scanEvents
goroutine exits properly when thedone
channel is closed and theincoming
channel is full.Previously, there was a potential for goroutine leak if
scanEvents
was blocked on writing toincoming
and thedone
channel was closed before it could exit. This issue has already been fixed in prior changes — this test is intended to capture that edge case and prevent regressions.