Skip to content

mcp/stream: fix prevent potential goroutine leak on blocked incoming channel #123

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cryo-zd
Copy link
Contributor

@cryo-zd cryo-zd commented Jul 11, 2025

Change:

Adds a regression test to ensure the scanEvents goroutine exits properly when the done channel is closed and the incoming channel is full.

Previously, there was a potential for goroutine leak if scanEvents was blocked on writing to incoming and the done channel was closed before it could exit. This issue has already been fixed in prior changes — this test is intended to capture that edge case and prevent regressions.

Copy link
Contributor

@samthanawalla samthanawalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the CL! Can you add a test for this?

@cryo-zd cryo-zd force-pushed the fix/streamCliTrans branch from 0f708af to c71f318 Compare July 13, 2025 17:24
jba
jba previously approved these changes Jul 14, 2025

// Now simulate calling Close() and blocking the goroutine
close(conn.done)
time.Sleep(50 * time.Millisecond) // Allow goroutine to exit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a sync.WaitGroup instead of a sleep to remove flakiness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔, I agree that sync.WaitGroup is a great tool for waiting on goroutines, but in this case, since scanEvents is launched within handleSSE and doesn't expose a separate wait condition, sync.WaitGroup wouldn't directly help in waiting for scanEvents to exit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, we could wrap the call to handleSSE:

go func() {
    conn.handleSSE(resp)
    wg.Done()
}

Copy link
Contributor Author

@cryo-zd cryo-zd Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋, sorry for my delayed response.

I’ve implemented the proposed approach using a sync.WaitGroup around handleSSE, and replaced the time.Sleep with wg.Wait().

However, the test fails intermittently — although the leak has been correctly fixed. My understanding is that the scanEvents goroutine, which is launched inside handleSSE, might not observe the closed done channel and exit before handleSSE returns and wg.Wait() completes due to scheduling variance. So the test catches it as a false positive leak, even though the fix is correct.

Thanks again for the helpful suggestions and patience!
testFail

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rebase with upstream and upload the test?

I added the waitgroup and I am not able to reproduce the failures.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also update the PR to mention this CL is to add a leak test.

It seems Rob beat you to the fix! #130

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick response!

I actually noticed two days ago that the leak issue had been fixed — great to see that resolved.

I've rebased the branch and pushed the test update using sync.WaitGroup as suggested. On my local runs, the test sometimes passes, but occasionally fails due to a detected goroutine leak, which I believe is caused by scheduling variance.

This test currently relies on timing in two places to simulate the blocked goroutine scenario, and I personally feel this makes it a bit less elegant and deterministic. Given the potential for flakiness in CI, I'm also happy to close this PR if you'd prefer not to include this regression test.

test3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why it's failing on your machine, it seems to work on mine after running it 10000 times 🤔 :

:~/w/go-sdk/mcp$ go test -count=10000 -race -run TestStreamableClientConnSSEGoroutineLeak
PASS
ok      github.com/modelcontextprotocol/go-sdk/mcp      12.092s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋, sorry for the delay and thank you so much for your patience and continued feedback!

To help validate the issue further, I added a dedicated step to my GitHub Actions workflow in my personal fork using the following command:

- name: Test
      run: go test -v ./...
- name: TestGoroutineLeak
      run: go test -count=1000 -v -race -run TestStreamableClientConnSSEGoroutineLeak ./mcp

This directly targets the streamable_test.go file in the mcp package, where the leak test is defined. I noticed a failure in this setup using the same environment as upstream CI.

Here's the link to the failing run: link

When you have a moment, you could take a look and confirm whether it aligns with what you’d expect?


// Now simulate calling Close() and blocking the goroutine
close(conn.done)
time.Sleep(50 * time.Millisecond) // Allow goroutine to exit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, we could wrap the call to handleSSE:

go func() {
    conn.handleSSE(resp)
    wg.Done()
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants