Skip to content

Conversation

@heavycrystal
Copy link
Contributor

No description provided.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds SSH keepalive functionality to the MySQL connector to detect and handle connection failures. When the SSH tunnel becomes unresponsive, the system can now automatically close the database connection and log the failure.

  • Implements SSH keepalive mechanism with configurable interval (15 seconds)
  • Adds connection monitoring and automatic cleanup on SSH tunnel failure
  • Enhances MySQL connector to respond to SSH keepalive failures

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
flow/connectors/utils/ssh.go Adds keepalive channels, goroutine with periodic SSH requests, and channel management
flow/connectors/mysql/mysql.go Integrates SSH keepalive monitoring into MySQL connector's connection management loop

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

_, _, err := tunnel.Client.SendRequest("[email protected]", true, nil)
if err != nil {
logger.Error("Failed to send keep alive", slog.Any("error", err))
// don't close it, otherwise a select can read from channel forever until we error out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selecting a closed channel immediately returns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just clarified comment, I don't want this behaviour as it'll keep trying to close MySQL connection while the rest of the system catches up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could close it, then nil out field in tunnel, so later calls to GetKeepaliveChan returns nil, then in code check that if function returns nil connection is down

Copy link
Contributor

@ilidemi ilidemi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How we can autotest this:

  1. ToxiProxy + SSH + MySQL peer

    • an integration test doing a ping (succeeds), starting a SELECT SLEEP(60) query, sleeping for a few sec so it starts, proxy.Disable(), the query should return with an error in less than 60 seconds total. If the keepalive interval is configurable, the test can be shorter than 60s. #3493 adds a toxi for Postgres as an example
    • SELECT SLEEP(10) with a very short keepalive finishes successfully
  2. Unit tests for the channel logic, with a mock SSH client:

    • multiple GetKeepaliveChan calls return the same channel
    • keepalive stops on Close
    • keepalive stops on context cancel
    • nil client returns nil channel
    • no goroutine leak after multiple start/stop cycles (aka many qrep partitions)

Code Opus is great for both docker stuff and mock stuff

for {
select {
case <-ticker.C:
_, _, err := tunnel.Client.SendRequest("[email protected]", true, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the receiver is a black hole? I don't think we're setting a net.Conn deadline (different from Dialer deadline) explicitly anywhere so this might hang indefinitely (our reference extends the deadline on every request). Alternatively, can do time.After concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the reference, we don't have a handle on the underlying net.Conn and having one is non-trivial because the MySQL connector can create multiple connections with the dialer we provide.

Can revisit this if we see the failure mode in practice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'd have a time.After goroutine that closes connection if request is blocked too long

@heavycrystal heavycrystal force-pushed the mysql-ssh-keepalive branch 2 times, most recently from 5205997 to b0a3aa3 Compare September 22, 2025 12:34
@heavycrystal heavycrystal merged commit c2b38df into main Sep 25, 2025
13 of 14 checks passed
@heavycrystal heavycrystal deleted the mysql-ssh-keepalive branch September 25, 2025 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants