Hi @bseto, thanks for the discussion. We did something similar, but partially: #589. I think you might also need to do the optimization in the same place. To update the slots, we will need to acquire the full lock, not just the RLock. I am not sure if this is really a good idea, though. But if it works well in your case, I think we are good. I also wonder how we can improve
Hello!
I'd like to add an option to avoid doing a full refresh of the `clusterClient`'s `pslots`, `rslots` and `conns` when a client receives a `MOVED` error.

Why:
When a kvrocks slot migration completes, our rueidis clients seem to disconnect and then attempt to reconnect. While reconnecting, I can see the client connection metrics spike from 10k clients to 90k+ sometimes, which ends up DDoS'ing the cluster. The cluster stays up, but none of the clients can do any meaningful work until we take down half of them and allow the rest to slowly connect.
What I think is the Problem:
I'm currently writing some modifications to a fork to test and confirm this, but what I think is happening is that when we receive a `MOVED` error, `isMoved()` in cluster.go's `shouldRefreshRetry` calls `lazyRefresh()`. I am able to see from my own local testing that a refresh can be triggered multiple times (sometimes 3+ times) per client, and each refresh will `getClusterSlots` from each node in the cluster. In our biggest regions we have 42 nodes, and our clients are reading 100,000+/s, so I'm imagining there may be more than just 3 refresh calls: basically up to however many `isMoved()` calls can happen before the `_refresh()` finishes, which, since our production clients are on huge nodes, could be many.

But if each `_refresh()` issues 42 commands to the cluster, and we have around 10k clients connected, this becomes a huge number of calls.

I'm also seeing that after `getClusterSlots`, it'll go through and dial each node in the cluster through the `connFn`, which points to `makeConn` in rueidis.go. This explains why our client connection metrics, which normally show around 10k clients connected, can spike to 90k+ before the inevitable timeouts, broken pipes, etc.

Proposed Solution
In a nutshell: use the `MOVED` error we get from the cluster to update the slot.

Example message:

`MOVED 217 <ip>:<port>`

Given this message, we can update slot `217` to point at the appropriate connection. If no node was newly added, we can check the existing `c.conns` to see if the target connection exists, and if it does, update the slot. The below is just to give a rough idea of what I mean.
Before going further though, I wanted to see if you have any insights on whether or not this would be a good idea. Or if this is something you considered previously.
Thank you for your time!