Hi @bseto, thanks for the discussion. We did something similar, but partially: #589. I think you might also need to do the optimization in the same place. To update the slots, we will need to acquire the full lock, not just the RLock. I am not sure if this is really a good idea, though. But if it works well in your case, I think we are good. I also wonder how we can improve
Hello!
I'd like to add an option to avoid doing a full refresh of the `clusterClient`'s `pslots`, `rslots` and `conns` when a client receives a `MOVED` error.

Why:
When a kvrocks slot migration completes, our rueidis clients seem to disconnect and then attempt to reconnect. While reconnecting, I can see the client connection metrics spike from 10k clients to 90k+ sometimes, which ends up DDoS'ing the cluster. The cluster stays up, but none of the clients can do any meaningful work until we take down half of them and allow the rest to slowly connect.
What I think is the Problem:
I'm currently writing some modifications to a fork to test and confirm this, but what I think is happening is that when we receive a `MOVED` error, `isMoved()` in cluster.go's `shouldRefreshRetry` calls `lazyRefresh()`. I am able to see from my own local testing that a refresh can be triggered multiple times (sometimes 3+ times) per client, and each refresh will `getClusterSlots` from each node in the cluster. In our biggest regions we have 42 nodes, and our clients are reading 100,000+/s, so I'm imagining there may be more than just 3 refresh calls: basically up to however many `isMoved()` calls can happen before the `_refresh()` finishes, which, since our production clients are on huge nodes, could be many.

But if each `_refresh()` issues 42 commands to the cluster, and we have around 10k clients connected, this becomes a huge number of calls.

I'm also seeing that after `getClusterSlots`, it'll go through and dial each node in the cluster through the `connFn`, which points to `makeConn` in rueidis.go. This explains why our client connection metrics, which normally show around 10k clients connected, can spike to 90k+ before the inevitable timeouts, broken pipes, etc.

Proposed Solution
In a nutshell: use the `MOVED` error we get from the cluster to update the slot.

Example message:

`MOVED 217 <ip>:<port>`

Given this message, we can update slot `217` to point at the appropriate connection. If no node was newly added, we can check the existing `c.conns` to see if the target connection exists, and if it does, update the slot. The below is just to give a rough idea of what I mean.
Before going further though, I wanted to see if you have any insights on whether or not this would be a good idea. Or if this is something you considered previously.
Thank you for your time!