
Conversation

nkaradzhov
Collaborator

1) When an RE failover happens, there is a disconnect.
2) The affected client reconnects and tries to resubscribe all existing listeners.

ISSUE #1: CROSSSLOT error - the client was sending ssubscribe ch1 ch2 .. chN, which after the failover could result in CROSSSLOT (naturally, because some slots may now be owned by other shards).
FIX: send one ssubscribe command per channel instead of one ssubscribe for all channels.

ISSUE #2: MOVED error - some or all of the channels might have been moved elsewhere.
FIX: 1: propagate the error to the Cluster. 2: the Cluster rediscovers the topology.
3: extract all existing subscriptions from all pubsub clients and resubscribe
over the new topology.

fixes: #2902
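For illustration only, a minimal TypeScript sketch of the per-channel resubscribe idea described above, assuming a client shaped roughly like node-redis. The `PubSubClient` interface, `sendCommand` helper, and `subscriptions` map are hypothetical stand-ins, not the actual internals touched by this PR.

```typescript
// Hypothetical sketch: resubscribe each sharded channel individually so one
// CROSSSLOT error cannot fail the whole batch after a failover.
type Listener = (message: string, channel: string) => void;

interface PubSubClient {
  // Stand-in for the client's raw command interface (hypothetical).
  sendCommand(args: Array<string>): Promise<unknown>;
}

async function resubscribeSharded(
  client: PubSubClient,
  subscriptions: Map<string, Set<Listener>>
): Promise<string[]> {
  const moved: string[] = [];

  for (const channel of subscriptions.keys()) {
    try {
      // One SSUBSCRIBE per channel instead of SSUBSCRIBE ch1 ch2 ... chN,
      // so channels whose slots are still owned by this shard keep working.
      await client.sendCommand(['SSUBSCRIBE', channel]);
    } catch (err) {
      // A MOVED reply means the slot now lives on another shard; collect the
      // channel so the cluster layer can rediscover the topology and
      // resubscribe it on the right node.
      if (err instanceof Error && err.message.startsWith('MOVED')) {
        moved.push(channel);
      } else {
        throw err;
      }
    }
  }

  return moved;
}
```

The second half of the fix sits outside this sketch: the collected channels are handed back to the Cluster, which rediscovers the topology and redistributes all existing subscriptions across the shards that now own them.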

@nkaradzhov nkaradzhov force-pushed the cluster-failover-issue branch from ff426bc to 2a766b7 Compare October 9, 2025 14:15
@nkaradzhov nkaradzhov marked this pull request as draft October 9, 2025 14:15
@nkaradzhov
Collaborator Author

run sharded pub sub tests

@uglide
Contributor

uglide commented Oct 9, 2025

| Testcase | Errors | Failures | Skipped | Total |
| --- | --- | --- | --- | --- |
| Root Suite | 0 | 0 | 0 | 0 |
| Sharded Pub/Sub E2E | 0 | 0 | 0 | 0 |
| Single Subscriber | 0 | 0 | 0 | 3 |
| Multiple Subscribers | 0 | 1 | 0 | 2 |

---- Details for maintainers

@nkaradzhov nkaradzhov force-pushed the cluster-failover-issue branch from 2a766b7 to b07e39b Compare October 10, 2025 08:48
@nkaradzhov
Collaborator Author

run sharded pub sub tests

@uglide
Contributor

uglide commented Oct 10, 2025

| Testcase | Errors | Failures | Skipped | Total |
| --- | --- | --- | --- | --- |
| Root Suite | 0 | 0 | 0 | 0 |
| Sharded Pub/Sub E2E | 0 | 0 | 0 | 0 |
| Single Subscriber | 0 | 0 | 0 | 3 |
| Multiple Subscribers | 0 | 0 | 0 | 2 |

---- Details for maintainers

@nkaradzhov nkaradzhov force-pushed the cluster-failover-issue branch from b07e39b to 4423777 Compare October 10, 2025 09:16
@vaibhavkumar-sf

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ No key issues found

The PR code appears to be well-structured and follows good practices. The reviewer should still validate the logic and ensure the implementation meets requirements.

Member

@bobymicroby bobymicroby left a comment


LGTM

@nkaradzhov
Collaborator Author

run sharded pub sub tests

@uglide
Contributor

uglide commented Oct 10, 2025

| Testcase | Errors | Failures | Skipped | Total |
| --- | --- | --- | --- | --- |

---- Details for maintainers

@nkaradzhov
Collaborator Author

run sharded pub sub tests

@uglide
Contributor

uglide commented Oct 10, 2025

| Testcase | Errors | Failures | Skipped | Total |
| --- | --- | --- | --- | --- |
| Root Suite | 0 | 0 | 0 | 0 |
| Sharded Pub/Sub E2E | 0 | 0 | 0 | 0 |
| Single Subscriber | 0 | 0 | 0 | 3 |
| Multiple Subscribers | 0 | 0 | 0 | 2 |

---- Details for maintainers

@nkaradzhov nkaradzhov force-pushed the cluster-failover-issue branch from 113229c to 697402f Compare October 10, 2025 15:22
@nkaradzhov
Collaborator Author

run sharded pub sub tests

@uglide
Contributor

uglide commented Oct 10, 2025

| Testcase | Errors | Failures | Skipped | Total |
| --- | --- | --- | --- | --- |
| Root Suite | 0 | 0 | 0 | 0 |
| Sharded Pub/Sub E2E | 0 | 0 | 0 | 0 |
| Single Subscriber | 0 | 0 | 0 | 3 |
| Multiple Subscribers | 0 | 0 | 0 | 2 |

---- Details for maintainers
