-
Notifications
You must be signed in to change notification settings - Fork 28
fix(startup): add retry to CreateTiStore to handle TiKV bootstrap delay #2845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(startup): add retry to CreateTiStore to handle TiKV bootstrap delay #2845
Conversation
Summary of ChangesHello @wlwilliamx, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical race condition during TiCDC startup where it could crash if the TiKV cluster was not fully bootstrapped yet. By introducing a robust retry mechanism with exponential backoff for establishing the TiKV storage connection ( Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a valuable retry mechanism to CreateTiStore, significantly improving TiCDC's startup resilience by handling delays in TiKV bootstrapping. The context propagation to support cancellation is well-implemented across the affected components. Additionally, the related hardening of the retry logic in UpdateMetaLabel is a good enhancement.
My review focuses on maintainability. I've identified some code duplication in the retry logic and constant definitions across pkg/upstream/upstream.go and pkg/pdutil/api_client.go. Centralizing these would make the retry policies more consistent and easier to manage in the future. Overall, this is a solid improvement to the stability of the system.
|
/retest |
|
/test all |
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongyunyan, lidezhu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
What problem does this PR solve?
Issue Number: close #2788
What is changed and how it works?
This PR introduces a retry mechanism when creating the TiKV storage client (
CreateTiStore) to make TiCDC startup more resilient, particularly when TiKV is not yet fully bootstrapped.The key changes are:
pkg/upstream/upstream.go:CreateTiStorefunction now accepts acontext.Context.d.OpenWithOptions(the TiKV driver) is wrapped in aretry.Doblock.defaultMaxRetryof 5, a base backoff delay of 200ms, and a max backoff delay of 4000ms.context.Canceled, ensuring it waits for TiKV to become ready.initUpstreamis updated to pass the context toCreateTiStore.Context Propagation:
CreateTiStore, thecontext.Contexthas been propagated down from the initial callers.pkg/keyspace/keyspace_manager.go: TheManagerinterface'sGetStoragemethod signature is updated toGetStorage(ctx context.Context, keyspace string). Themanagerimplementation is updated to accept thectxand pass it toCreateTiStore.api/v2/changefeed.go: Callers ofkeyspaceManager.GetStorageinCreateChangefeed,VerifyTable, andUpdateChangefeedare updated to pass the request context.logservice/txnutil/lock_resolver.go: The caller ofkeyspaceManager.GetStorageinResolveis also updated to pass the context.pkg/pdutil/api_client.go:UpdateMetaLabel(a PD API call) is enhanced.defaultMaxRetryis increased from 3 to 5.WithBackoffBaseDelay(200),WithBackoffMaxDelay(4000)) to make this call more resilient to a busy or starting PD.Why is this change necessary?
This change fixes a race condition during cluster startup, especially when using TiUP. TiUP may start the TiCDC process before the TiKV cluster has finished its bootstrapping procedure.
Previously, when TiCDC tried to initialize its connection to TiKV (via
OpenWithOptions), this call would fail if TiKV was not ready. As there was no retry, the TiCDC process would crash immediately (e.g., ~300ms after starting, while TiKV might finish bootstrapping moments later).By adding a robust retry with exponential backoff to
CreateTiStore, TiCDC will now patiently wait for TiKV to become available, resolving this startup failure and making the deployment process much more stable.Check List
Tests
Questions
Will it cause performance regression or break compatibility?
None
Do you need to update user documentation, design documentation or monitoring documentation?
None
Release note