@tabVersion
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Added a new module for tracking the progress of refresh operations across parallel actors. This includes the introduction of a RefreshProgress structure to monitor the state of each actor during the refresh process. The changes also integrate this tracking into the existing barrier control flow, allowing for better coordination and reporting of refresh completion states. Additionally, updated related files to accommodate the new refresh progress functionality.

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • I have checked the Release Timeline and Currently Supported Versions to determine which release branches I need to cherry-pick this PR into.

Documentation

  • My PR needs documentation updates.
Release note

@tabVersion changed the title from "feat(refresh): implement refresh progress tracking for refresh table (for state switch)" to "WIP: feat(refresh): implement refresh progress tracking for refresh table (for state switch)" on Oct 29, 2025

impl RefreshProgressTracker {
/// Start tracking a new refresh operation
pub fn start_refresh(

Contributor

Is this method used anywhere?

Contributor Author

I plan to merge this function with RefreshManager.

/// Map from table_id to refresh progress
progress_map: HashMap<TableId, RefreshProgress>,
/// Map from actor_id to table_id for quick lookup
actor_to_table: HashMap<ActorId, TableId>,

Contributor

I just realized that RefreshProgressTracker also needs to handle scaling (both online and offline), which means we need to know when to update the actor maps so that they stay consistent with the table's latest parallelism.

Contributor Author

Maybe not? Since we cancel the refresh operation across recovery, I think we can assume that the actors and parallelism do not change during a run.

Contributor

OK. To assume that the parallelism does not change while a refresh is running, I think we need to ban online scaling for batch-refreshable tables. cc @shanicky

Contributor

We can skip or ban specific tables during online scaling and manual scaling.

During offline scaling, is it confirmed that there are no jobs related to those tables?

Contributor

> During offline scaling, is it confirmed that there are no jobs related to those tables?

Even if there are jobs related to these tables, I think offline scaling is always safe to perform.

Contributor Author

The underlying job is always safe to scale. The refresh process does not persist any state, so it is also safe to scale, at the cost of re-running the refresh.
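
For readers following this thread, here is a minimal sketch of the assumption being discussed: the actor set is captured once when a refresh starts, and all in-memory progress is dropped on recovery, so the maps never need to follow a scaling change. The field names `progress_map` and `actor_to_table` come from the diff above. Everything else is hypothetical and not taken from the PR: the `u32` aliases for the ID types, the `pending_actors` field, and the methods `begin_tracking`, `report_actor_finished`, and `cancel_all_on_recovery`.

```rust
use std::collections::{HashMap, HashSet};

// Stand-ins for the real ID types; plain integers keep the sketch self-contained.
type TableId = u32;
type ActorId = u32;

/// Hypothetical per-table state: the actors that have not reported yet.
#[derive(Debug, Default)]
struct RefreshProgress {
    pending_actors: HashSet<ActorId>,
}

#[derive(Debug, Default)]
struct RefreshProgressTracker {
    /// Map from table_id to refresh progress
    progress_map: HashMap<TableId, RefreshProgress>,
    /// Map from actor_id to table_id for quick lookup
    actor_to_table: HashMap<ActorId, TableId>,
}

impl RefreshProgressTracker {
    /// Hypothetical entry point: snapshot the actor set once, when the refresh starts.
    fn begin_tracking(&mut self, table_id: TableId, actors: &[ActorId]) {
        for &actor in actors {
            self.actor_to_table.insert(actor, table_id);
        }
        let pending_actors = actors.iter().copied().collect();
        self.progress_map
            .insert(table_id, RefreshProgress { pending_actors });
    }

    /// Returns the table id once its last pending actor has reported.
    /// Unknown actors (for example, after a cancel) are ignored.
    fn report_actor_finished(&mut self, actor: ActorId) -> Option<TableId> {
        let table_id = self.actor_to_table.remove(&actor)?;
        let progress = self.progress_map.get_mut(&table_id)?;
        progress.pending_actors.remove(&actor);
        if progress.pending_actors.is_empty() {
            self.progress_map.remove(&table_id);
            Some(table_id)
        } else {
            None
        }
    }

    /// Nothing is persisted, so recovery simply drops all progress.
    /// Returns the cancelled tables (sorted) so the caller can re-run them.
    fn cancel_all_on_recovery(&mut self) -> Vec<TableId> {
        self.actor_to_table.clear();
        let mut cancelled: Vec<TableId> =
            self.progress_map.drain().map(|(id, _)| id).collect();
        cancelled.sort_unstable();
        cancelled
    }
}
```

Under this sketch, a scaling event that goes through recovery costs one re-run of the refresh, which matches the trade-off described above. Online scaling without recovery would leave `actor_to_table` stale, which is why banning it for these tables was suggested.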

@chenzl25 requested review from @shanicky and @wenym1 on November 3, 2025, 04:15
Contributor

@chenzl25 left a comment

Generally LGTM

- Updated `refresh_table` method to accept shared actor information for better tracking.
- Introduced `SingleTableRefreshProgressTracker` to manage expected and finished actors during refresh operations.
- Modified barrier reporting to use `TableId` instead of `u32` for associated source IDs, improving type safety and clarity.
- Adjusted various executor implementations to align with the new source ID handling.
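
As a rough illustration of the second and third bullets, the sketch below tracks expected and finished actors for a single table and uses newtype IDs instead of bare `u32` values. Only the name `SingleTableRefreshProgressTracker` and the idea of expected and finished actors come from the commit message. The field names, method names, and the shape of the `TableId` and `ActorId` wrappers are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical newtype wrappers. With distinct types, passing an actor id
/// where a table id is expected becomes a compile error instead of a silent bug.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct TableId(u32);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ActorId(u32);

#[derive(Debug)]
struct SingleTableRefreshProgressTracker {
    table_id: TableId,
    expected_actors: HashSet<ActorId>,
    finished_actors: HashSet<ActorId>,
}

impl SingleTableRefreshProgressTracker {
    fn new(table_id: TableId, expected: impl IntoIterator<Item = ActorId>) -> Self {
        Self {
            table_id,
            expected_actors: expected.into_iter().collect(),
            finished_actors: HashSet::new(),
        }
    }

    /// Records a completion report. Reports for another table or from an
    /// unexpected actor are rejected; duplicate reports are accepted and idempotent.
    fn report_finished(&mut self, table_id: TableId, actor: ActorId) -> bool {
        if table_id != self.table_id || !self.expected_actors.contains(&actor) {
            return false;
        }
        self.finished_actors.insert(actor);
        true
    }

    /// `finished_actors` is always a subset of `expected_actors`, so equal
    /// sizes mean every expected actor has reported. Note that an empty
    /// expected set counts as finished immediately.
    fn is_finished(&self) -> bool {
        self.finished_actors.len() == self.expected_actors.len()
    }
}
```

Keeping the expected set fixed at construction time is consistent with the earlier assumption in this review that parallelism does not change during a refresh.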
graphite-app bot commented Nov 4, 2025

Looks like this PR extends new SQL syntax or updates existing ones. Make sure that:

  • Test cases about the new/updated syntax are added in src/sqlparser/tests/testdata. Especially, double check the formatted_sql is still a valid SQL #20713
  • The meaning of each enum variant is documented in PR description. Additionally, document what it means when each optional clause is omitted.

@tabVersion closed this on Nov 4, 2025
@hzxa21
Collaborator

hzxa21 commented Nov 4, 2025

Do we merge the changes altogether into #23527 instead so this PR is no longer needed?

@tabVersion
Contributor Author

> Do we merge the changes altogether into #23527 instead so this PR is no longer needed?

I messed up the git history in this PR, which made it hard to merge into the original one. I will open a new one instead.
