Flink: Dynamic Sink: Allow updating table properties #13883
Conversation
```java
try {
  UpdateProperties updateApi = table.updateProperties();

  // Remove properties that are no longer present
```
So we consider the Flink job as the source of truth for properties, and remove properties from tables if they were added by external pipelines?
In large-scale ingestion pipelines, if we need to add a new property, we will need to update the TablePropertiesUpdaterImpl and redeploy, right? That may not be feasible, and it may also put too much pressure on the catalog on startup.
But that also means there is no single source of truth :)
If this feature is enabled, which is optional, the job is the source of truth.
> In large-scale ingestion pipelines, if we need to add a new property, we will need to update the TablePropertiesUpdaterImpl and redeploy, right? That may not be feasible, and it may also put too much pressure on the catalog on startup.
This is no different from other operations (e.g. table creation, schema changes) that the sink may perform.
I was thinking that updating table properties across all tables (a TablePropertiesUpdater code update) is more common than schema updates applied to all tables at the same time on startup.
I'm not 100% sure, though, how the proposed needsRefresh() method will make a difference. If the user returns the same properties, we already skip the update. Concerning the timing of the update, I think properties will have to be updated upon first seeing the table (that is, if there are changes to the properties). Not sure if there is a way around that.
```java
  }
}

private void updateTablePropertiesIfNeeded(TableIdentifier identifier) {
```
This is basically run for every DynamicRecord, right?
Can we avoid this check for every DynamicRecord where the user's TablePropertiesUpdater does not vary/depend on the current properties of the table?
Like, I am thinking it might be possible to extend the TablePropertiesUpdater interface with a needsRefresh() method that defaults to true. Users could then provide less expensive checks (e.g., return false after the first record of a table in a task).
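A minimal sketch of that suggestion, assuming the interface shape shown later in this PR; the `needsRefresh()` name and its argument are hypothetical, taken from this comment, not from the PR itself:

```java
import java.io.Serializable;
import java.util.Map;

@FunctionalInterface
interface TablePropertiesUpdater extends Serializable {

  // The abstract method from this PR: current properties in, updated properties out.
  Map<String, String> apply(String tableName, Map<String, String> currentProperties);

  // Proposed addition (hypothetical): defaults to true, preserving the current
  // per-record behavior; users can override it with a cheaper check, e.g.
  // return false after the first record of a table in a task.
  default boolean needsRefresh(String tableName) {
    return true;
  }
}
```

Because the default method keeps the interface functional, existing lambda implementations would continue to compile unchanged.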
This check isn't that expensive:
- Table properties are cached
- There are no calls to the UpdateProperties API if the properties for the table did not change
- Also, if no handler is specified, the check is skipped entirely (the feature is then disabled)
> Can we avoid this check for every DynamicRecord where the user's TablePropertiesUpdater does not vary/depend on the current properties of the table?
We can certainly do that.
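One way to do that could be a per-task guard that remembers which tables have already been reconciled; this is an illustrative sketch only, and the `TablePropertiesGuard` name and its placement are not part of the PR:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-task guard: records for an already-seen table skip the
// property comparison entirely, so the check runs once per table per task.
class TablePropertiesGuard {
  private final Set<String> reconciled = new HashSet<>();

  // Set.add returns true only if the table was not seen before, so the
  // first record of each table triggers the check and later ones skip it.
  boolean shouldCheck(String tableName) {
    return reconciled.add(tableName);
  }
}
```

This trades per-record work for the assumption that properties do not need to be re-reconciled within the lifetime of a task.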
```java
import java.util.Map;

@FunctionalInterface
public interface TablePropertiesUpdater extends Serializable {
```
Could you please elaborate on why we need an updater interface for the properties?
Why not just an expected Map<String, String> for the values? The DynamicRecord could contain the expected properties map for the table, and we could act upon that.
I was contemplating adding the properties to DynamicRecord, but I decided against it because table properties aren't directly connected to the data.
Perhaps there are other table-related settings, like the table location, that users want to control (this has already been requested by users). The location would only be settable during table creation. With a separate interface we can better express this intent.
As per our discussion, the next request will be to change the location of the table on table creation.
Maybe we should delegate table creation to the user so they can manipulate whatever they want.
I'm a little more conflicted about the update. Is it really something we have a use case for? Why would we want to update properties of an already existing table? This seems like a different problem than writing into a table. If this is a requirement, maybe they could do it concurrently in a parallel flow? The properties should not affect the writing, so parallel writes could continue even if the properties need to be changed.
```java
// Apply table properties during table creation if updater is provided
Map<String, String> properties = Maps.newHashMap();
if (tablePropertiesUpdater != null) {
  properties = tablePropertiesUpdater.apply(identifier.toString(), properties);
```
Instead of passing identifier.toString() as the tableName in TablePropertiesUpdater, could we modify TablePropertiesUpdater to accept a TableIdentifier? This would be a bit more accurate, since identifier.toString() is composed of both the namespace and the tableName. Plus, users may want to update properties such as write.data.path based on the namespace and/or the tableName.
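A rough sketch of the suggested signature change, using a minimal stand-in for Iceberg's TableIdentifier (the real class lives in org.apache.iceberg.catalog and offers factory methods and multi-level namespaces; the stand-in exists only to keep this sketch self-contained):

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for org.apache.iceberg.catalog.TableIdentifier,
// only to illustrate the suggested signature.
record TableIdentifier(String namespace, String name) {}

// Suggested variant: pass the structured identifier instead of its string
// form, so implementations can branch on namespace and table name
// separately (e.g. when deriving write.data.path).
@FunctionalInterface
interface TablePropertiesUpdater extends Serializable {
  Map<String, String> apply(TableIdentifier identifier, Map<String, String> currentProperties);
}
```

An implementation deriving a path from both parts might then look like `(id, props) -> { … put("write.data.path", base + "/" + id.namespace() + "/" + id.name()); … }`.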
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This feature allows modifying table properties on the fly. Users have reported that they need to update the table properties of both new and existing tables. Some catalogs even check for certain properties and reject table changes if those properties are not present.
Below is the documentation added as part of this PR:
Dynamic Table Properties
The Dynamic Sink supports dynamically updating table properties. This feature allows you to manage table properties for both newly created and existing tables.
TablePropertiesUpdater Interface
The TablePropertiesUpdater is an interface that receives the fully-qualified table name and the current properties, then returns the updated properties.
Usage Example
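The excerpt cuts off before the actual example, so the following is a hedged sketch of what an implementation might look like; the property value chosen and the holder class name are illustrative, and how the updater is registered with the sink builder is not shown in this excerpt:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// The interface as described above: fully-qualified table name and current
// properties in, updated properties out.
@FunctionalInterface
interface TablePropertiesUpdater extends Serializable {
  Map<String, String> apply(String tableName, Map<String, String> currentProperties);
}

class TablePropertiesExample {
  // Hypothetical updater: ensure every table written by the sink carries a
  // Parquet compression codec property, leaving all other properties intact.
  static final TablePropertiesUpdater UPDATER =
      (tableName, currentProperties) -> {
        Map<String, String> updated = new HashMap<>(currentProperties);
        updated.put("write.parquet.compression-codec", "zstd");
        return updated;
      };
}
```

Returning the input map unchanged signals that no update is needed, which (per the discussion above) causes the sink to skip the UpdateProperties call.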