[SPEC] Add implementation note about schema evolution #13936
bc4825b to
dd108d3
Compare
format/spec.md
Outdated
### Schema evolution and writing with old schemas

Writers should write out all fields with the types specified from the table schema. Inserts or upserts are allowed with an outdated schema (updates must use the latest schema to avoid data loss). Column projection rules are designed so that the table will remain readable even if writers use an outdated schema in these cases.
> Inserts or upserts are allowed with an outdated schema (updates must use the latest schema to avoid data loss).

This sentence is less clear to me. Can we just say `Writers are allowed with an outdated schema.`? I didn't quite get this part: `updates must use the latest schema to avoid data loss`.

> Column projection rules are designed so that the table will remain readable even if writers use an outdated schema in these cases.

Also, can we swap this sentence with the one below? This way, the first paragraph is about writes and the second paragraph is about reads.
Thank you for the feedback.
> this sentence is less clear to me. Can we just say Writers are allowed with an outdated schema.?

I don't think this is a universally true statement. I added more details, PTAL.

> Also can we switch this sentence to the one below? This way, the first paragraph is about write and the second paragraph is about read.

Done.
the new details help. Thanks!
Co-authored-by: Russell Spitzer <[email protected]>
Thanks @emkornfield for the PR! Thanks everyone for the review!
I think this counts as errata, but I would recommend we do a quick dev list heads-up before doing a final merge in the future.
Thanks, @RussellSpitzer, I appreciate the note! I merged because this was framed as a clarification (not a behavior change). I should have sent a quick dev list heads-up first. I’ll follow that practice going forward. If anyone prefers we revert and run a vote, I’m happy to do that.
* For all null columns, not writing out the column would cause the `initial-default` value to be applied on reading instead of `null`.
* If `write-default` has been changed, then using an out-of-date schema would result in the incorrect value being populated.
* If a `write` is the result of a partial row update (e.g. `update table set col_y = 'xyz'`), an out-of-date schema would silently drop values.
Can you clarify this? When could this happen? Is this if the column is dropped?
If an old schema is used, then you implicitly end up dropping columns because you can't read columns you don't know about. Thinking more about it, this should be unlikely to happen because you probably would have to replay the transaction anyways. But effectively the sequence would be:
- Writer A writes new schema with added columns and new data for the added column.
- Writer B uses an old schema (this would have to happen strictly after step 1) and reads the new data, modifying an existing column.
- Writer B's updates would drop the new data from the added column.
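
The sequence above can be sketched with a toy in-memory model. This is only an illustration, not Iceberg's API: rows are plain dicts, a schema is a list of column names, and `project`, `old_schema`, `new_schema`, and the column `col_z` are hypothetical names chosen for this example.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def project(row: Row, schema: List[str],
            defaults: Optional[Dict[str, Any]] = None) -> Row:
    """Read a stored row through a schema (toy stand-in for column projection).

    Columns the schema names but the stored row lacks take the supplied
    default, or None when no default is given.
    """
    defaults = defaults or {}
    return {col: row.get(col, defaults.get(col)) for col in schema}


old_schema = ["id", "col_y"]           # what Writer B still believes
new_schema = ["id", "col_y", "col_z"]  # col_z was added by Writer A

# Step 1: Writer A writes with the new schema, including data for col_z.
stored = {"id": 1, "col_y": "abc", "col_z": "new data"}

# Step 2: Writer B reads through the old schema, so it never sees col_z,
# and then modifies an existing column.
seen_by_b = project(stored, old_schema)
seen_by_b["col_y"] = "xyz"

# Step 3: Writer B rewrites the row using only the columns it knows about.
stored_after_b = dict(seen_by_b)

# A reader on the latest schema now finds no stored value for col_z.
result = project(stored_after_b, new_schema)
print(result)
```

In this sketch the value Writer A stored in `col_z` is gone after Writer B's rewrite, which is the silent drop described in the third bullet. Inserts do not hit this problem in the same way because they do not rewrite rows that already carry values for the newer columns.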
Based on the mailing list discussion, this tries to capture the semantics of type promotion/schema evolution and old writers.