Add feature to update table schemas for GCStoBQ load jobs #134
Conversation
}
@Test
public void schemaUpdateSkippedWhenEnabled() throws Exception {
This name makes no sense to me: schemaUpdateSkipped, but the test verifies that the update was not skipped, right?
Is there a better name for this?
My bad, corrected 👍
GcsToBqWriter writer =
    new GcsToBqWriter(
-       storage, bigQuery, schemaManager, retries, retryWaitMs, autoCreate, mockTime);
+       storage, bigQuery, schemaManager, retries, retryWaitMs, autoCreate,false, mockTime);
There should be spaces around the "false" in the constructor, as shown below.
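For reference, the call with the suggested spacing would read:

```java
GcsToBqWriter writer =
    new GcsToBqWriter(
        storage, bigQuery, schemaManager, retries, retryWaitMs, autoCreate, false, mockTime);
```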
@@ -0,0 +1,242 @@
package com.wepay.kafka.connect.bigquery.integration;
This file needs the standard header for the build to complete.
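For reference, a typical header of this kind, assuming the project's other source files carry the standard Apache 2.0 license boilerplate (the copyright line is a placeholder):

```java
/*
 * Copyright <year> <copyright holder>
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.wepay.kafka.connect.bigquery.integration;
```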
Added it 👍
Please take a look at #88 and let me know if that change is going to impact your changes. If so, how can we make both work?
@Claudenw thanks for the heads up.
Overview
Closes #96
Schema evolution was not applied on the GCS batch path. Even with "allowNewBigQueryFields" and/or "allowSchemaUnionization" enabled, new columns were not added and required fields were not relaxed, causing LOAD jobs to fail with “No such field” errors and rejected rows.
This change enables the GCS → BigQuery batch path to update destination table schemas before issuing LOAD jobs.
When schema evolution is enabled, the batch writer derives a BigQuery schema from the batch's records and calls SchemaManager.updateSchema(...) within the existing retry/timeout budget. If all schema-evolution flags (allowNewBigQueryFields, allowBigQueryRequiredFieldRelaxation, allowSchemaUnionization) are disabled, the writer bypasses this step, preserving current behavior.

Implementation Details
BigQuerySinkTask
Capture and persist all three schema-evolution flags at start-up:
allowNewBigQueryFields
allowBigQueryRequiredFieldRelaxation
allowSchemaUnionization
Supply these flags to the GCS writer so it uses SchemaManager only when pre-patching is required (at least one flag enabled); a sketch of this flag capture follows below.
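A minimal sketch of capturing the flags at start-up, assuming they arrive as ordinary connector properties; only the three property keys come from this PR, the class and method names here are illustrative:

```java
import java.util.Map;

class SinkTaskFlagCaptureSketch {
  private boolean allowNewBigQueryFields;
  private boolean allowBigQueryRequiredFieldRelaxation;
  private boolean allowSchemaUnionization;

  // Capture and persist the three schema-evolution flags when the task starts.
  void start(Map<String, String> props) {
    allowNewBigQueryFields =
        Boolean.parseBoolean(props.getOrDefault("allowNewBigQueryFields", "false"));
    allowBigQueryRequiredFieldRelaxation =
        Boolean.parseBoolean(props.getOrDefault("allowBigQueryRequiredFieldRelaxation", "false"));
    allowSchemaUnionization =
        Boolean.parseBoolean(props.getOrDefault("allowSchemaUnionization", "false"));
  }

  // Pre-patching is required if at least one flag is enabled.
  boolean schemaUpdateNeeded() {
    return allowNewBigQueryFields
        || allowBigQueryRequiredFieldRelaxation
        || allowSchemaUnionization;
  }
}
```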
GcsToBqWriter
After ensuring the table exists, convert the batch's schema to a BigQuery schema and invoke SchemaManager.updateSchema(...) inside the existing retry/timeout wrapper, before uploading the batch payload to GCS.
If retries are exhausted, raise a clear error.
When all schema-evolution flags are disabled, this step is a no-op, preserving strict backward compatibility.
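A minimal sketch of the pre-load update under these rules, assuming SchemaManager.updateSchema(TableId, Set&lt;SinkRecord&gt;) as the entry point; everything else (names, structure, error handling) is illustrative:

```java
import java.util.Set;

import org.apache.kafka.common.utils.Time;
import org.apache.kafka.connect.sink.SinkRecord;

import com.google.cloud.bigquery.BigQueryException;
import com.google.cloud.bigquery.TableId;
import com.wepay.kafka.connect.bigquery.SchemaManager;
import com.wepay.kafka.connect.bigquery.exception.BigQueryConnectException;

class SchemaUpdateSketch {
  static void maybeUpdateSchema(SchemaManager schemaManager, TableId table,
      Set<SinkRecord> records, boolean anyEvolutionFlagEnabled,
      int retries, long retryWaitMs, Time time) {
    if (!anyEvolutionFlagEnabled) {
      return; // no-op when all flags are disabled: strict backward compatibility
    }
    for (int attempt = 0; ; attempt++) {
      try {
        schemaManager.updateSchema(table, records); // patch the table before the LOAD job
        return;
      } catch (BigQueryException e) {
        if (attempt >= retries) {
          // Retries exhausted: raise a clear error
          throw new BigQueryConnectException(
              "Failed to update schema for " + table + " within " + retries + " retries", e);
        }
        time.sleep(retryWaitMs); // stay within the existing retry/timeout budget
      }
    }
  }
}
```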
Testing
Integration test: GcsBatchSchemaEvolutionIT sends two batches through the GCS path. The first uses the original schema; the second relaxes a previously required field and adds a new nullable column. The table schema is updated before the second load, so both fields are NULLABLE and the new column appears as expected. The load completes as expected and the data aligns with the evolved schema, confirming that the batch path now respects the schema-relaxation and field-addition settings. A sketch of such an evolved pair of schemas follows below.
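For illustration, the two batch schemas could look like the following sketch (field names are assumptions, not the IT's actual fields):

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

class EvolvedSchemasSketch {
  // Batch 1: "details" is required, so BigQuery creates it as REQUIRED.
  static final Schema ORIGINAL = SchemaBuilder.struct()
      .field("id", Schema.INT64_SCHEMA)
      .field("details", Schema.STRING_SCHEMA)
      .build();

  // Batch 2: "details" relaxed to optional (NULLABLE), plus a new nullable column.
  static final Schema EVOLVED = SchemaBuilder.struct()
      .field("id", Schema.INT64_SCHEMA)
      .field("details", Schema.OPTIONAL_STRING_SCHEMA)
      .field("comment", Schema.OPTIONAL_STRING_SCHEMA)
      .build();
}
```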
For local testing, I added a short pause (about 10 seconds) between the update and the second load to watch the schema change appear in BigQuery before proceeding. You can see the table's schema before and after the second record is inserted.
Unit Tests