Add columns support to JSON loader for selective key filtering #7652
`columns` parameter to JSON loader to filter selected columns during loading
I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.
Hi @aihao2000, just to confirm: I have made the changes you asked for!
Hi! Any update on this PR?
Cool! I added a few comments :)
# Use block_size equal to the chunk size divided by 32 to leverage multithreading
# Set a default minimum value of 16kB if the chunk size is really small
revert this comment deletion and the 2 others
revert this comment deletion and the 2 others
I wanted clarification on “the 2 others” to make sure no comment restorations were missed. I have restored the two missing comments above. Are they in the right place? :)
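For context, the two code comments under discussion describe how the read block size is derived from the chunk size. A minimal sketch of that rule follows; the helper name `pick_block_size` and its `min_block_size` parameter are hypothetical, and 16kB is assumed to mean 16 * 1024 bytes.

```python
def pick_block_size(chunksize: int, min_block_size: int = 16 << 10) -> int:
    """Return the chunk size divided by 32, but never less than the minimum.

    Hypothetical helper for illustration; the name and signature are not
    taken from the loader itself.
    """
    # Integer division keeps the block size a whole number of bytes.
    return max(chunksize // 32, min_block_size)


# A 10 MiB chunk is split into 32 blocks of 327680 bytes each.
print(pick_block_size(10 << 20))
# A 64 KiB chunk would give 2048 bytes, so the 16 KiB floor applies.
print(pick_block_size(64 << 10))
```

Smaller blocks give the reader more units to parse in parallel, while the floor avoids blocks too small to be worth scheduling.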
if self.config.columns is not None:
    missing_cols = [col for col in self.config.columns if col not in pa_table.column_names]
    for col in missing_cols:
        pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
    pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
I would keep this at the end, where you removed the `yield` - this way the `try/except` is only about the `paj.read_json` call
    for col in missing_cols:
        pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
    pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
same
Co-authored-by: Quentin Lhoest <[email protected]>
# Pandas fallback in case of ArrowInvalid
try:
this code is not at the right location anymore: it should trigger on ArrowInvalid
I’ve moved the Pandas fallback into the `except pa.ArrowInvalid` block, will you check?
Fixes #7594
This PR adds support for filtering specific columns when loading datasets from .json or .jsonl files — similar to how the columns=... argument works for Parquet.
As suggested, support for the `columns=...` argument (previously available for Parquet) has now been extended to JSON and JSONL loading via `load_dataset(...)`. You can now load only specific keys/columns and skip the rest, which should help in cases where some fields are unclean, inconsistent, or just unnecessary.
Example:
Summary of changes:
- Add `columns: Optional[List[str]]` to `JsonConfig`
- Update `_generate_tables()` to filter selected columns
- Pass the `columns` argument from `load_dataset()` to the config
Let me know if you'd like the same to be added for CSV or others as a follow-up, happy to help.
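The config change in the summary can be pictured with a small sketch. Both `JsonConfigSketch` and `make_config` are hypothetical stand-ins that show only how an optional `columns` field could receive a keyword argument; they are not the library's actual classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JsonConfigSketch:
    """Hypothetical stand-in for the loader config; not the real JsonConfig."""

    # None means "keep every column", so existing callers are unaffected.
    columns: Optional[List[str]] = None


def make_config(**kwargs) -> JsonConfigSketch:
    """Forward keyword arguments to the config, as a loader entry point might."""
    return JsonConfigSketch(**kwargs)


config = make_config(columns=["text", "label"])
print(config.columns)  # ['text', 'label']
```

Defaulting the field to `None` keeps the change backward compatible: filtering happens only when a caller passes `columns` explicitly.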