
Conversation

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7594
This PR adds support for filtering specific columns when loading datasets from .json or .jsonl files — similar to how the columns=... argument works for Parquet.

As suggested, support for the columns=... argument (previously available for Parquet) has now been extended to JSON and JSONL loading via load_dataset(...). You can now load only specific keys/columns and skip the rest — which should help in cases where some fields are unclean, inconsistent, or just unnecessary.

Example:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']
```

Summary of changes:

  • Added columns: Optional[List[str]] to JsonConfig
  • Updated _generate_tables() to filter selected columns
  • Forwarded columns argument from load_dataset() to the config
  • Added a test for validation (should be fine!)

Let me know if you'd like the same to be added for CSV or others as a follow-up — happy to help.

@ArjunJagdale ArjunJagdale changed the title temp1 Add columns parameter to JSON loader to filter selected columns during loading Jun 27, 2025
@ArjunJagdale ArjunJagdale changed the title Add columns parameter to JSON loader to filter selected columns during loading Add columns support to JSON loader for selective key filtering Jun 27, 2025
@aihao2000

I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

@ArjunJagdale
Contributor Author

> I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

Hi @aihao2000, just to confirm: this is now implemented!
If you pass columns=["key1", "key2", "optional_key"] to load_dataset(..., columns=...) and any of those keys are missing from the input JSON objects, the loader automatically fills those columns with None values instead of raising an error.
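At the level of individual JSON records, the behavior described above amounts to the following (a stdlib-only illustration of the semantics; the real loader operates on Arrow tables, not Python dicts):

```python
import json

def project_record(line: str, columns: list) -> dict:
    """Keep only the requested keys; absent keys default to None."""
    obj = json.loads(line)
    return {col: obj.get(col) for col in columns}

rows = ['{"key1": 1, "key2": "x"}', '{"key1": 2}']
projected = [project_record(r, ["key1", "key2", "optional_key"]) for r in rows]
print(projected[1])  # {'key1': 2, 'key2': None, 'optional_key': None}
```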

@ArjunJagdale
Contributor Author

Hi! Any update on this PR?

Member

@lhoestq lhoestq left a comment

Cool ! I added a few comments :)

Comment on lines -116 to -131

```python
# Use block_size equal to the chunk size divided by 32 to leverage multithreading
# Set a default minimum value of 16kB if the chunk size is really small
```
Member

revert this comment deletion and the 2 others

Contributor Author


> revert this comment deletion and the 2 others

Wanted clarification on "the 2 others" to make sure no comment restorations were missed. I have restored the two missing comments above; are they in the right place? :)

Comment on lines 145 to 150

```python
if self.config.columns is not None:
    missing_cols = [col for col in self.config.columns if col not in pa_table.column_names]
    for col in missing_cols:
        pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
    pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
```
Member

I would keep this at the end, where you removed the yield - this way the try/except is only about the paj.read_json call

```python
for col in missing_cols:
    pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
```
Member
same

Comment on lines 183 to 184

```python
# Pandas fallback in case of ArrowInvalid
try:
```
Member

this code is not at the right location anymore: it should trigger on ArrowInvalid

Contributor Author

I've moved the Pandas fallback into the except pa.ArrowInvalid block; could you take a look?

Development

Successfully merging this pull request may close these issues.

Add option to ignore keys/columns when loading a dataset from jsonl(or any other data format)