
Conversation

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 27, 2025

Fixes #7594
This PR adds support for filtering specific columns when loading datasets from .json or .jsonl files — similar to how the columns=... argument works for Parquet.

As suggested, support for the columns=... argument (previously available for Parquet) has now been extended to JSON and JSONL loading via load_dataset(...). You can now load only specific keys/columns and skip the rest — which should help in cases where some fields are unclean, inconsistent, or just unnecessary.

Example:

```python
from datasets import load_dataset

dataset = load_dataset("json", data_files="your_data.jsonl", columns=["id", "title"])
print(dataset["train"].column_names)
# Output: ['id', 'title']
```

Summary of changes:

  • Added columns: Optional[List[str]] to JsonConfig
  • Updated _generate_tables() to filter selected columns
  • Forwarded columns argument from load_dataset() to the config
  • Added a test for validation (should be fine!)

Let me know if you'd like the same to be added for CSV or others as a follow-up — happy to help.

@ArjunJagdale ArjunJagdale changed the title temp1 Add columns parameter to JSON loader to filter selected columns during loading Jun 27, 2025
@ArjunJagdale ArjunJagdale changed the title Add columns parameter to JSON loader to filter selected columns during loading Add columns support to JSON loader for selective key filtering Jun 27, 2025
@aihao2000

I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

@ArjunJagdale
Contributor Author

> I need this feature right now. It would be great if it could automatically fill in None for non-existent keys instead of reporting an error.

Hi @aihao2000, just to confirm: this is now implemented!
If you pass columns=["key1", "key2", "optional_key"] to load_dataset(..., columns=...) and any of those keys are missing from the input JSON objects, the loader automatically fills those columns with None values instead of raising an error.
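At the level of individual JSON records, the behavior described above amounts to the following (a stdlib-only illustration of the semantics; the real loader operates on Arrow tables, not Python dicts):

```python
import json

def project_record(line: str, columns: list) -> dict:
    """Keep only the requested keys; absent keys default to None."""
    obj = json.loads(line)
    return {col: obj.get(col) for col in columns}

rows = ['{"key1": 1, "key2": "x"}', '{"key1": 2}']
projected = [project_record(r, ["key1", "key2", "optional_key"]) for r in rows]
print(projected[1])  # {'key1': 2, 'key2': None, 'optional_key': None}
```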

@ArjunJagdale
Contributor Author

Hi! Any update on this PR?

Member

@lhoestq lhoestq left a comment

Cool ! I added a few comments :)

Comment on lines -116 to -131

```python
# Use block_size equal to the chunk size divided by 32 to leverage multithreading
# Set a default minimum value of 16kB if the chunk size is really small
```
Member

revert this comment deletion and the 2 others

Contributor Author


> revert this comment deletion and the 2 others

Wanted clarification on "the 2 others" to make sure no comment restorations were missed. I have restored the two missing comments above; are they in the right place? :)

Comment on lines 145 to 150

```python
if self.config.columns is not None:
    missing_cols = [col for col in self.config.columns if col not in pa_table.column_names]
    for col in missing_cols:
        pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
    pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
```
Member

I would keep this at the end, where you removed the yield - this way the try/except is only about the paj.read_json call

```python
for col in missing_cols:
    pa_table = pa_table.append_column(col, pa.array([None] * pa_table.num_rows))
pa_table = pa_table.select(self.config.columns)
yield (file_idx, batch_idx), self._cast_table(pa_table)
```
Member
same

Comment on lines 183 to 184

```python
# Pandas fallback in case of ArrowInvalid
try:
```
Member

this code is not at the right location anymore: it should trigger on ArrowInvalid

Contributor Author

I've moved the Pandas fallback into the except pa.ArrowInvalid block; could you take a look?

Development

Successfully merging this pull request may close these issues.

Add option to ignore keys/columns when loading a dataset from jsonl(or any other data format)