Is your feature request related to a problem? Please describe.
Is it possible to make load_dataset more universal, similar to from_pretrained in transformers, so that it can handle both hub datasets and local-path datasets of all supported types?
Currently one has to choose a different loader depending on how the dataset has been created.
For example, this won't work:
$ git clone https://huggingface.co/datasets/severo/test-parquet
$ python -c 'from datasets import load_dataset; ds=load_dataset("test-parquet"); \
ds.save_to_disk("my_dataset"); load_dataset("my_dataset")'
[...]
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/load.py", line 1746, in load_dataset
builder_instance.download_and_prepare(
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 704, in download_and_prepare
self._download_and_prepare(
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 793, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 1277, in _prepare_split
writer.write_table(table)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_writer.py", line 524, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 2005, in table_cast
return cast_table_to_schema(table, schema)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 1968, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
child 0, item: struct<filename: string>
child 0, filename: string
Both times the dataset is being loaded from disk. Why does it fail the second time?
Why can't save_to_disk generate a dataset that can be immediately loaded by load_dataset?
For example, the simplest hack would be to have save_to_disk add some flag to the saved dataset that tells load_dataset to internally call load_from_disk, like having save_to_disk create a load_me_with_load_from_disk.txt file ;) load_dataset would then support this for datasets saved with new datasets versions, while datasets saved with older versions would still need load_from_disk to be called explicitly. Alternatively, the flag may not be needed at all if one can tell just by looking at the saved dataset that it was saved via save_to_disk, and thus use load_from_disk internally.
The use case is defining a simple API where the user only ever needs to pass a dataset_name_or_path and it will always just work. Currently one has to manually add extra switches telling the system which loading method to use. That works, but it isn't smooth.
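A minimal sketch of the kind of user-side dispatch this request would make unnecessary. It assumes that a directory written by save_to_disk can be recognized by a state.json file (single dataset) or a dataset_dict.json file (dataset dict). That on-disk layout is an assumption here and may differ between datasets versions. The names looks_saved_to_disk and load_any are hypothetical helpers, not part of the datasets API.

```python
import os

# Marker files that save_to_disk is ASSUMED to write; the exact layout
# may vary between `datasets` versions.
SAVED_DATASET_MARKERS = ("state.json", "dataset_dict.json")


def looks_saved_to_disk(path: str) -> bool:
    """Return True if `path` is a local directory containing one of the marker files."""
    return os.path.isdir(path) and any(
        os.path.isfile(os.path.join(path, name)) for name in SAVED_DATASET_MARKERS
    )


def load_any(dataset_name_or_path: str, **load_dataset_kwargs):
    """Hypothetical single entry point: pick the loader based on what is on disk."""
    # Imported lazily so the detection helper works without `datasets` installed.
    from datasets import load_dataset, load_from_disk

    if looks_saved_to_disk(dataset_name_or_path):
        # Directory looks like save_to_disk output: reload it directly.
        return load_from_disk(dataset_name_or_path)
    # Otherwise treat it as a hub name or a local directory of data files.
    return load_dataset(dataset_name_or_path, **load_dataset_kwargs)
```

With a wrapper like this, load_any("test-parquet") and load_any("my_dataset") from the example above would both go through the same call, which is the behavior being requested from load_dataset itself.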
Thank you!