integrate load_from_disk into load_dataset  #5044

@stas00

Is your feature request related to a problem? Please describe.

Is it possible to make load_dataset more universal, similar to from_pretrained in transformers, so that it can handle both Hub datasets and local-path datasets of all supported types?

Currently one has to choose a different loader depending on how the dataset has been created.

For example, this doesn't work:

$ git clone https://huggingface.co/datasets/severo/test-parquet
$ python -c 'from datasets import load_dataset; ds=load_dataset("test-parquet"); \
ds.save_to_disk("my_dataset"); load_dataset("my_dataset")'

[...]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/load.py", line 1746, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 1277, in _prepare_split
    writer.write_table(table)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_writer.py", line 524, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 2005, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 1968, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string

Both times the dataset is loaded from disk. Why does it fail the second time?

Why can't save_to_disk generate a dataset that can be immediately loaded by load_dataset?

For example, the simplest hack would be for save_to_disk to add a flag to the saved dataset that tells load_dataset to call load_from_disk internally, such as having save_to_disk create a load_me_with_load_from_disk.txt file ;) load_dataset would then support this for datasets saved with new versions of datasets, while older saved datasets would still need an explicit load_from_disk call. The flag may not even be needed if one can tell just by looking at the saved dataset that it was written by save_to_disk, in which case load_dataset could use load_from_disk internally.

The use case is a simple API where the user only ever passes a dataset_name_or_path and it always just works. Currently one has to add extra switches that tell the system which loading method to use, which works but isn't smooth.
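
To make the idea concrete, here is a rough user-side sketch of the dispatch described above. It is not how datasets works today. The helper names (looks_like_saved_to_disk, load_any) are hypothetical, and the marker filenames (state.json for a single dataset, dataset_dict.json for a dataset dict) are an assumption about what save_to_disk currently writes, not a documented contract.

```python
import os

# Files that save_to_disk is ASSUMED to write at the top level of the
# output directory. This is an observation about the on-disk layout,
# not a guaranteed interface.
SAVED_DATASET_MARKERS = ("state.json", "dataset_dict.json")


def looks_like_saved_to_disk(path):
    """Return True if `path` is a directory containing a save_to_disk marker file."""
    return os.path.isdir(path) and any(
        os.path.isfile(os.path.join(path, name)) for name in SAVED_DATASET_MARKERS
    )


def load_any(dataset_name_or_path, **kwargs):
    """Hypothetical wrapper: pick load_from_disk or load_dataset automatically."""
    # Imported lazily so the detection helper has no third-party dependency.
    from datasets import load_dataset, load_from_disk

    if looks_like_saved_to_disk(dataset_name_or_path):
        # kwargs are meant for load_dataset and are ignored on this branch.
        return load_from_disk(dataset_name_or_path)
    return load_dataset(dataset_name_or_path, **kwargs)
```

With something like this, load_any("my_dataset") and load_any("test-parquet") would both go through a single entry point, which is the behavior I'd like load_dataset itself to have.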

Thank you!

Labels: enhancement (New feature or request)
