integrate `load_from_disk` into `load_dataset` 

**Is your feature request related to a problem? Please describe.**

Is it possible to make `load_dataset` more universal similar to `from_pretrained` in `transformers` so that it can handle the hub, and the local path datasets of all supported types?

Currently one has to choose a different loader depending on how the dataset has been created.

e.g. this won't work:

```
$ git clone https://huggingface.co/datasets/severo/test-parquet
$ python -c 'from datasets import load_dataset; ds=load_dataset("test-parquet"); \
ds.save_to_disk("my_dataset"); load_dataset("my_dataset")'

[...]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/load.py", line 1746, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 1277, in _prepare_split
    writer.write_table(table)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_writer.py", line 524, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 2005, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 1968, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string
```

both times the dataset is being loaded from disk. Why does it fail the second time?

Why can't `save_to_disk` generate a dataset that can be immediately loaded by `load_dataset`?

e.g. the simplest hack would be to have `save_to_disk` add some flag to the saved dataset, that tells `load_dataset` to internally call `load_from_disk`. like having `save_to_disk` create a `load_me_with_load_from_disk.txt` file ;) and `load_dataset` will support that feature from saved datasets from new `datasets` versions. The old ones will still need to use `load_from_disk` explicitly. Unless the flag is not needed and one can immediately tell by looking at the saved dataset that it was saved via `save_to_disk` and thus use `load_from_disk` internally.

The use-case is defining a simple API where the user only ever needs to pass a `dataset_name_or_path` and it will always just work. Currently one needs to manually add additional switches telling the system whether to use one loading method or the other which works but it's not smooth.

Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

integrate `load_from_disk` into `load_dataset` #5044

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

integrate load_from_disk into load_dataset #5044

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

integrate `load_from_disk` into `load_dataset` #5044