feat(load): fallback to load_from_disk() when loading a saved dataset directory #7653

ArjunJagdale
Contributor

Related Issue

Fixes #7503
Partially addresses #5044 by allowing load_dataset() to auto-detect and gracefully delegate to load_from_disk() for locally saved datasets.


What does this PR do?

This PR introduces a minimal fallback mechanism in load_dataset() that detects when the provided path points to a dataset saved using save_to_disk(), and automatically redirects to load_from_disk().

🐛 Before (unexpected metadata-only rows):

ds = load_dataset("/path/to/saved_dataset")
# → returns rows with only internal metadata (_data_files, _fingerprint, etc.)

✅ After (graceful fallback):

ds = load_dataset("/path/to/saved_dataset")
# → logs a warning and internally switches to load_from_disk()

Why is this useful?

  • Prevents confusion when reloading local datasets saved via save_to_disk().
  • Enables smoother compatibility with frameworks (e.g., TRL, lighteval) that rely on load_dataset() calls.
  • Fully backward-compatible: hub-based loading, custom builders, and streaming remain untouched.
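The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper names `looks_like_saved_dataset` and `load_dataset_with_fallback` are hypothetical, and the detection rule (looking for the `state.json` / `dataset_info.json` pair or a `dataset_dict.json` file that `save_to_disk()` writes) is an assumption about how such a check might work.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Marker files written by save_to_disk(). Using them for detection is an
# assumption made for this sketch, not necessarily what the PR checks.
DATASET_MARKERS = ("state.json", "dataset_info.json")  # single Dataset
DATASET_DICT_MARKER = "dataset_dict.json"              # DatasetDict


def looks_like_saved_dataset(path) -> bool:
    """Return True if `path` is a directory that looks like save_to_disk() output."""
    p = Path(path)
    if not p.is_dir():
        return False
    if (p / DATASET_DICT_MARKER).is_file():
        return True
    return all((p / name).is_file() for name in DATASET_MARKERS)


def load_dataset_with_fallback(path, **kwargs):
    """Hypothetical wrapper: delegate to load_from_disk() for saved directories."""
    # Imported lazily so the detection helper above works without `datasets`.
    from datasets import load_dataset, load_from_disk

    if looks_like_saved_dataset(path):
        logger.warning(
            "'%s' looks like a save_to_disk() directory; using load_from_disk().",
            path,
        )
        return load_from_disk(path)
    return load_dataset(path, **kwargs)
```

Keeping the detection in a separate helper means hub names, custom builders, and streaming calls never reach the fallback branch unless the path is a local directory carrying the marker files.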

…`load_dataset`

### Related Issue
Fixes huggingface#7503

### What does this PR do?

This PR introduces a fallback mechanism in `load_dataset()` that detects when the input `path` points to a dataset previously saved using `save_to_disk()`, and automatically redirects to `load_from_disk(path)`.

Previously, calling `load_dataset("/path/to/saved/dataset")` would misread the saved directory's files as data and return rows containing only internal metadata. Now:

```python
from datasets import load_dataset

# Before: unexpected result
ds = load_dataset("my_saved_dataset")  # Misinterprets metadata

# After: correct behavior
ds = load_dataset("my_saved_dataset")  # Auto-switches to load_from_disk()
```
Successfully merging this pull request may close these issues.

Inconsistency between load_dataset and load_from_disk functionality