fix: raise error when folder-based datasets are loaded without data_dir or data_files #7618

Open: wants to merge 1 commit into main

ArjunJagdale
Contributor

@ArjunJagdale ArjunJagdale commented Jun 16, 2025

Related Issues/PRs

#6152

What changes are proposed in this pull request?

This PR adds an early validation step for folder-based datasets (like audiofolder) so that a missing data source no longer falls back silently to the current working directory.

Before this fix:

  • When neither data_dir nor data_files was provided, the loader defaulted to the current working directory.
  • This caused unexpected behavior, such as:
    • Long loading times
    • Scanning unintended local files

Now:

  • If both data_dir and data_files are missing, a ValueError is raised early with a helpful message.
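
The check described above could be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper name `check_folder_source`, its parameters, and the message wording are all hypothetical.

```python
from typing import Optional, Sequence, Union

# Loose type for illustration only; the real library accepts richer shapes.
DataFiles = Union[str, Sequence[str], dict]


def check_folder_source(
    builder_name: str,
    data_dir: Optional[str] = None,
    data_files: Optional[DataFiles] = None,
) -> None:
    """Raise early if a folder-based builder has no explicit data source.

    Hypothetical helper: the name and the error message are assumptions.
    """
    # Only the case where BOTH arguments are missing is rejected.
    if data_dir is None and data_files is None:
        raise ValueError(
            f"'{builder_name}' requires either `data_dir` or `data_files`; "
            "refusing to fall back to the current working directory."
        )
```

Passing either argument alone is enough to get past the check; only the case where both are absent is treated as an error.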

How is this PR tested?

  • Manual test via load_dataset("audiofolder") with missing data_dir
  • Existing unit tests (should not break any)
  • New tests (if needed, maintainers can guide)
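
If new tests are wanted, a regression test might look something like the sketch below. It uses only the standard library, and `load_folder_dataset` is a hypothetical stand-in for the loader under test, not the real `load_dataset`; an actual test would call the library and follow the project's own test conventions.

```python
import unittest


def load_folder_dataset(name, data_dir=None, data_files=None):
    """Hypothetical stand-in for the loader under test (not the real load_dataset)."""
    if data_dir is None and data_files is None:
        raise ValueError(f"{name}: pass data_dir or data_files")
    return {"name": name, "data_dir": data_dir, "data_files": data_files}


class MissingSourceTest(unittest.TestCase):
    def test_raises_without_data_dir_or_data_files(self):
        # The bug scenario: no data source given at all.
        with self.assertRaises(ValueError):
            load_folder_dataset("audiofolder")

    def test_accepts_data_dir(self):
        # Supplying data_dir alone must still work.
        result = load_folder_dataset("audiofolder", data_dir="some/folder")
        self.assertEqual(result["data_dir"], "some/folder")
```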

Does this PR require documentation update?

  • No. You can skip the rest of this section.

Release Notes

Is this a user-facing change?

  • Yes. Give a description of this change to be included in the release notes for users.

Adds early error handling for folder-based datasets when neither data_dir nor data_files is specified, avoiding unintended resolution to the current directory.

What component(s), interfaces, languages, and integrations does this PR affect?

Components:

  • area/datasets
  • area/load

How should the PR be classified in the release notes? Choose one:

  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)

fix: raise error if folder-based dataset missing data_dir and data_files
Member

lhoestq commented Jun 16, 2025

Great! Since this logic is specific to one builder class, maybe this check can live in the class definition? I think you can put it in FolderBasedBuilder's _info() method.
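
One way to picture this suggestion is the sketch below, where the validation lives inside the builder's `_info()` method. `SketchFolderConfig` and `SketchFolderBuilder` are hypothetical stand-ins written for illustration; the real FolderBasedBuilder has a different constructor and returns richer metadata, so treat the attribute names here as assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SketchFolderConfig:
    """Hypothetical stand-in for a builder config holding the data source."""

    data_dir: Optional[str] = None
    data_files: Optional[Sequence[str]] = None


class SketchFolderBuilder:
    """Hypothetical stand-in for a folder-based builder; not the real class."""

    def __init__(self, config: SketchFolderConfig):
        self.config = config

    def _info(self) -> dict:
        # Validating here keeps the check local to folder-based builders,
        # so other builder types are unaffected.
        if self.config.data_dir is None and self.config.data_files is None:
            raise ValueError(
                "Folder-based builders need `data_dir` or `data_files`."
            )
        return {"description": "placeholder metadata"}
```

Keeping the check in the class means the generic loading path needs no special case for folder-based builders.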
