@Coffeempty Coffeempty commented Oct 27, 2025

Fixes #2312
Hey, I tried solving the issue. It was actually quite easy.

StreamingLeRobotDataset currently assumes parquet files follow the pattern data/*/*.parquet.
However, some valid Hub datasets (e.g., yaak-ai/L2D-v3) do not use this directory structure, causing
load_dataset to fail even though the dataset is fully compatible.

This PR removes that structural assumption and makes dataset loading more flexible.
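
To illustrate why a fixed glob rejects other layouts, here is a minimal sketch. The file listings are hypothetical (the real layout of yaak-ai/L2D-v3 is not shown in this thread), the helper `matching_files` is an illustrative name, and `PurePosixPath.match` only approximates the glob resolution that `load_dataset` performs.

```python
from pathlib import PurePosixPath

# The pattern the PR description says was hard-coded.
HARD_CODED_PATTERN = "data/*/*.parquet"


def matching_files(repo_files, pattern=HARD_CODED_PATTERN):
    """Return the repo-relative paths that match the glob pattern."""
    return [f for f in repo_files if PurePosixPath(f).match(pattern)]


# Hypothetical repo listings, for illustration only.
expected_layout = ["data/chunk-000/file-000.parquet", "meta/info.json"]
other_layout = ["train/part-000.parquet", "README.md"]

print(matching_files(expected_layout))  # the parquet file under data/<dir>/ is found
print(matching_files(other_layout))     # nothing matches, although a parquet file exists
```

With the second listing the pattern yields no files, which is the kind of situation that would surface as a "no data files found" style failure.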


Here is what I did:

  • Adds an optional data_files=None parameter to StreamingLeRobotDataset.__init__().
  • If data_files is not provided, load_dataset() will automatically detect parquet files in the repo.
  • Allows users to pass custom glob patterns when desired.
  • Fully preserves backward compatibility.

Looking forward to review + feedback!

### Summary
This PR resolves a streaming issue where StreamingLeRobotDataset
assumed all datasets stored parquet files under "data/*/*.parquet".
This caused load failures for valid HF datasets with different layouts.

### Changes
- Added new parameter `data_files=None` to StreamingLeRobotDataset.__init__()
- When `data_files=None`, the underlying `load_dataset()` automatically
  detects parquet files in the repo.
- Users can still specify custom directory patterns if needed.
- Maintains full backward compatibility.
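
A rough sketch of the behavior described in this list, not the actual diff: the helper name `build_load_kwargs` and the `split="train"` default are assumptions made for illustration. The idea is simply to forward `data_files` to `datasets.load_dataset` only when the caller supplies it, so that the library's own file detection applies otherwise.

```python
def build_load_kwargs(repo_id, data_files=None, split="train"):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    data_files is included only when the caller passes it; leaving it out
    lets load_dataset look for data files in the repo on its own.
    """
    kwargs = {"path": repo_id, "split": split, "streaming": True}
    if data_files is not None:
        kwargs["data_files"] = data_files
    return kwargs


# Default: no pattern is forced on the repo.
print(build_load_kwargs("yaak-ai/L2D-v3"))

# Opt-in: a caller can still request the previous pattern explicitly.
print(build_load_kwargs("yaak-ai/L2D-v3", data_files="data/*/*.parquet"))

# Assumed usage with the Hugging Face datasets library (needs network access):
#   from datasets import load_dataset
#   ds = load_dataset(**build_load_kwargs("yaak-ai/L2D-v3"))
```

Keeping the pattern optional means existing callers that rely on the `data/*/*.parquet` layout can pass it explicitly and get the same behavior as before.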

### Result
Now this works as expected:
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
dataset = StreamingLeRobotDataset("yaak-ai/L2D-v3")

### Motivation
This allows streaming from any Hub dataset that contains parquet data,
without requiring a specific folder hierarchy.
@Coffeempty Coffeempty changed the title from "Update streaming_dataset.py" to "Enable Automatic Parquet File Detection in StreamingLeRobotDataset (Removes Hard-Coded data/*/*.parquet Assumption) issue #2312" Oct 27, 2025
@sattwik-sahu

Hi, I tried running the streaming dataset feature with this PR, but got the error from #2312 again. I switched to another dataset, and the StreamingLeRobotDataset(...) call ran successfully. However, when I tried to read data from the dataset using

repo_id = "lerobot/metaworld_mt50"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub

print(dataset[0])

I got the following error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[7], line 1
----> 1 dataset[0]

File ~/sattwik/projects/robot-learning/jepavizor/.venv/lib/python3.12/site-packages/torch/utils/data/dataset.py:59, in Dataset.__getitem__(self, index)
     58 def __getitem__(self, index) -> _T_co:
---> 59     raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")

NotImplementedError: Subclasses of Dataset should implement __getitem__.

Could it be that not all datasets on the Hub follow a standard naming/directory convention or API convention, leading to these inconsistencies?
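
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the error is raised by the base `Dataset.__getitem__`, which is what happens when a class supports iteration but does not define index access. The sketch below uses pure-Python stand-ins (`Dataset` and `ToyStreamingDataset` are illustrative, not the torch or lerobot classes) to show the difference between `dataset[0]` and `next(iter(dataset))`.

```python
class Dataset:
    """Stand-in for a base class whose __getitem__ is left unimplemented."""

    def __getitem__(self, index):
        raise NotImplementedError("Subclasses of Dataset should implement __getitem__.")


class ToyStreamingDataset(Dataset):
    """Stand-in for an iterable-style dataset: it defines __iter__ only."""

    def __init__(self, num_items):
        self.num_items = num_items

    def __iter__(self):
        # Yield samples one at a time, as a streaming source would.
        for i in range(self.num_items):
            yield {"index": i}


dataset = ToyStreamingDataset(3)

# Iteration works because __iter__ is defined.
first = next(iter(dataset))
print(first)

# Index access falls through to the base class and raises.
try:
    dataset[0]
except NotImplementedError as err:
    print(f"indexing failed: {err}")
```

If StreamingLeRobotDataset is iterable-style in the same way, reading the first sample with `next(iter(dataset))` instead of `dataset[0]` may be worth trying; that would be independent of the Hub directory layout.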

Development

Successfully merging this pull request may close these issues.

Streaming Dataset not Working - DataFilesNotFoundError

2 participants