
Streaming Dataset not Working - DataFilesNotFoundError #2312

@sattwik-sahu

Description


System Info

- lerobot version: 0.4.0
- Platform: Linux-6.8.0-85-generic-x86_64-with-glibc2.35
- Python version: 3.10.18
- Huggingface Hub version: 0.35.3
- Datasets version: 4.1.1
- Numpy version: 2.2.6
- PyTorch version: 2.7.1+cu126
- Is PyTorch built with CUDA support?: True
- Cuda version: 12.6
- GPU model: NVIDIA RTX A4500

Information

  • One of the scripts in the examples/ folder of LeRobot
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. uv init --python 3.10 --package hello-lerobot
  2. cd hello-lerobot && uv sync
  3. source .venv/bin/activate
  4. uv add lerobot
  5. Open a Python file or notebook and paste in the StreamingLeRobotDataset example given here:
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub
  6. Run the file/cell
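
The same failure can be reproduced without LeRobot by issuing the internal load_dataset call from the traceback directly (a minimal sketch of that call, using the same arguments streaming_dataset.py passes):

from datasets import load_dataset

# Mirrors the call at lerobot/datasets/streaming_dataset.py:152 (see the
# traceback below); it should fail with the same DataFilesNotFoundError.
dataset = load_dataset(
    "yaak-ai/L2D-v3",
    split="train",
    streaming=True,
    data_files="data/*/*.parquet",
)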

Expected behavior

The dataset should stream directly from the Hub. Instead, the dataset object creation line raises a DataFilesNotFoundError, as shown below:

---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[7], line 4
      1 from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset
      3 repo_id = "yaak-ai/L2D-v3"
----> 4 dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/lerobot/datasets/streaming_dataset.py:152, in StreamingLeRobotDataset.__init__(self, repo_id, root, episodes, image_transforms, delta_timestamps, tolerance_s, revision, force_cache_sync, streaming, buffer_size, max_num_shards, seed, rng, shuffle)
    149     self.delta_timestamps = delta_timestamps
    150     self.delta_indices = get_delta_indices(self.delta_timestamps, self.fps)
--> 152 self.hf_dataset: datasets.IterableDataset = load_dataset(
    153     self.repo_id if not self.streaming_from_local else str(self.root),
    154     split="train",
    155     streaming=self.streaming,
    156     data_files="data/*/*.parquet",
    157     revision=self.revision,
    158 )
    160 self.num_shards = min(self.hf_dataset.num_shards, max_num_shards)

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:1392, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, **config_kwargs)
   1387 verification_mode = VerificationMode(
   1388     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   1389 )
   1391 # Create a dataset builder
-> 1392 builder_instance = load_dataset_builder(
   1393     path=path,
   1394     name=name,
   1395     data_dir=data_dir,
   1396     data_files=data_files,
   1397     cache_dir=cache_dir,
   1398     features=features,
   1399     download_config=download_config,
   1400     download_mode=download_mode,
   1401     revision=revision,
   1402     token=token,
   1403     storage_options=storage_options,
   1404     **config_kwargs,
   1405 )
   1407 # Return iterable dataset in case of streaming
   1408 if streaming:

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:1132, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, storage_options, **config_kwargs)
   1130 if features is not None:
   1131     features = _fix_for_backward_compatible_features(features)
-> 1132 dataset_module = dataset_module_factory(
   1133     path,
   1134     revision=revision,
   1135     download_config=download_config,
   1136     download_mode=download_mode,
   1137     data_dir=data_dir,
   1138     data_files=data_files,
   1139     cache_dir=cache_dir,
   1140 )
   1141 # Get dataset builder class
   1142 builder_kwargs = dataset_module.builder_kwargs

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:1025, in dataset_module_factory(path, revision, download_config, download_mode, data_dir, data_files, cache_dir, **download_kwargs)
   1023     raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
   1024 if isinstance(e1, (DataFilesNotFoundError, DatasetNotFoundError, EmptyDatasetError)):
-> 1025     raise e1 from None
   1026 if isinstance(e1, FileNotFoundError):
   1027     raise FileNotFoundError(
   1028         f"Couldn't find any data file at {relative_to_absolute_path(path)}. "
   1029         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1030     ) from None

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:1004, in dataset_module_factory(path, revision, download_config, download_mode, data_dir, data_files, cache_dir, **download_kwargs)
    994     else:
    995         use_exported_dataset_infos = True
    996     return HubDatasetModuleFactory(
    997         path,
    998         commit_hash=commit_hash,
    999         data_dir=data_dir,
   1000         data_files=data_files,
   1001         download_config=download_config,
   1002         download_mode=download_mode,
   1003         use_exported_dataset_infos=use_exported_dataset_infos,
-> 1004     ).get_module()
   1005 except GatedRepoError as e:
   1006     message = f"Dataset '{path}' is a gated dataset on the Hub."

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:638, in HubDatasetModuleFactory.get_module(self)
    631     patterns = get_data_patterns(base_path, download_config=self.download_config)
    632 data_files = DataFilesDict.from_patterns(
    633     patterns,
    634     base_path=base_path,
    635     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
    636     download_config=self.download_config,
    637 )
--> 638 module_name, default_builder_kwargs = infer_module_for_data_files(
    639     data_files=data_files,
    640     path=self.name,
    641     download_config=self.download_config,
    642 )
    643 data_files = data_files.filter(
    644     extensions=_MODULE_TO_EXTENSIONS[module_name], file_names=_MODULE_TO_METADATA_FILE_NAMES[module_name]
    645 )
    646 module_path, _ = _PACKAGED_DATASETS_MODULES[module_name]

File /mnt/toshiba_hdd/simulations/hello-lerobot/.venv/lib/python3.10/site-packages/datasets/load.py:300, in infer_module_for_data_files(data_files, path, download_config)
    298     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    299 if not module_name:
--> 300     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    301 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in yaak-ai/L2D-v3
  • At first it seems the program expects to find files on my local system, but the error is actually raised while resolving files on the Hub: the hard-coded data_files="data/*/*.parquet" pattern (streaming_dataset.py:152) matches nothing in yaak-ai/L2D-v3 (see the diagnostic sketch below)
  • I am relying on the dataset streaming feature precisely because I cannot download large datasets to my machine
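
To check what the Hub repo actually exposes, the repo's file list can be compared against that glob (a diagnostic sketch; HfApi.list_repo_files is the standard huggingface_hub call, everything else is illustrative):

from huggingface_hub import HfApi

# List every file in the dataset repo, then keep only the parquet paths so
# they can be compared against the hard-coded "data/*/*.parquet" pattern.
files = HfApi().list_repo_files("yaak-ai/L2D-v3", repo_type="dataset")
print([f for f in files if f.endswith(".parquet")])

If none of the printed paths match data/*/*.parquet, the fixed pattern passed at streaming_dataset.py:152 would explain the DataFilesNotFoundError.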


Labels

bug (Something isn't working correctly) · dataset (Issues regarding data inputs, processing, or datasets) · python (Pull requests that update python code)
