
Commit c4c1948

jadechoghari, pre-commit-ci[bot], michel-aractingi, and fracapuano authored and committed
docs(dataset): add dataset v3 documentation (huggingface#1956)
* add v3 doc
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* fix
* update changes
* iterate on review
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* add changes
* create dataset section
* Update docs/source/lerobot-dataset-v3.mdx (Signed-off-by: Francesco Capuano <[email protected]>)
* Update docs/source/lerobot-dataset-v3.mdx (Signed-off-by: Francesco Capuano <[email protected]>)
* Update docs/source/lerobot-dataset-v3.mdx (Signed-off-by: Francesco Capuano <[email protected]>)

Signed-off-by: Francesco Capuano <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michel Aractingi <[email protected]>
Co-authored-by: Francesco Capuano <[email protected]>
1 parent 3ff9e6c commit c4c1948

File tree

2 files changed

+174
-1
lines changed


docs/source/_toctree.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -19,9 +19,13 @@
     title: Train RL in Simulation
   - local: async
     title: Use Async Inference
+  title: "Tutorials"
+- sections:
+  - local: lerobot-dataset-v3
+    title: Using LeRobotDataset
   - local: porting_datasets_v3
     title: Porting Large Datasets
-  title: "Tutorials"
+  title: "Datasets"
 - sections:
   - local: smolvla
     title: Finetune SmolVLA
```

docs/source/lerobot-dataset-v3.mdx

Lines changed: 169 additions & 0 deletions
# LeRobotDataset v3.0

`LeRobotDataset v3.0` is a standardized format for robot learning data. It provides unified access to multi-modal time-series data, sensorimotor signals and multi-camera video, as well as rich metadata for indexing, search, and visualization on the Hugging Face Hub.

This guide shows you how to:

- Understand the v3.0 design and directory layout
- Record a dataset and push it to the Hub
- Load datasets for training with `LeRobotDataset`
- Stream datasets without downloading using `StreamingLeRobotDataset`
- Migrate existing `v2.1` datasets to `v3.0`

## What’s new in `v3`

- **File-based storage**: Many episodes per Parquet/MP4 file (v2 used one file per episode).
- **Relational metadata**: Episode boundaries and lookups are resolved through metadata, not filenames.
- **Hub-native streaming**: Consume datasets directly from the Hub with `StreamingLeRobotDataset`.
- **Lower file-system pressure**: Fewer, larger files mean faster initialization and fewer issues at scale.
- **Unified organization**: Clean directory layout with consistent path templates across data and videos.

## Installation

`LeRobotDataset v3.0` will be included in `lerobot >= 0.4.0`.

Until that stable release, you can use the `main` branch by following the [build from source instructions](./installation#from-source).
## Record a dataset

Run the command below to record a dataset with the SO-101 and push it to the Hub:

```bash
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube"
```

See the [recording guide](./il_robots#record-a-dataset) for more details.

## Format design

A core v3 principle is **decoupling storage from the user API**: data is stored efficiently (few large files), while the public API exposes intuitive episode-level access.

`v3` has three pillars:

1. **Tabular data**: Low-dimensional, high-frequency signals (states, actions, timestamps) stored in **Apache Parquet**. Access is memory-mapped or streamed via the `datasets` stack.
2. **Visual data**: Camera frames concatenated and encoded into **MP4**. Frames from the same episode are grouped; videos are sharded per camera to keep file sizes practical.
3. **Metadata**: JSON/Parquet records describing the schema (feature names, dtypes, shapes), frame rates, normalization stats, and **episode segmentation** (start/end offsets into shared Parquet/MP4 files).

> To scale to millions of episodes, tabular rows and video frames from multiple episodes are **concatenated** into larger files. Episode-specific views are reconstructed **via metadata**, not file boundaries.
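To make the metadata-based reconstruction concrete, here is a minimal pure-Python sketch. The field names (`from_index`, `to_index`) and data shapes are illustrative stand-ins, not the exact on-disk schema:

```python
# Toy stand-in: one flat "shard" of frames, plus per-episode metadata holding
# start/end offsets into that shard (field names are illustrative).
frames = [{"frame": i} for i in range(10)]  # one shard holding 3 episodes
episodes_meta = [
    {"episode_index": 0, "from_index": 0, "to_index": 4},
    {"episode_index": 1, "from_index": 4, "to_index": 7},
    {"episode_index": 2, "from_index": 7, "to_index": 10},
]

def get_episode(episode_index: int) -> list[dict]:
    """Reconstruct an episode-level view from shared storage using metadata."""
    meta = episodes_meta[episode_index]
    return frames[meta["from_index"] : meta["to_index"]]

print(len(get_episode(1)))  # 3 frames, all read from the same shard
```

The same idea applies to video: an episode's frames are a time slice of a shared MP4 file, located via metadata rather than a per-episode filename.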
<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/asset1datasetv3.png"
      alt="LeRobotDataset v3 diagram"
      width="220"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      From episode-based to file-based datasets
    </figcaption>
  </figure>
</div>

### Directory layout (simplified)

- **`meta/info.json`**: canonical schema (features, shapes/dtypes), FPS, codebase version, and **path templates** to locate data/video shards.
- **`meta/stats.json`**: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- **`meta/tasks.jsonl`**: natural-language task descriptions mapped to integer IDs for task-conditioned policies.
- **`meta/episodes/`**: per-episode records (lengths, tasks, offsets) stored as **chunked Parquet** for scalability.
- **`data/`**: frame-by-frame **Parquet** shards; each file typically contains **many episodes**.
- **`videos/`**: **MP4** shards per camera; each file typically contains **many episodes**.
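Putting the pieces together, a v3 repository looks roughly like the sketch below. Chunk and file names are illustrative; the actual path templates are defined in `meta/info.json`:

```text
my-dataset/
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── tasks.jsonl
│   └── episodes/
│       └── chunk-000/
│           └── file-000.parquet
├── data/
│   └── chunk-000/
│       ├── file-000.parquet      # many episodes per file
│       └── file-001.parquet
└── videos/
    └── front/                    # one subtree per camera
        └── chunk-000/
            ├── file-000.mp4      # many episodes per file
            └── file-001.mp4
```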
## Load a dataset for training

`LeRobotDataset` returns Python dictionaries of PyTorch tensors and integrates with `torch.utils.data.DataLoader`. Here is an example of its use:

```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"

# 1) Load from the Hub (cached locally)
dataset = LeRobotDataset(repo_id)

# 2) Random access by index
sample = dataset[100]
print(sample)
# {
#     'observation.state': tensor([...]),
#     'action': tensor([...]),
#     'observation.images.front_left': tensor([C, H, W]),
#     'timestamp': tensor(1.234),
#     ...
# }

# 3) Temporal windows via delta_timestamps (seconds relative to t)
delta_timestamps = {
    "observation.images.front_left": [-0.2, -0.1, 0.0]  # 0.2 s before, 0.1 s before, and the current frame
}

dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)

# Accessing an index now returns a stack for the specified key(s)
sample = dataset[100]
print(sample["observation.images.front_left"].shape)  # [T, C, H, W], where T=3

# 4) Wrap with a DataLoader for training
batch_size = 16
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in data_loader:
    observations = batch["observation.state"].to(device)
    actions = batch["action"].to(device)
    images = batch["observation.images.front_left"].to(device)
    # model.forward(batch)
```
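The statistics in `dataset.meta.stats` are typically used to normalize features before training. Here is a minimal sketch of that computation in plain Python; the values are made up, and real stats are per-feature tensors computed over the whole dataset:

```python
# Illustrative stats in the spirit of dataset.meta.stats (made-up numbers).
stats = {"observation.state": {"mean": [0.5, -0.5], "std": [0.25, 0.5]}}

def normalize(key: str, values: list[float]) -> list[float]:
    """Standardize a feature vector using per-dimension mean and std."""
    mean, std = stats[key]["mean"], stats[key]["std"]
    return [(v - m) / s for v, m, s in zip(values, mean, std)]

print(normalize("observation.state", [1.0, 0.5]))  # [2.0, 2.0]
```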

## Stream a dataset (no downloads)

Use `StreamingLeRobotDataset` to iterate over a dataset directly from the Hub, without local copies. This lets you work with large datasets without downloading them to disk or loading them into memory, and is a key feature of the new dataset format.

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub
```
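Because a streaming dataset is consumed as an iterator rather than indexed, a common pattern is to pull only a bounded number of samples, for instance as a quick sanity check. This is a generic sketch that works for any iterable, not an API specific to `StreamingLeRobotDataset`:

```python
from itertools import islice

def take(stream, n):
    """Materialize only the first n items of an iterable, never the whole stream."""
    return list(islice(stream, n))

# Works the same way for a streaming dataset or any other iterable:
print(take(iter(range(1_000_000)), 3))  # [0, 1, 2]
```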

<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/streaming-lerobot.png"
      alt="StreamingLeRobotDataset"
      width="520"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      Stream directly from the Hub for on-the-fly training.
    </figcaption>
  </figure>
</div>

## Migrate `v2.1` → `v3.0`

A converter script aggregates per-episode files into larger shards and writes episode offsets/metadata. Convert your dataset using the instructions below.

```bash
# Pre-release build with v3 support:
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"

# Convert an existing v2.1 dataset hosted on the Hub:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<HF_USER/DATASET_ID>
```

**What it does**

- Aggregates Parquet files: `episode-0000.parquet`, `episode-0001.parquet`, … → **`file-0000.parquet`**, …
- Aggregates MP4 files: `episode-0000.mp4`, `episode-0001.mp4`, … → **`file-0000.mp4`**, …
- Updates `meta/episodes/*` (chunked Parquet) with per-episode lengths, tasks, and byte/frame offsets.
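Conceptually, the aggregation step works like this pure-Python sketch over toy per-episode row lists. The real converter shards Parquet/MP4 files, respects size limits, and records byte offsets, all of which are omitted here:

```python
# Toy per-episode tables, as in v2.1 (one file per episode).
episode_files = {
    "episode-0000": [0, 1, 2],
    "episode-0001": [3, 4],
    "episode-0002": [5, 6, 7, 8],
}

def aggregate(episodes: dict[str, list]) -> tuple[list, list[dict]]:
    """Concatenate episodes into one shard and record per-episode offsets."""
    shard, offsets = [], []
    for name, rows in episodes.items():
        offsets.append({"episode": name, "from": len(shard), "to": len(shard) + len(rows)})
        shard.extend(rows)
    return shard, offsets

shard, offsets = aggregate(episode_files)
print(offsets[1])  # {'episode': 'episode-0001', 'from': 3, 'to': 5}
```

After conversion, the per-episode files are gone; readers locate an episode inside a shard purely through these recorded offsets.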
