# LeRobotDataset v3.0

`LeRobotDataset v3.0` is a standardized format for robot learning data. It provides unified access to multi-modal time-series data, sensorimotor signals and multi‑camera video, as well as rich metadata for indexing, search, and visualization on the Hugging Face Hub.

This guide shows you how to:

- Understand the v3.0 design and directory layout
- Record a dataset and push it to the Hub
- Load datasets for training with `LeRobotDataset`
- Stream datasets without downloading using `StreamingLeRobotDataset`
- Migrate existing `v2.1` datasets to `v3.0`

## What’s new in `v3`

- **File-based storage**: Many episodes per Parquet/MP4 file (v2 used one file per episode).
- **Relational metadata**: Episode boundaries and lookups are resolved through metadata, not filenames.
- **Hub-native streaming**: Consume datasets directly from the Hub with `StreamingLeRobotDataset`.
- **Lower file-system pressure**: Fewer, larger files ⇒ faster initialization and fewer issues at scale.
- **Unified organization**: Clean directory layout with consistent path templates across data and videos.

## Installation

`LeRobotDataset v3.0` will be included in `lerobot >= 0.4.0`.

Until that stable release, you can use the main branch by following the [build from source instructions](./installation#from-source).

## Record a dataset

Run the command below to record a dataset with the SO-101 and push it to the Hub:

```bash
lerobot-record \
    --robot.type=so101_follower \
    --robot.port=/dev/tty.usbmodem585A0076841 \
    --robot.id=my_awesome_follower_arm \
    --robot.cameras="{ front: {type: opencv, index_or_path: 0, width: 1920, height: 1080, fps: 30}}" \
    --teleop.type=so101_leader \
    --teleop.port=/dev/tty.usbmodem58760431551 \
    --teleop.id=my_awesome_leader_arm \
    --display_data=true \
    --dataset.repo_id=${HF_USER}/record-test \
    --dataset.num_episodes=5 \
    --dataset.single_task="Grab the black cube"
```

See the [recording guide](./il_robots#record-a-dataset) for more details.

## Format design

A core v3 principle is **decoupling storage from the user API**: data is stored efficiently (few large files), while the public API exposes intuitive episode-level access.

`v3` has three pillars:

1. **Tabular data**: Low‑dimensional, high‑frequency signals (states, actions, timestamps) stored in **Apache Parquet**. Access is memory‑mapped or streamed via the `datasets` stack.
2. **Visual data**: Camera frames concatenated and encoded into **MP4**. Frames from the same episode are grouped; videos are sharded per camera for practical sizes.
3. **Metadata**: JSON/Parquet records describing schema (feature names, dtypes, shapes), frame rates, normalization stats, and **episode segmentation** (start/end offsets into shared Parquet/MP4 files).

> To scale to millions of episodes, tabular rows and video frames from multiple episodes are **concatenated** into larger files. Episode‑specific views are reconstructed **via metadata**, not file boundaries.

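To make this concrete, here is a minimal sketch of how an episode-level view could be recovered from shared shards with plain pandas. It is illustrative only, not the library's internal code: the episode-record columns (`data/chunk_index`, `data/file_index`) and shard paths are assumptions modeled on the layout described below, and `LeRobotDataset` performs this resolution for you.

```python
import pandas as pd

# Illustrative sketch only -- LeRobotDataset resolves episodes for you.
# Episode records are stored as chunked Parquet under meta/episodes/.
episodes = pd.read_parquet("meta/episodes/chunk-000/file-000.parquet")
ep = episodes[episodes["episode_index"] == 42].iloc[0]

# Metadata points at the shared shard that holds this episode's rows
# (the column names here are assumptions for illustration).
shard = f"data/chunk-{ep['data/chunk_index']:03d}/file-{ep['data/file_index']:03d}.parquet"

# The episode's frames are a contiguous slice of that shared file.
frames = pd.read_parquet(shard)
episode_view = frames[frames["episode_index"] == 42]
```
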
<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/asset1datasetv3.png"
      alt="LeRobotDataset v3 diagram"
      width="220"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      From episode‑based to file‑based datasets
    </figcaption>
  </figure>
</div>

### Directory layout (simplified)

- **`meta/info.json`**: canonical schema (features, shapes/dtypes), FPS, codebase version, and **path templates** to locate data/video shards.
- **`meta/stats.json`**: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- **`meta/tasks.jsonl`**: natural‑language task descriptions mapped to integer IDs for task‑conditioned policies.
- **`meta/episodes/`**: per‑episode records (lengths, tasks, offsets) stored as **chunked Parquet** for scalability.
- **`data/`**: frame‑by‑frame **Parquet** shards; each file typically contains **many episodes**.
- **`videos/`**: **MP4** shards per camera; each file typically contains **many episodes**.
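
Putting these together, a `v3.0` dataset typically looks like the tree below. The chunk/file indices and camera key are illustrative; actual shard paths come from the path templates in `meta/info.json`.

```
my_dataset/
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── tasks.jsonl
│   └── episodes/
│       └── chunk-000/
│           └── file-000.parquet
├── data/
│   └── chunk-000/
│       ├── file-000.parquet   # many episodes per file
│       └── file-001.parquet
└── videos/
    └── observation.images.front/
        └── chunk-000/
            └── file-000.mp4   # many episodes per file
```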

## Load a dataset for training

`LeRobotDataset` returns Python dictionaries of PyTorch tensors and integrates with `torch.utils.data.DataLoader`. Here is a code example showing its use:

```python
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

repo_id = "yaak-ai/L2D-v3"

# 1) Load from the Hub (cached locally)
dataset = LeRobotDataset(repo_id)

# 2) Random access by index
sample = dataset[100]
print(sample)
# {
#     'observation.state': tensor([...]),
#     'action': tensor([...]),
#     'observation.images.front_left': tensor([C, H, W]),
#     'timestamp': tensor(1.234),
#     ...
# }

# 3) Temporal windows via delta_timestamps (seconds relative to t)
delta_timestamps = {
    "observation.images.front_left": [-0.2, -0.1, 0.0]  # 0.2 s before, 0.1 s before, and the current frame
}

dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)

# Accessing an index now returns a stack for the specified key(s)
sample = dataset[100]
print(sample["observation.images.front_left"].shape)  # [T, C, H, W], where T=3

# 4) Wrap with a DataLoader for training
batch_size = 16
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in data_loader:
    observations = batch["observation.state"].to(device)
    actions = batch["action"].to(device)
    images = batch["observation.images.front_left"].to(device)
    # model.forward(batch)
```

## Stream a dataset (no downloads)

Use `StreamingLeRobotDataset` to iterate over a dataset directly from the Hub, without local copies. This lets you stream large datasets without downloading them to disk or loading them into memory, and is a key feature of the new dataset format.

```python
from lerobot.datasets.streaming_dataset import StreamingLeRobotDataset

repo_id = "yaak-ai/L2D-v3"
dataset = StreamingLeRobotDataset(repo_id)  # streams directly from the Hub
```
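
Because a streaming dataset is consumed iteratively rather than by random index, a minimal consumption loop might look like the sketch below. It continues from the snippet above and assumes the stream yields the same per-frame dictionaries of tensors as `LeRobotDataset`.

```python
# Minimal sketch: pull a few frames as they stream from the Hub.
# Assumes each yielded frame is a dict of tensors, as with LeRobotDataset.
for i, frame in enumerate(dataset):
    state = frame["observation.state"]
    action = frame["action"]
    print(i, state.shape, action.shape)
    if i == 4:  # stop early for this example
        break
```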

<div style="display:flex; justify-content:center; gap:12px; flex-wrap:wrap;">
  <figure style="margin:0; text-align:center;">
    <img
      src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/lerobotdataset-v3/streaming-lerobot.png"
      alt="StreamingLeRobotDataset"
      width="520"
    />
    <figcaption style="font-size:0.9em; color:#666;">
      Stream directly from the Hub for on‑the‑fly training.
    </figcaption>
  </figure>
</div>

## Migrate `v2.1` → `v3.0`

A converter aggregates per‑episode files into larger shards and writes episode offsets/metadata. Convert your dataset using the instructions below.

```bash
# Pre-release build with v3 support:
pip install "https://github.com/huggingface/lerobot/archive/33cad37054c2b594ceba57463e8f11ee374fa93c.zip"

# Convert an existing v2.1 dataset hosted on the Hub:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=<HF_USER/DATASET_ID>
```

**What it does**

- Aggregates Parquet files: `episode-0000.parquet`, `episode-0001.parquet`, … → **`file-0000.parquet`**, …
- Aggregates MP4 files: `episode-0000.mp4`, `episode-0001.mp4`, … → **`file-0000.mp4`**, …
- Updates `meta/episodes/*` (chunked Parquet) with per‑episode lengths, tasks, and byte/frame offsets.
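
Schematically, the conversion reshapes the tree roughly as follows; the file and camera-key names are illustrative, following the patterns above.

```
# Before (v2.1): one file per episode
data/chunk-000/episode-0000.parquet
videos/chunk-000/observation.images.front/episode-0000.mp4

# After (v3.0): many episodes per file, resolved via meta/episodes/*
data/chunk-000/file-0000.parquet
videos/observation.images.front/chunk-000/file-0000.mp4
```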