Incremental parquet writing #1903
Conversation
Co-authored-by: Simon Alibert <[email protected]>
Co-authored-by: Remi Cadene <[email protected]>
Co-authored-by: Remi <[email protected]>
writer = getattr(self, "writer", None)
if writer is not None:
Any opposition to the walrus operator `:=`?
if (writer := getattr(self, "writer", None)):
Not at all. I just think we don't use it anywhere else in the code, and for readability it's a bit better.
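For context, the two styles under discussion behave the same when the walrus result is compared against `None`. A minimal sketch (the `Recorder` class is illustrative, not the PR's actual code):

```python
class Recorder:
    """Illustrative stand-in for the dataset object; not the PR's class."""

    def __init__(self, with_writer: bool):
        if with_writer:
            self.writer = "open-writer"  # placeholder for a parquet writer

    def close_explicit(self) -> str:
        # Current style in the PR: explicit temporary, then the None check.
        writer = getattr(self, "writer", None)
        if writer is not None:
            return "closed"
        return "no-op"

    def close_walrus(self) -> str:
        # Reviewer's suggestion: bind and test in one expression. Note that
        # the suggested `if (writer := ...):` tests truthiness; adding
        # `is not None` keeps the semantics identical to the original.
        if (writer := getattr(self, "writer", None)) is not None:
            return "closed"
        return "no-op"
```

Both forms short-circuit to a no-op when no writer was ever created, which is the case for datasets opened read-only.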
Pull Request Overview
This PR implements incremental parquet writing functionality for LeRobot datasets, improving performance during data collection by avoiding expensive file reloads and memory operations between episodes.
Key changes include:
- Implements PyArrow-based incremental parquet writing with persistent writers
- Adds lazy loading mechanism to defer expensive dataset loading until actually needed for reading
- Introduces finalize() method to properly close parquet writers and complete dataset creation
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
File | Description
---|---
src/lerobot/datasets/lerobot_dataset.py | Core implementation of incremental parquet writing, lazy loading, and writer management
src/lerobot/datasets/video_utils.py | Adds finalize() call to properly close writers in video recording context
tests/datasets/test_datasets.py | Updates test cases to call finalize() method after dataset creation
What this does
Branched from #1894; fixed tests and added lazy loading.
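The lazy-loading part can be sketched with `functools.cached_property`: the expensive reload is deferred until the dataset is actually read, so recording never pays for it. Class and attribute names here are hypothetical, not the PR's API:

```python
from functools import cached_property


class LazyEpisodes:
    """Hypothetical sketch: defer the expensive load until first read."""

    def __init__(self, root: str):
        self.root = root
        self.load_count = 0  # instrumentation for this sketch only

    @cached_property
    def table(self):
        # Runs at most once, on first access; appending episodes with the
        # incremental writer never touches this attribute.
        self.load_count += 1
        return self._expensive_load()

    def _expensive_load(self):
        # Stand-in for rereading parquet files from disk.
        return [{"root": self.root}]
```

With this shape, constructing the dataset and writing episodes stays cheap; the reload cost is paid only when someone reads `table`, and only once.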