Fix double iteration bug when resumed from a checkpoint. #20775
@Borda Thanks for the review! Let me know if there's anything you'd like me to change. Otherwise, can we go ahead and merge this?
Hey @Borda, just wondering if anything is blocking this PR from merging?
@Borda, I noticed a few mypy checks are failing, but they don't seem related to the changes in this PR. Should we ignore them for now? Let me know the next steps.
Signed-off-by: sudipto baral <[email protected]>
@bhimrazy, thanks for approving the PR. I can see that a few checks have timed out. Any idea why?
Just checking in: any hope of moving this forward?
Hi @sudiptob2, sorry for the delay, and thanks for your patience.
Pull Request Overview
This PR fixes a double iteration bug that occurs when resuming training from a checkpoint with IterableDatasets. The issue was that `iter()` was being called twice, once during setup and again during epoch start, causing data to be skipped when resuming from checkpoints.
- Added a new `is_resuming` property to track when training is being resumed from a checkpoint
- Modified the iteration logic in `TrainingEpochLoop` to avoid double iteration when resuming
- Added comprehensive test coverage for the checkpoint resumption scenario
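To make the failure mode concrete, here is a minimal pure-Python sketch of how a second `iter()` call can drop data, and how a resumption guard avoids it. It is not Lightning's code: `PrefetchingFetcher`, `drain`, `run_epoch`, and the `guard_enabled` flag are hypothetical names, and the one-item prefetch on `__iter__` is an assumed side effect chosen only to illustrate the problem.

```python
_END = object()  # sentinel marking an exhausted stream


class PrefetchingFetcher:
    """Hypothetical fetcher whose __iter__ has a side effect (prefetching one item)."""

    def __init__(self, source):
        self._source = iter(source)  # shared underlying stream
        self._prefetched = _END
        self.iter_calls = 0

    def __iter__(self):
        self.iter_calls += 1
        # Assumed side effect: pull one item ahead. A second call overwrites
        # the previously prefetched item, so that item is silently lost.
        self._prefetched = next(self._source, _END)
        return self

    def __next__(self):
        if self._prefetched is _END:
            raise StopIteration
        item = self._prefetched
        self._prefetched = next(self._source, _END)
        return item


def drain(fetcher):
    """Collect remaining items via next() only, so no extra iter() call is made."""
    items = []
    while True:
        try:
            items.append(next(fetcher))
        except StopIteration:
            return items


def run_epoch(fetcher, is_resuming, guard_enabled):
    if is_resuming:
        iter(fetcher)  # assumed: the resume/setup path already created the iterator
    if not (guard_enabled and is_resuming):
        iter(fetcher)  # epoch-start call, skipped by the guard when resuming
    return drain(fetcher)


fresh = run_epoch(PrefetchingFetcher(range(5)), is_resuming=False, guard_enabled=False)
buggy = run_epoch(PrefetchingFetcher(range(5)), is_resuming=True, guard_enabled=False)
fixed = run_epoch(PrefetchingFetcher(range(5)), is_resuming=True, guard_enabled=True)
print(fresh, buggy, fixed)
```

In this toy model the unguarded resume loses the first item, while the guarded resume yields the same items as a fresh run.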
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/lightning/pytorch/loops/loop.py | Added is_resuming property and state tracking for checkpoint resumption |
| src/lightning/pytorch/loops/training_epoch_loop.py | Modified iteration logic to check resumption state before calling iter() |
| tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py | Added test case to verify checkpoint resumption works correctly with IterableDatasets |
| .github/workflows/ci-tests-pytorch.yml | Increased timeout to accommodate additional test execution time |
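As a rough illustration of the "state tracking" described for `loop.py`, the sketch below shows one way a loop could expose an `is_resuming` flag. Only the property name `is_resuming` comes from this PR; the class `LoopSketch`, the private attribute, and the `on_iteration_done` hook are assumptions made for the example and may not match the actual implementation.

```python
class LoopSketch:
    """Hypothetical loop base class; not the real Lightning loop."""

    def __init__(self):
        self.state = {}
        # Assumed private flag: set when state is restored, cleared after use.
        self._resuming_from_checkpoint = False

    @property
    def is_resuming(self) -> bool:
        """True while the loop is continuing from restored checkpoint state."""
        return self._resuming_from_checkpoint

    def load_state_dict(self, state_dict):
        # Restoring state marks the loop as resuming.
        self.state = dict(state_dict)
        self._resuming_from_checkpoint = True

    def on_iteration_done(self):
        # Once the first resumed iteration has run, behave like a normal run again.
        self._resuming_from_checkpoint = False


loop = LoopSketch()
before = loop.is_resuming
loop.load_state_dict({"batches_seen": 3})
during = loop.is_resuming
loop.on_iteration_done()
after = loop.is_resuming
print(before, during, after)
```

A caller such as the epoch loop could then check `loop.is_resuming` before deciding whether to call `iter()` again.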
Comments suppressed due to low confidence (2)
tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py:66
- The function name `test_resume_training_with` is incomplete and unclear. It should be more descriptive, such as `test_resume_training_with_iterable_dataset` or `test_resume_training_with_checkpoint`.
def test_resume_training_with(tmp_path):
tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py:69
- The variable name `max_epoch` should be `max_epochs` to be consistent with the parameter name used in the `train_model` function and PyTorch Lightning conventions.
max_epoch = 2
Co-authored-by: Copilot <[email protected]>
Hi @sudiptob2, I’m seeing an odd issue: all tests pass on macOS 14 with Python 3.12 and 2.7 in under 30 minutes, but the job still hangs until the 50-minute timeout and then fails. I suspect the issue is in the test file, possibly related to …
Thanks, @sudiptob2, for your contribution and for making it even better!
@deependujha Thank you for the hard work on testing and the CI fixes. Really appreciate it 🚀
What does this PR do?
This PR fixes the double `iter()` bug discussed in #19427.
Fixes #19427
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20775.org.readthedocs.build/en/20775/