Skip to content

Fix double iteration bug when resumed from a checkpoint. #20775

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Aug 5, 2025

Conversation

sudiptob2
Copy link
Contributor

@sudiptob2 sudiptob2 commented Apr 29, 2025

What does this PR do?

This PR fixes the double iter() bug discussed in #19427

Fixes #19427

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20775.org.readthedocs.build/en/20775/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Apr 29, 2025
@sudiptob2 sudiptob2 marked this pull request as ready for review April 29, 2025 18:56
@sudiptob2
Copy link
Contributor Author

@Borda Thanks for the review! Let me know if there's anything you'd like me to change. Otherwise, can we go ahead and merge this?

@sudiptob2 sudiptob2 force-pushed the fix/19427/double-iter branch 2 times, most recently from e451080 to 130102c Compare May 14, 2025 04:56
@sudiptob2
Copy link
Contributor Author

Hey, @Borda just wondering if anything is blocking this PR from merging?

@sudiptob2 sudiptob2 force-pushed the fix/19427/double-iter branch 2 times, most recently from c03f7d0 to 9525a9a Compare June 27, 2025 06:25
@github-actions github-actions bot added docs Documentation related ci Continuous Integration fabric lightning.fabric.Fabric dependencies Pull requests that update a dependency file dockers package has conflicts labels Jun 27, 2025
@sudiptob2 sudiptob2 force-pushed the fix/19427/double-iter branch from 9525a9a to df95a0c Compare June 27, 2025 06:35
@sudiptob2
Copy link
Contributor Author

@Borda, I noticed a few mypy checks are failing, but they don't seem related to the changes in this PR. Should we ignore them for now? Let me know the next steps.

@sudiptob2 sudiptob2 force-pushed the fix/19427/double-iter branch from a9d56fd to 2f9cd44 Compare July 9, 2025 15:38
@sudiptob2
Copy link
Contributor Author

@bhimrazy, thanks for approving the PR. I can see that a few checks are timed out. Any idea why?

@sudiptob2
Copy link
Contributor Author

Just checking in, any hope of moving this forward?

@deependujha
Copy link
Collaborator

Hi @sudiptob2

Sorry for the delay and thanks for your patience.
I’m checking out the PR now and will update you shortly.

@deependujha deependujha requested a review from Copilot August 4, 2025 20:26
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a double iteration bug that occurs when resuming training from a checkpoint with IterableDatasets. The issue was that iter() was being called twice - once during setup and again during epoch start - causing data to be skipped when resuming from checkpoints.

  • Added a new is_resuming property to track when training is being resumed from a checkpoint
  • Modified the iteration logic in TrainingEpochLoop to avoid double iteration when resuming
  • Added comprehensive test coverage for the checkpoint resumption scenario

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/lightning/pytorch/loops/loop.py Added is_resuming property and state tracking for checkpoint resumption
src/lightning/pytorch/loops/training_epoch_loop.py Modified iteration logic to check resumption state before calling iter()
tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py Added test case to verify checkpoint resumption works correctly with IterableDatasets
.github/workflows/ci-tests-pytorch.yml Increased timeout to accommodate additional test execution time
Comments suppressed due to low confidence (2)

tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py:66

  • The function name test_resume_training_with is incomplete and unclear. It should be more descriptive, such as test_resume_training_with_iterable_dataset or test_resume_training_with_checkpoint.
def test_resume_training_with(tmp_path):

tests/tests_pytorch/loops/test_double_iter_in_iterable_dataset.py:69

  • The variable name max_epoch should be max_epochs to be consistent with the parameter name used in the train_model function and PyTorch Lightning conventions.
    max_epoch = 2

@deependujha
Copy link
Collaborator

Hi @sudiptob2,

I’m seeing an odd issue: all tests pass on macOS 14 with Python 3.12 and 2.7 in under 30 minutes, but the job still hangs until the 50‑minute timeout and then fails.

I suspect the issue to be in the test file, possibly related toqueue handling or multiprocessing (spawn/fork).

@deependujha deependujha merged commit 25b1343 into Lightning-AI:master Aug 5, 2025
84 checks passed
@deependujha
Copy link
Collaborator

Thanks, @sudiptob2, for your contribution and for making it even better!

@sudiptob2
Copy link
Contributor Author

@deependujha Thanks to you for the hard work with testing and CI fixes. Really appreciate 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci Continuous Integration dependencies Pull requests that update a dependency file dockers docs Documentation related fabric lightning.fabric.Fabric package pl Generic label for PyTorch Lightning package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

calling iter twice messes up dataloaders with queues
4 participants