
@stas00 (Contributor) commented May 18, 2022

For the full discussion please see: https://huggingface.slack.com/archives/C01NHER1JLS/p1652884593515729
The summary is here https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md#2022-05-18-hanging-at-eval-redux

  1. We again ran into a deadlock at the 2nd validation loss entry.
  2. This time the srun py-spy tracer worked perfectly, so we had all the info needed to analyse the problem.
  3. @DanielHesslow discovered that the deadlock happens in the DataLoader - i.e. the PyTorch bug "Fix possible deadlock in SharedCache inside a forked child proc" pytorch/pytorch#25158 - which was supposed to be fixed, but it isn't: the race condition is still there, and other recent comments on that issue confirm it.
  4. So, based on Daniel's suggestion, we set the valid_dataloaders to use num_workers=0 to work around it (see the sketch after this list). Using num_workers=0 costs about 5 TFLOPs when applied to training, so we kept the training dataloader at num_workers=2.
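
A minimal PyTorch sketch of the workaround, for illustration only (the toy datasets and batch sizes below are made up, and this is not the actual Megatron-DeepSpeed dataloader code): the validation DataLoader is built with num_workers=0, so its batches are loaded in the main process and no forked worker can hit the SharedCache race, while training keeps its worker processes.

```python
# Illustrative sketch: toy datasets stand in for the real ones.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_ds = TensorDataset(torch.randn(1024, 16))
valid_ds = TensorDataset(torch.randn(128, 16))

# training keeps forked workers for throughput
train_loader = DataLoader(train_ds, batch_size=32, num_workers=2)

# workaround: load validation batches in the main process, so the
# forked-worker SharedCache race (pytorch/pytorch#25158) cannot trigger
valid_loader = DataLoader(valid_ds, batch_size=32, num_workers=0)
```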

We are already running on this code, as I hacked it directly into the local branch we were running from (#272).

@TevenLeScao, @thomasw21

@stas00 requested review from a team, thomasw21 and TevenLeScao on May 18, 2022 18:19
@thomasw21 (Member) left a comment


LGTM. I think ideally we'd want to expose that via a command-line argument, but the hotfix makes sense right now. Thanks!

@stas00 (Contributor, Author) commented May 24, 2022

Indeed, I will work on the cmd line arg before merging.

@stas00 (Contributor, Author) commented May 25, 2022

Done.

I will rebase the live auto-sync branch and adjust the slurm script to add --valid-num-workers 0 once we merge this.
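
For reference, a hedged sketch of how the new flag could be wired up, assuming a standard argparse setup; the flag name --valid-num-workers is from this PR, but the parser and the dataloader-building helper below are illustrative, not the actual Megatron-DeepSpeed code:

```python
# Illustrative wiring only; everything except the --valid-num-workers flag
# name is an assumption, not the project's real argument handling.
import argparse
from torch.utils.data import DataLoader

parser = argparse.ArgumentParser()
parser.add_argument('--num-workers', type=int, default=2,
                    help='DataLoader worker processes for training.')
parser.add_argument('--valid-num-workers', type=int, default=2,
                    help='DataLoader worker processes for validation; '
                         'set to 0 to avoid the forked-worker deadlock.')

# e.g. the slurm script would pass: --valid-num-workers 0
args = parser.parse_args(['--valid-num-workers', '0'])

def build_dataloaders(train_ds, valid_ds, batch_size, args):
    # training keeps its workers; validation honours the dedicated flag
    train = DataLoader(train_ds, batch_size=batch_size, num_workers=args.num_workers)
    valid = DataLoader(valid_ds, batch_size=batch_size, num_workers=args.valid_num_workers)
    return train, valid
```

With something like that in place, adding --valid-num-workers 0 to the slurm script applies the workaround only to evaluation while training keeps its workers.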

Thank you!

@stas00 mentioned this pull request May 25, 2022
@stas00 (Contributor, Author) commented May 31, 2022

So I'm merging this, right? Not sure if we are going to hear from @TevenLeScao.

@TevenLeScao (Collaborator):

Sorry, don't really have the bandwidth to check this, go ahead if it's necessary

@thomasw21 (Member):

Looks alright to me! Feel free to merge! Thanks a lot!

@stas00 merged commit cb48bd2 into main on May 31, 2022
@stas00 deleted the dl-deadlock-workaround branch on May 31, 2022 23:38