refactor FTManager #1397
Conversation
Nice refactoring! Overall looks good. We don't have CI tests for torchft right now (adding #1398), but I don't think that should block this PR.
```diff
- if self.ft_manager.enabled:
-     dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
+ self.ft_manager = FTManager(job_config.fault_tolerance)
+ dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
```
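The unconditional call works if `get_dp_info` itself handles the disabled case. Below is a minimal sketch of that idea, not the actual torchtitan implementation; the `enable`, `group_size`, and `replica_id` field names are assumptions for illustration.

```python
class FTManager:
    def __init__(self, ft_config) -> None:
        # Assumed config fields; the real job_config.fault_tolerance schema may differ.
        self.enabled = ft_config.enable
        self.group_size = ft_config.group_size if self.enabled else 1
        self.replica_id = ft_config.replica_id if self.enabled else 0

    def get_dp_info(self, dp_degree: int, dp_rank: int) -> tuple[int, int]:
        if not self.enabled:
            # Fault tolerance disabled: dp info passes through untouched.
            return dp_degree, dp_rank
        # Each torchft replica group shards the data independently, so the
        # effective dp degree grows by the number of replica groups and the
        # rank is offset by this replica's position.
        return dp_degree * self.group_size, dp_degree * self.replica_id + dp_rank
```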
It looks like `dp_degree` and `dp_rank` are needed for the dataloader. If we move the dataloader initialization after the model initialization, then I think we can also move `maybe_set_all_reduce_hook` here to consolidate all the torchft code together.
Can look into it in a follow-up PR.
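A hypothetical sketch of that follow-up: build the model first, then keep every torchft call together before the dataloader is created. `build_model`, `build_dataloader`, and `maybe_set_all_reduce_hook` as an `FTManager` method are placeholder names here, not the actual torchtitan API.

```python
def init_components(self, job_config, dp_degree: int, dp_rank: int) -> None:
    # Model/parallelism setup happens first so the all-reduce hook can be
    # attached right away.
    model_parts = self.build_model(job_config)

    # All torchft-related setup lives in one place.
    self.ft_manager = FTManager(job_config.fault_tolerance)
    dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
    self.ft_manager.maybe_set_all_reduce_hook(model_parts)

    # The dataloader then receives the (possibly torchft-adjusted) dp info.
    self.dataloader = self.build_dataloader(job_config, dp_degree, dp_rank)
```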
This PR refactors `FTManager` to:
1. simplify construction logic
2. expose a simpler interface to `train.py`
3. make it optional when building the optimizer

and some other minor improvements.
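On point 3, a minimal sketch of what "optional when building the optimizer" could mean, assuming a `build_optimizers`-style factory that takes the torchft manager as an optional argument; the signature and the `OptimizersContainer` / `FTOptimizersContainer` names are illustrative assumptions, not necessarily the real torchtitan API.

```python
from typing import Optional

def build_optimizers(model_parts, optimizer_config, ft_manager: Optional["FTManager"] = None):
    if ft_manager is not None and ft_manager.enabled:
        # Wrap the optimizers so torchft can coordinate their steps across
        # replica groups.
        return FTOptimizersContainer(model_parts, optimizer_config, ft_manager)
    # No (enabled) FTManager: fall back to the plain optimizer container.
    return OptimizersContainer(model_parts, optimizer_config)
```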
depending on pytorch#1384 and pytorch#1397