tianyu-l
Contributor

This PR refactors FTManager to:

  1. simplify construction logic
  2. expose a simpler interface to train.py
  3. make it optional when building optimizer

and some other minor improvements.
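
As a rough illustration of the "simpler interface" idea, here is a minimal sketch in which the manager is always constructed and its methods become no-ops when fault tolerance is disabled, so the caller no longer needs an `if ft_manager.enabled:` guard. This is not torchtitan's actual `FTManager`: the class names `FaultToleranceConfig` and `SketchFTManager`, the fields `enable`, `group_size`, and `replica_id`, and the arithmetic for the enabled case are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class FaultToleranceConfig:
    # Hypothetical stand-in for job_config.fault_tolerance; field names are assumed.
    enable: bool = False
    group_size: int = 1   # assumed: number of replica groups
    replica_id: int = 0   # assumed: index of this replica group


class SketchFTManager:
    """Illustrative only; not the real torchtitan FTManager."""

    def __init__(self, config: FaultToleranceConfig) -> None:
        self.enabled = config.enable
        self.group_size = config.group_size
        self.replica_id = config.replica_id

    def get_dp_info(self, dp_degree: int, dp_rank: int) -> tuple[int, int]:
        # Disabled: hand the inputs back unchanged, so callers need no guard.
        if not self.enabled:
            return dp_degree, dp_rank
        # Enabled (assumed formula): treat every replica group as extra
        # data-parallel ranks so each one reads a distinct data shard.
        return (
            dp_degree * self.group_size,
            dp_degree * self.replica_id + dp_rank,
        )


# Caller side: the same two lines work whether or not fault tolerance is on.
ft_manager = SketchFTManager(FaultToleranceConfig())
dp_degree, dp_rank = ft_manager.get_dp_info(4, 1)
```

Pushing the enabled check inside the method keeps the branching in one place, which is the kind of simplification the call site in train.py benefits from.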

@tianyu-l tianyu-l requested a review from H-Huang July 15, 2025 04:49
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 15, 2025
@tianyu-l tianyu-l mentioned this pull request Jul 15, 2025
Member

@H-Huang H-Huang left a comment

Nice refactoring! Overall looks good. We don't have tests in CI for torchft right now (being added in #1398), but I don't think that should block this PR.

-        if self.ft_manager.enabled:
-            dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
+        self.ft_manager = FTManager(job_config.fault_tolerance)
+        dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
Member

It looks like dp_degree and dp_rank are needed for the dataloader. If we move the dataloader initialization after the model initialization, then I think we can also move maybe_set_all_reduce_hook here to consolidate all the torchft code in one place.

We can look into it in a follow-up PR.

@tianyu-l tianyu-l merged commit 23b8736 into main Jul 15, 2025
7 checks passed
@tianyu-l tianyu-l deleted the refactor branch July 15, 2025 20:52
tianyu-l added a commit that referenced this pull request Jul 23, 2025
idoh pushed a commit to idoh/torchtitan that referenced this pull request Jul 28, 2025
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
joellidin pushed a commit to tplr-ai/torchtitan that referenced this pull request Aug 8, 2025