refactor FTManager #1397

tianyu-l · 2025-07-15T04:49:11Z

This PR refactors FTManager to:

1. simplify construction logic
2. expose a simpler interface to train.py
3. make it optional when building the optimizer

and some other minor improvements.
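
As a rough illustration of the third goal (an optional manager when building the optimizer), here is a minimal sketch. The names `ToyFTManager`, `build_optimizer_sketch`, and `make_optimizer` are hypothetical and are not taken from torchtitan; the string return values only stand in for real optimizer objects.

```python
from typing import Callable, Optional


class ToyFTManager:
    """Hypothetical stand-in for a fault-tolerance manager."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled


def build_optimizer_sketch(
    make_optimizer: Callable[[], str],
    ft_manager: Optional[ToyFTManager] = None,
) -> str:
    """Build an optimizer; wrap it only if a manager is given and enabled."""
    optimizer = make_optimizer()
    # Callers that do not use fault tolerance simply omit ft_manager.
    if ft_manager is not None and ft_manager.enabled:
        # Assumed behavior: the fault-tolerant path wraps the optimizer.
        return f"ft({optimizer})"
    return optimizer
```

With this shape, code paths that never use fault tolerance do not need to construct or pass a manager at all.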

H-Huang

Nice refactoring! Overall this looks good. We don't have CI tests for torchft right now (adding them in #1398), but I don't think that should block this PR.

H-Huang · 2025-07-15T19:05:33Z

torchtitan/train.py

-        if self.ft_manager.enabled:
-            dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)
+        self.ft_manager = FTManager(job_config.fault_tolerance)
+        dp_degree, dp_rank = self.ft_manager.get_dp_info(dp_degree, dp_rank)


It looks like dp_degree and dp_rank are needed for the dataloader. If we move the dataloader initialization after the model initialization, then I think we can also move maybe_set_all_reduce_hook here to consolidate all the torchft code in one place.

We can look into it in a follow-up PR.
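
The diff above drops the `if self.ft_manager.enabled` guard in train.py, which implies the manager handles the disabled case itself. A minimal sketch of that pattern follows. `FaultToleranceConfig`, `SketchFTManager`, the config fields, and the arithmetic in the enabled branch are all assumptions for illustration, not the torchtitan implementation.

```python
from dataclasses import dataclass


@dataclass
class FaultToleranceConfig:
    # Hypothetical config fields, for illustration only.
    enable: bool = False
    group_size: int = 1
    replica_id: int = 0


class SketchFTManager:
    """Illustrative stand-in; not the torchtitan FTManager."""

    def __init__(self, config: FaultToleranceConfig) -> None:
        self.config = config
        self.enabled = config.enable

    def get_dp_info(self, dp_degree: int, dp_rank: int) -> tuple[int, int]:
        # When fault tolerance is off, pass the values through unchanged,
        # so the caller needs no `if manager.enabled` branch.
        if not self.enabled:
            return dp_degree, dp_rank
        # Assumed mapping: each replica group contributes dp_degree ranks.
        return (
            dp_degree * self.config.group_size,
            dp_degree * self.config.replica_id + dp_rank,
        )
```

The caller can then always write `dp_degree, dp_rank = manager.get_dp_info(dp_degree, dp_rank)` before building the dataloader, whether or not fault tolerance is enabled.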

tianyu-l requested a review from H-Huang July 15, 2025 04:49

tianyu-l requested review from fegin, wconstab and wwwjn as code owners July 15, 2025 04:49

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 15, 2025

tianyu-l mentioned this pull request Jul 15, 2025

add the forge folder #1387

Merged

refactor FTManager

6c1ce4a

tianyu-l force-pushed the refactor branch from ddc00dc to 6c1ce4a Compare July 15, 2025 06:08

H-Huang approved these changes Jul 15, 2025

wwwjn approved these changes Jul 15, 2025

tianyu-l merged commit 23b8736 into main Jul 15, 2025
7 checks passed

tianyu-l deleted the refactor branch July 15, 2025 20:52

tianyu-l added a commit that referenced this pull request Jul 23, 2025

add the forge folder (#1387)

2e6ab37

depending on #1384 and #1397

idoh pushed a commit to idoh/torchtitan that referenced this pull request Jul 28, 2025

add the forge folder (pytorch#1387)

249ea8c

depending on pytorch#1384 and pytorch#1397

bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025

add the forge folder (pytorch#1387)

35b0f6e

depending on pytorch#1384 and pytorch#1397

joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025

add the forge folder (pytorch#1387)

5582a2f

depending on pytorch#1384 and pytorch#1397

joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025

add the forge folder (pytorch#1387)

b4aebbb

depending on pytorch#1384 and pytorch#1397
