Use new DeviceMesh unflatten to rewrite parallel_dims #1660

fegin · 2025-08-29T05:51:59Z

Summary
This PR uses the latest DeviceMesh APIs to simplify the creation of all the meshes.

The design philosophy is as follows:

1. Create one world mesh with shape [world_size,].
2. Create all 1-D submeshes either by unflattening the world mesh or by slicing and flattening other derived meshes.
3. ParallelDims now provides an API, get_mesh(), which accepts a str or a list[str]. Given a str, it directly returns the corresponding 1-D submesh. Given a list[str], it concatenates the named dims to form an n-D device mesh.
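The str versus list[str] dispatch described for get_mesh() can be sketched without torch. This is a toy model only: `ToyParallelDims` is a hypothetical stand-in that stores dim sizes instead of real DeviceMesh objects, and its return value (names plus shape) is an assumption made for illustration, not what the PR's get_mesh() returns.

```python
from __future__ import annotations


class ToyParallelDims:
    """Hypothetical stand-in for ParallelDims; tracks sizes, not real meshes."""

    def __init__(self, degrees: dict[str, int]) -> None:
        # Assumed storage: dim name -> size of its 1-D submesh.
        self._degrees = dict(degrees)

    def get_mesh(self, dims: str | list[str]) -> tuple[tuple[str, ...], tuple[int, ...]]:
        # A str selects a single 1-D submesh; a list[str] is concatenated,
        # in the given order, into an n-D mesh description.
        names = (dims,) if isinstance(dims, str) else tuple(dims)
        unknown = [name for name in names if name not in self._degrees]
        if unknown:
            raise ValueError(f"unknown mesh dim(s): {unknown}")
        return names, tuple(self._degrees[name] for name in names)


dims = ToyParallelDims({"pp": 2, "dp_shard": 4, "tp": 2})
print(dims.get_mesh("tp"))                # (('tp',), (2,))
print(dims.get_mesh(["dp_shard", "tp"]))  # (('dp_shard', 'tp'), (4, 2))
```

Keeping the lookup keyed by dim name means callers never index into the world mesh themselves, which is the simplification the summary points to.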


tianyu-l · 2025-10-17T05:26:38Z

torchtitan/distributed/parallel_dims.py


-        return mesh
+        if self._meshes[dim].size() == 1:
+            return None


Not sure if this will break user expectations. We have had requests that DTensor redistribute on a mesh of size 1 should be a no-op.

But even in current TorchTitan, we don't create a DeviceMesh when the parallelism degree is 1. So it is unclear to me how a DeviceMesh of size 1 would exist.

not in torchtitan, in internal

PyTorch? Then it is okay, right? DeviceMesh still supports the case but TorchTitan makes a stronger assumption in our use case.
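One consequence of the diff in this thread is that callers have to branch on None rather than on mesh size. A minimal sketch, assuming a hypothetical `toy_get_mesh` helper that returns the dim size in place of a real DeviceMesh:

```python
from __future__ import annotations

from typing import Optional


def toy_get_mesh(sizes: dict[str, int], dim: str) -> Optional[int]:
    """Hypothetical stand-in: returns the dim size, or None when the size is 1."""
    size = sizes[dim]
    # Mirrors the diff: a dim of size 1 yields no mesh at all.
    return None if size == 1 else size


def enabled_dims(sizes: dict[str, int]) -> list[str]:
    # Callers test for None instead of comparing the mesh size to 1.
    return [dim for dim in sizes if toy_get_mesh(sizes, dim) is not None]


print(enabled_dims({"pp": 1, "dp_shard": 4, "tp": 2}))  # ['dp_shard', 'tp']
```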


wconstab · 2025-10-28T21:38:42Z

torchtitan/distributed/parallel_dims.py

+        fsdp = self.dp_shard * self.cp
+        efsdp = fsdp * self.tp // (self.etp * self.ep)
+
+        self._world_mesh = init_device_mesh(


does this initialize a world PG?

It may be fine to ignore this for now in torchtitan, but I am wondering: if users want control over world group creation, what would that look like?

cc @fduwjj: are we able to disable the global PG initialization?

I think so. Right now we don't use split, so we can make it a fake PG. But if split is needed, then we need to materialize the world PG anyway.
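The degree arithmetic in the diff at the top of this thread (fsdp and efsdp) can be checked with a short worked example. The function name, the divisibility check, and the sample degrees below are assumptions added for illustration; only the two formulas come from the diff.

```python
def derived_degrees(dp_shard: int, cp: int, tp: int, ep: int, etp: int) -> tuple[int, int]:
    """Return (fsdp, efsdp) using the formulas shown in the diff."""
    fsdp = dp_shard * cp
    numerator = fsdp * tp
    denominator = etp * ep
    # Assumption: reject degrees that do not divide evenly, since the
    # integer division in the diff would otherwise silently truncate.
    if numerator % denominator != 0:
        raise ValueError(
            f"fsdp * tp = {numerator} is not divisible by etp * ep = {denominator}"
        )
    efsdp = numerator // denominator
    return fsdp, efsdp


# dp_shard=4, cp=2 -> fsdp = 8; fsdp * tp = 16; etp * ep = 4 -> efsdp = 4
print(derived_degrees(dp_shard=4, cp=2, tp=2, ep=4, etp=1))  # (8, 4)
```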


tianyu-l

We should modify FLUX train.py as it's in core now.

@ruisizhang123 let's adapt SimpleFSDP after this PR is merged.
Oh, it seems this is being fixed in #1959.

tianyu-l · 2025-10-29T00:15:53Z

torchtitan/distributed/parallel_dims.py


-        return mesh
+        if self._meshes[dim].size() == 1:
+            return None


not in torchtitan, in internal


tianyu-l · 2025-10-29T00:25:02Z

torchtitan/distributed/parallel_dims.py

+        )
+
+        self._meshes = {
+            "pp": dataloading_mesh["pp"],


To confirm that things match my expected behavior:

1. unflatten creates a PG for each sub-dimension, unless backend_override specifies the "fake" backend for that dimension.

2. flatten creates a new mesh and a new PG.

3. Slicing creates a new mesh but reuses the PG created in the parent mesh.

Yes

Yes

Yes
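To make the unflatten step concrete in terms of ranks, here is a torch-free sketch of which ranks end up in the same group along each named dim. It assumes ranks are laid out in row-major order over the unflattened shape and does not model PG creation or reuse; `unflatten_groups` is a hypothetical helper, not a DeviceMesh API.

```python
from __future__ import annotations

from itertools import product
from math import prod


def unflatten_groups(
    world_size: int, shape: tuple[int, ...], names: tuple[str, ...]
) -> dict[str, list[list[int]]]:
    """For each dim name, list the rank groups that vary only along that dim."""
    if len(shape) != len(names):
        raise ValueError("shape and names must have the same length")
    if prod(shape) != world_size:
        raise ValueError(f"shape {shape} does not cover world_size={world_size}")
    groups: dict[str, dict[tuple[int, ...], list[int]]] = {n: {} for n in names}
    # product() iterates with the last index fastest, so enumerate() yields
    # the row-major rank for each coordinate (the layout assumed here).
    for rank, coords in enumerate(product(*(range(s) for s in shape))):
        for d, name in enumerate(names):
            key = coords[:d] + coords[d + 1:]  # coordinates on all other dims
            groups[name].setdefault(key, []).append(rank)
    return {name: list(by_key.values()) for name, by_key in groups.items()}


result = unflatten_groups(8, (2, 2, 2), ("pp", "dp", "tp"))
print(result["tp"])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(result["pp"])  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each inner list corresponds to one group of ranks that would communicate along that dim.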


wwwjn · 2025-10-30T19:03:54Z

torchtitan/distributed/parallel_dims.py

+        self._world_mesh = init_device_mesh(
+            device_type, (self.world_size,), mesh_dim_names=("world",)
+        )
+        dataloading_mesh = unflatten_mesh(


Curious what will happen if self.pp * batch * self.cp * self.tp != world_size? Will the _unflatten() fail?

Yes, it will fail
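Since the unflatten fails when the degrees do not multiply to world_size, a caller could validate up front to get a clearer error. A minimal sketch; `check_unflatten_shape` is a hypothetical helper and the dim names are taken from the question above.

```python
from __future__ import annotations

from math import prod


def check_unflatten_shape(world_size: int, degrees: dict[str, int]) -> None:
    """Raise early if the per-dim degrees cannot tile the world mesh."""
    total = prod(degrees.values())
    if total != world_size:
        detail = " * ".join(f"{name}={size}" for name, size in degrees.items())
        raise ValueError(f"{detail} = {total}, expected world_size={world_size}")


check_unflatten_shape(8, {"pp": 2, "batch": 2, "cp": 1, "tp": 2})  # passes silently
```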


meta-cla bot added the CLA Signed label Aug 29, 2025

wconstab reviewed Aug 29, 2025

torchtitan/distributed/parallel_dims.py (outdated)

wconstab reviewed Aug 29, 2025

torchtitan/distributed/parallel_dims.py (outdated)

tianyu-l reviewed Aug 29, 2025

torchtitan/distributed/parallel_dims.py (outdated)

ezyang reviewed Aug 30, 2025

torchtitan/distributed/parallel_dims.py (outdated)

ezyang reviewed Aug 30, 2025

torchtitan/distributed/parallel_dims.py (outdated)

fegin force-pushed the chienchin/new_device_mesh branch 7 times, most recently from 12eca61 to 19e4a23 Compare October 15, 2025 20:39

tianyu-l mentioned this pull request Oct 17, 2025

[TorchComms] Support training with EP #1902

Merged

tianyu-l reviewed Oct 17, 2025


fegin force-pushed the chienchin/new_device_mesh branch from 19e4a23 to 178bc11 Compare October 28, 2025 20:34

fegin marked this pull request as ready for review October 28, 2025 21:01

fegin requested a review from wwwjn as a code owner October 28, 2025 21:01

wconstab reviewed Oct 28, 2025

torchtitan/distributed/parallel_dims.py (outdated)

wconstab reviewed Oct 28, 2025

torchtitan/distributed/utils.py

tianyu-l reviewed Oct 29, 2025

wwwjn reviewed Oct 30, 2025

tianyu-l mentioned this pull request Nov 1, 2025

Why is the ep mesh derived from a factoring of the dp mesh, instead of its own dimension? #1977

Open

fegin added 5 commits November 3, 2025 14:13

Use new DeviceMesh unflatten to rewrite parallel_dims

116d12c

This is a demonstration of how parallel_dims will be when using pytorch/pytorch#161224 stack. ghstack-source-id: d29d2e2 Pull-Request: #1885

misc

f59bd9e

ghstack-source-id: f7c3fef Pull-Request: #1886

Delete legacy code

2064561

ghstack-source-id: cf7ad2a Pull-Request: #1887

misc

b459b67

ghstack-source-id: f7c3fef Pull-Request: #1888

misc

7fd8914

ghstack-source-id: 6173cc5 Pull-Request: #1889

fegin added 9 commits November 3, 2025 14:13

lint

4649ee5

ghstack-source-id: 065ffd4 Pull-Request: #1890

misc

b3b7036

ghstack-source-id: 08dd4a6 Pull-Request: #1891

misc

f69e6a4

ghstack-source-id: dcf962b Pull-Request: #1892

Another round

5b3c741

ghstack-source-id: c9fdc96 Pull-Request: #1893

misc

bb85486

misc

b53d04c

misc

1ca059e

misc

9a71cff

misc

a67e87a

fegin force-pushed the chienchin/new_device_mesh branch from 20910ef to a67e87a Compare November 3, 2025 23:19

fegin added 2 commits November 3, 2025 15:36

misc

9557332

misc

bcbe1f3

fegin requested review from allenwang28, ebsmothers, joecummings and pbontrager as code owners November 4, 2025 00:05
