[precompile] add ability to precompile torchtitan models #2092
Conversation
```python
self.job_config = job_config
# ...
if job_config.compile.enable_precompilation:
```
qq: Is this simplefsdp-only, or does it also work for fsdp2 + block-level compile?
Maybe you want to add this config to apply_compile here for fsdp2:
torchtitan/torchtitan/models/llama3/infra/parallelize.py (lines 236 to 247 in cbdb311):
```python
def apply_compile(model: nn.Module, compile_config: CompileConfig):
    """
    Apply torch.compile to each TransformerBlock, which makes compilation efficient due to
    repeated structure. Alternatively one can compile the whole model (after applying DP).
    """
    for layer_id, transformer_block in model.layers.named_children():
        transformer_block = torch.compile(
            transformer_block, backend=compile_config.backend, fullgraph=True
        )
        model.layers.register_module(layer_id, transformer_block)

    logger.info("Compiling each TransformerBlock with torch.compile")
```
Currently only simplefsdp, but this should work with fsdp2 + block-level compile with some additional work.
```python
# Create a unique filename based on model configuration and rank
filename = f"compiled_fn_{model_name}_{model_flavor}_rank_{rank}.pt"
return os.path.join("/tmp", filename)
```
This isn't a realistic file path for training on FB infra, as /tmp is cleared if you restart training.
Agreed. For FB infra, we would either package the artifact into the conda or fbpkg build, or place it in oilfs and keep a reference to it. For Torchtitan, using /tmp seemed acceptable, though I can make the location configurable through an environment variable. Did you have a different approach in mind?
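As a rough sketch of the environment-variable approach (the variable name `TORCHTITAN_PRECOMPILE_DIR` and the helper below are hypothetical, not what this PR implements), the path lookup could look like:

```python
import os


def _get_precompile_path(model_name: str, model_flavor: str, rank: int) -> str:
    # Hypothetical env var; falls back to /tmp, matching the PR's current behavior.
    cache_dir = os.environ.get("TORCHTITAN_PRECOMPILE_DIR", "/tmp")
    os.makedirs(cache_dir, exist_ok=True)
    # Create a unique filename based on model configuration and rank.
    filename = f"compiled_fn_{model_name}_{model_flavor}_rank_{rank}.pt"
    return os.path.join(cache_dir, filename)
```

On FB infra the variable could then point at an oilfs mount or a packaged location, while open-source runs keep the /tmp default.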
```diff
 }
 module_cls = type(
-    f"SimpleFSDP{module.__class__.__name__}",
+    f"SimpleFSDP{module.__class__.__name__}_{_wrap_class_counter}",
```
@bobrenjc93 Will this new precompile option also work with the compiler toolkit experiment?
tianyu-l left a comment:
Sorry, not sure if this is a draft or ready for review, so I'm putting a hold so that it's not accidentally merged as is.
If it's ready for review: the change seems quite intrusive; please consider simplifying it or putting it in the compiler_toolkit experiment folder.
@tianyu-l @aditvenk this PR served as a proof of concept to demonstrate an end-to-end flow where precompilation works with simplefsdp. I'll abandon it for now and shift focus to a more narrowly scoped PR that integrates precompile with the compiler toolkit (which does need some work since
Stack from ghstack (oldest at bottom):
For context, for folks who don't know: precompile is a new technology that allows us to serialize a torch.compile'd model as a file on disk, which we can load in the future to avoid recompilations. It doesn't help with cold starts, but it is quite useful for warm starts and preemptions where the underlying model doesn't change.
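This PR's own save/load mechanism isn't reproduced here, but as a minimal sketch of the warm-start idea — using PyTorch's portable cache-artifact API (`torch.compiler.save_cache_artifacts` / `torch.compiler.load_cache_artifacts`, available in recent PyTorch releases), which serializes the compiler caches rather than the model object itself, and a hypothetical artifact path — the flow looks roughly like this:

```python
import os

import torch
import torch.nn as nn

# Hypothetical location; not the path scheme used in this PR.
ARTIFACT_PATH = "/tmp/compile_artifacts.bin"

model = torch.compile(nn.Linear(16, 16))

# Warm start: seed the compiler caches from a previous run, if an artifact exists.
if os.path.exists(ARTIFACT_PATH):
    with open(ARTIFACT_PATH, "rb") as f:
        torch.compiler.load_cache_artifacts(f.read())

# The first call triggers compilation; on a warm start most of the work is
# served from the loaded caches instead of being recompiled.
model(torch.randn(4, 16))

# Cold start: persist the compiler artifacts so the next run can skip recompilation.
if not os.path.exists(ARTIFACT_PATH):
    artifacts = torch.compiler.save_cache_artifacts()
    if artifacts is not None:
        artifact_bytes, _cache_info = artifacts
        with open(ARTIFACT_PATH, "wb") as f:
            f.write(artifact_bytes)
```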
For simplefsdp dsv3 we see the time taken to get through the first batch go down from 17.99 to 1.73 seconds. For posterity, the command used for testing was:
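```
TORCH_LOGS="all" NGPU=2 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" cache_tlp ./run_train.sh --model.name simple_fsdp.deepseek_v3 --compile.enable --activation_checkpoint.mode "none"
```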
For this to work, you'll need a PyTorch checkout newer than pytorch/pytorch#169242.
This has currently only been tested with dsv3 and simplefsdp. Notably, the current implementation does not yet support PP; this will be added at a later time.