From f47e955b0cba458127c583c4aca39f1ae77f4605 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Fri, 5 Sep 2025 13:23:11 +0530
Subject: [PATCH 01/10] aot comments.

---
 zerogpu-aoti.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 75 insertions(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index b9d6fe8123..cda6fc747f 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -33,6 +33,7 @@ In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in Zer
  - [Dynamic shapes](#dynamic-shapes)
  - [Multi-compile / shared weights](#multi-compile--shared-weights)
  - [FlashAttention-3](#flashattention-3)
+ - [Regional compilation]()
  - [AoT compiled ZeroGPU Spaces demos](#aot-compiled-zerogpu-spaces-demos)
  - [Conclusion](#conclusion)
  - [Resources](#resources)
@@ -340,6 +341,75 @@ It tries to load a kernel from the [`kernels-community/vllm-flash-attn3`](https:
 Here is a [fully working example of an FA3 attention processor](https://gist.github.com/sayakpaul/ff715f979793d4d44beb68e5e08ee067#file-fa3_qwen-py) for the Qwen-Image model.
 
+### Regional compilation
+
+> [!TIP]
+> We suggest using regional compilation as opposed to full model compilation, especially when the speed benefits are similar.
+
+So far, we have been compiling the full model. Depending on the model, full model compilation can lead to significant cold start times, and long cold starts make the
+development experience unpleasant.
+
+We can also choose to compile _regions_ within a model, significantly reducing the cold start times while retaining almost all the benefits of full model compilation. Regional
+compilation becomes promising when a model has repeated blocks of computation. A standard
+language model, for example, has a number of identically structured Transformer blocks.
+
+In our example, we can compile the repeated blocks of the Flux transformer ahead of time like so. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`. We start by capturing inputs for just these blocks:
+
+```py
+# Capturing inputs for just the first block of each kind is enough, as the input
+# structure remains the same for the others.
+with spaces.aoti_capture(pipe.transformer.transformer_blocks[0]) as call_double_blocks:
+    pipe("arbitrary example prompt")
+
+with spaces.aoti_capture(
+    pipe.transformer.single_transformer_blocks[0]
+) as call_single_blocks:
+    pipe("arbitrary example prompt")
+```
+
+We then perform compilation after exporting a `torch.export.ExportedProgram`:
+
+```py
+exported_double = torch.export.export(
+    mod=pipe.transformer.transformer_blocks[0],
+    args=call_double_blocks.args,
+    kwargs=call_double_blocks.kwargs,
+)
+exported_single = torch.export.export(
+    mod=pipe.transformer.single_transformer_blocks[0],
+    args=call_single_blocks.args,
+    kwargs=call_single_blocks.kwargs,
+)
+
+compiled_double = spaces.aoti_compile(exported_double)
+compiled_single = spaces.aoti_compile(exported_single)
+```
+
+Note that we are only compiling the first blocks within `transformer.transformer_blocks` and
+`transformer.single_transformer_blocks`, as each block within these shares the same
+configuration. So, the compilation-optimized graph can be reused.
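+
+A quick, illustrative way to convince yourself of this sharing assumption (using the `pipe` object from earlier; identically configured `nn.Module`s print identical structures):
+
+```py
+# Every double block should have the exact same architecture as the first one,
+# and likewise for the single blocks.
+first_double = pipe.transformer.transformer_blocks[0]
+assert all(str(b) == str(first_double) for b in pipe.transformer.transformer_blocks)
+
+first_single = pipe.transformer.single_transformer_blocks[0]
+assert all(str(b) == str(first_single) for b in pipe.transformer.single_transformer_blocks)
+```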
+During loading, we
+make use of this optimized graph while reusing the parameters for each block, as shown below:
+
+```py
+from spaces.zero.torch.aoti import ZeroGPUCompiledModel, ZeroGPUWeights
+
+for block in pipe.transformer.transformer_blocks:
+    weights = ZeroGPUWeights(block.state_dict())
+    compiled_block = ZeroGPUCompiledModel(compiled_double.archive_file, weights)
+    block.forward = compiled_block
+
+for block in pipe.transformer.single_transformer_blocks:
+    weights = ZeroGPUWeights(block.state_dict())
+    compiled_block = ZeroGPUCompiledModel(compiled_single.archive_file, weights)
+    block.forward = compiled_block
+```
+
+And we should be ready to go 🚀 You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a more complete example.
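+
+As a quick, illustrative sanity check that the compiled blocks are actually being picked up (absolute numbers depend on the GPU and the number of steps):
+
+```py
+import time
+
+start = time.perf_counter()
+image = pipe("an astronaut riding a horse", num_inference_steps=28).images[0]
+print(f"End-to-end generation took {time.perf_counter() - start:.1f}s")
+```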
+
+> [!TIP]
+> 💡 Compiling regions ahead of time like this reduces the cold start timing from _X seconds to Y seconds_, while delivering almost identical speedups.
+
 ## AoT compiled ZeroGPU Spaces demos
 
 ### Speedup comparison
@@ -350,7 +420,11 @@ Here is a [fully working example of an FA3 attention processor](https://gist.git
 - [FLUX.1 Kontext](https://huggingface.co/spaces/zerogpu-aoti/FLUX.1-Kontext-Dev)
 - [QwenImage Edit](https://huggingface.co/spaces/multimodalart/Qwen-Image-Edit-Fast)
 - [Wan 2.2](https://huggingface.co/spaces/zerogpu-aoti/wan2-2-fp8da-aoti-faster)
-- [LTX Video](https://huggingface.co/spaces/zerogpu-aoti/ltx-dev-fast)
+
+### Regional compilation
+- [Regional compilation recipe](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html)
+- [Native integration in Diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/fp16)
+- [More performance numbers](https://pytorch.org/blog/torch-compile-and-diffusers-a-hands-on-guide-to-peak-performance/)
 
 ## Conclusion

From fe67ff3a65b6e3365e6c190fc32b44cd7f6bf1fb Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Fri, 5 Sep 2025 13:51:27 +0530
Subject: [PATCH 02/10] up

---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index cda6fc747f..cf1391ebf5 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -408,7 +408,7 @@ for block in pipe.transformer.transformer_blocks:
 And we should be ready to go 🚀 You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a more complete example.
 
 > [!TIP]
-> 💡 Compiling regions ahead of time like this reduces the cold start timing from _X seconds to Y seconds_, while delivering almost identical speedups.
+> 💡 For Flux.1-Dev, compiling regions ahead of time like this reduces the cold start timing from _103 seconds to 23 seconds_, while delivering almost identical speedups.
 
 ## AoT compiled ZeroGPU Spaces demos

From b190ad4542fe960635a038e4b3a8708a4756a0f0 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Fri, 5 Sep 2025 13:52:40 +0530
Subject: [PATCH 03/10] up

---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index cf1391ebf5..e279f109f6 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -33,7 +33,7 @@ In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in Zer
  - [Dynamic shapes](#dynamic-shapes)
  - [Multi-compile / shared weights](#multi-compile--shared-weights)
  - [FlashAttention-3](#flashattention-3)
- - [Regional compilation]()
+ - [Regional compilation](#regional-compilation)
  - [AoT compiled ZeroGPU Spaces demos](#aot-compiled-zerogpu-spaces-demos)
  - [Conclusion](#conclusion)
  - [Resources](#resources)

From 783752b9b7b1e8894d5ae1073bd7d65d226360cc Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Wed, 10 Sep 2025 10:19:27 +0530
Subject: [PATCH 04/10] add a section on reusing a compiled model.

---
 zerogpu-aoti.md | 74 ++++++++++++------------------------------------
 1 file changed, 17 insertions(+), 57 deletions(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index e279f109f6..1d06cee681 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -343,72 +343,31 @@ Here is a [fully working example of an FA3 attention processor](https://gist.git
 
 ### Regional compilation
 
+So far, we have been compiling the full model. Depending on the model, full model compilation can lead to significantly long cold start times, which make the development experience unpleasant.
+
+We can also choose to compile _regions_ within a model, significantly reducing the cold start times while retaining almost all the benefits of full model compilation. Regional compilation becomes promising when
+a model has repeated blocks of computation. A standard language model, for example, has a number of
+identically structured Transformer blocks.
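+
+For concreteness, these are the repeated block stacks in question (illustrative; the counts shown are for Flux.1-Dev):
+
+```py
+# The Flux transformer exposes its repeated blocks as two ModuleLists.
+print(len(pipe.transformer.transformer_blocks))         # 19 FluxTransformerBlock
+print(len(pipe.transformer.single_transformer_blocks))  # 38 FluxSingleTransformerBlock
+```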
+
+In our example, we can compile the repeated blocks of the Flux transformer ahead of time like so. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
+
+You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a complete example.
+
+> [!TIP]
+> 💡 For Flux.1-Dev, compiling regions ahead of time like this reduces the cold start timing from _103 seconds to 23 seconds_, while delivering almost identical speedups.
+
+### Use a compiled graph from the Hub
+
+Once a model (or even a model block) is compiled ahead of time, we can serialize the compiled graph module
+as an artifact and reuse it later. In the context of a ZeroGPU-powered demo on Spaces, this will significantly
+cut down the demo startup time.
+
+To keep the storage light, we can just save the compiled model graph without including any model parameters
+inside the artifact.
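+
+Below is an illustrative sketch of this workflow. The repo id and filename are placeholders, `upload_file`/`hf_hub_download` are standard `huggingface_hub` utilities, and `compiled_double` is the artifact produced by `spaces.aoti_compile` in the regional compilation example above:
+
+```py
+from huggingface_hub import hf_hub_download, upload_file
+from spaces.zero.torch.aoti import ZeroGPUCompiledModel, ZeroGPUWeights
+
+# One-time: push the parameter-free compiled graph to a Hub repo.
+upload_file(
+    path_or_fileobj=compiled_double.archive_file,
+    path_in_repo="compiled_double.pt2",
+    repo_id="your-username/flux-compiled-graphs",
+)
+
+# At demo startup: fetch the graph and re-attach each block's own weights.
+archive_file = hf_hub_download(
+    repo_id="your-username/flux-compiled-graphs", filename="compiled_double.pt2"
+)
+for block in pipe.transformer.transformer_blocks:
+    weights = ZeroGPUWeights(block.state_dict())
+    block.forward = ZeroGPUCompiledModel(archive_file, weights)
+```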
+
+Check out [this collection](TODO) that shows a full workflow of obtaining the compiled model graph, pushing it
+to the Hub, and then using it to build a demo.
+
 ## AoT compiled ZeroGPU Spaces demos
 
@@ -437,6 +396,7 @@ We demonstrate speedups with Flux.1-Dev, but these techniques are not limited to
 - Visit our [ZeroGPU-AOTI org on the Hub](https://huggingface.co/zerogpu-aoti) to refer to a collection of demos that leverage the techniques discussed in this post.
 - Browse the `spaces.aoti_*` APIs' [source code](https://pypi-browser.org/package/spaces/spaces-0.40.1-py3-none-any.whl/spaces/zero/torch/aoti.py) to learn more about the interface
 - Check out the [Kernels Community org on the Hub](https://huggingface.co/kernels-community)
+- Learn more about regional compilation in the [PyTorch regional compilation recipe](https://pytorch.org/tutorials/recipes/regional_compilation.html)
 - Upgrade to [Pro](https://huggingface.co/pro) on Hugging Face to create your own ZeroGPU Spaces (and get 25 minutes of H200 usage every day)
 
 *Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post.*

From de28ccd95231993a47ba88f8b6183281eb4a4a39 Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Wed, 10 Sep 2025 10:20:07 +0530
Subject: [PATCH 05/10] toc.

---
 zerogpu-aoti.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index 1d06cee681..8704367b8d 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -34,6 +34,7 @@ In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in Zer
  - [Multi-compile / shared weights](#multi-compile--shared-weights)
  - [FlashAttention-3](#flashattention-3)
  - [Regional compilation](#regional-compilation)
+ - [Use a compiled graph from the Hub](#use-a-compiled-graph-from-the-hub)
  - [AoT compiled ZeroGPU Spaces demos](#aot-compiled-zerogpu-spaces-demos)
  - [Conclusion](#conclusion)
  - [Resources](#resources)

From fac2ac61c21639565785b8de1caa76fed61ef17e Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Wed, 10 Sep 2025 20:13:13 +0530
Subject: [PATCH 06/10] Update zerogpu-aoti.md

Co-authored-by: Charles
---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index 8704367b8d..867fa752d1 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -355,7 +355,7 @@ In our example, we can compile the repeated blocks of the Flux transformer ahead
 You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a complete example.
 
 > [!TIP]
-> 💡 For Flux.1-Dev, compiling regions ahead of time like this reduces the cold start timing from _103 seconds to 23 seconds_, while delivering almost identical speedups.
+> 💡 For Flux.1-Dev, switching to regional compilation reduces the compilation time from _6 minutes_ to just _30 seconds_ while delivering identical speedups.

From 57aa7217166785a612774692290e3a28ecd249cf Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 11 Sep 2025 07:58:25 +0530
Subject: [PATCH 07/10] up

---
 zerogpu-aoti.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index 867fa752d1..e3479629a0 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -352,7 +352,7 @@ identically structured Transformer blocks.
 
 In our example, we can compile the repeated blocks of the Flux transformer ahead of time like so. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
 
-You can check out [this Space](https://huggingface.co/spaces/zerogpu-aoti/Qwen-Image-Edit-AoT-Regional) for a complete example.
+You can check out [this Space](https://huggingface.co/spaces/cbensimon/FLUX.1-dev-fa3-aoti/tree/main) for a complete example.
 
 > [!TIP]
 > 💡 For Flux.1-Dev, switching to regional compilation reduces the compilation time from _6 minutes_ to just _30 seconds_ while delivering identical speedups.
@@ -361,7 +361,7 @@ You can check out [this Space](https://huggingface.co/spaces/cbensimon/FLUX.1-dev-fa3-aoti/tree/main) for a complete example.
 ### Use a compiled graph from the Hub
 
 Once a model (or even a model block) is compiled ahead of time, we can serialize the compiled graph module
 as an artifact and reuse it later. In the context of a ZeroGPU-powered demo on Spaces, this will significantly
-cut down the demo startup time.
+cut down the demo startup time by skipping compilation.

From 5fe5025bce124c21cbbe6588ce57d8b1838d56ed Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 11 Sep 2025 16:52:48 +0530
Subject: [PATCH 08/10] up

---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index e3479629a0..d794bfe130 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -366,7 +366,7 @@ cut down the demo startup time by skipping compilation.
 To keep the storage light, we can just save the compiled model graph without including any model parameters
 inside the artifact.
 
-Check out [this collection](TODO) that shows a full workflow of obtaining the compiled model graph, pushing it
+Check out [this collection](https://huggingface.co/collections/zerogpu-aoti/using-compiled-graph-from-the-hub-68c2afcc03de7609f9f91e35) that shows a full workflow of obtaining the compiled model graph, pushing it
 to the Hub, and then using it to build a demo.

From b8b6779b61de589e6c6ab521b093d2dd03c3025c Mon Sep 17 00:00:00 2001
From: sayakpaul
Date: Thu, 11 Sep 2025 21:00:48 +0530
Subject: [PATCH 09/10] up

---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index d794bfe130..762ef868dd 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -400,4 +400,4 @@ We demonstrate speedups with Flux.1-Dev, but these techniques are not limited to
 - Learn more about regional compilation in the [PyTorch regional compilation recipe](https://pytorch.org/tutorials/recipes/regional_compilation.html)
 - Upgrade to [Pro](https://huggingface.co/pro) on Hugging Face to create your own ZeroGPU Spaces (and get 25 minutes of H200 usage every day)
 
-*Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post.*
+*Acknowledgements: Thanks to ChunTe Lee for creating an awesome thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post. Thanks to Angela Yi from the PyTorch team for helping us with AOT guidance.*

From ad2fbc4ca0555dea2b83681acf4529d09646efbc Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Thu, 11 Sep 2025 21:02:20 +0530
Subject: [PATCH 10/10] Update zerogpu-aoti.md

Co-authored-by: Pedro Cuenca
---
 zerogpu-aoti.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/zerogpu-aoti.md b/zerogpu-aoti.md
index 762ef868dd..0d8b4b2d28 100644
--- a/zerogpu-aoti.md
+++ b/zerogpu-aoti.md
@@ -350,7 +350,7 @@ We can also choose to compile _regions_ within a model, significantly reducing t
 a model has repeated blocks of computation. A standard language model, for example, has a number of
 identically structured Transformer blocks.
 
-In our example, we can compile the repeated blocks of the Flux transformer ahead of time like so. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
+In our example, we can compile the repeated blocks of the Flux transformer ahead of time, and propagate the compiled graph to the remaining repeated blocks. The [Flux Transformer](https://github.com/huggingface/diffusers/blob/c2e5ece08bf22d249c62e964f91bc326cf9e3759/src/diffusers/models/transformers/transformer_flux.py) has two kinds of repeated blocks: `FluxTransformerBlock` and `FluxSingleTransformerBlock`.
 
 You can check out [this Space](https://huggingface.co/spaces/cbensimon/FLUX.1-dev-fa3-aoti/tree/main) for a complete example.