_posts/2025-08-20-torch-compile.md (3 additions, 3 deletions)
@@ -148,9 +148,9 @@ A common pattern in quantized MLPs is SiLU activation followed by a quantized do
When using Tensor Parallelism (TP), the linear layer shards its weights across GPUs and each GPU computes a partial matrix multiplication result, which must then be synchronized across GPUs. When the compute and communication pieces run as separate kernels, we incur communication overhead: the GPUs sit idle while waiting for the communication results to arrive over the network.
- Instead, we can overlap computation and communication by using fused GEMM+collective kernels. One example of such kernels are the GEMM+reduce\_scatter and all\_gather+GEMM kernels. However, to use those, we have to perform intrusive modifications on the fx graph to transform it into a fusion-friendly representation. This includes parallelizing operations between two GEMMs across GPUs.
+ Instead, we can overlap computation and communication by using fused GEMM+collective kernels. Examples of such kernels are the GEMM+reduce\_scatter and all\_gather+GEMM kernels. To utilize these kernels, we have to transform the computation graph, including parallelizing operations between two GEMMs across GPUs.
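To make the target pattern concrete, here is a minimal sketch (our illustration, not vLLM's pass or kernel code; the function name, shapes, and process-group setup are assumptions) of the unfused row-parallel linear that a GEMM+reduce\_scatter fusion replaces. The GEMM and the collective run as separate kernels, so the GPU waits on the network between them:

```python
# Illustrative sketch (not vLLM code): the unfused row-parallel linear that a
# GEMM+reduce_scatter fusion targets. Names, shapes, and process-group setup
# are assumptions; an initialized default process group is assumed.
import torch
import torch.distributed as dist

def row_parallel_linear_unfused(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
    """x: (tokens, in_features // tp); w_shard: (out_features, in_features // tp)."""
    partial = x @ w_shard.t()  # GEMM over this rank's weight shard -> partial sums
    tokens, out_features = partial.shape
    out = torch.empty(
        tokens // dist.get_world_size(), out_features,
        dtype=partial.dtype, device=partial.device,
    )
    # Separate collective kernel: sums the partial results across ranks and
    # re-shards the output along the token dimension. The GPU sits idle while
    # this communication completes.
    dist.reduce_scatter_tensor(out, partial, op=dist.ReduceOp.SUM)
    return out
```

A fused GEMM+reduce\_scatter kernel instead communicates finished chunks of the matmul output while the remaining chunks are still being computed, hiding the communication latency behind compute.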
- If we were to implement this kind of optimization in model definitions, we would have to touch every model vLLM supports (there are hundreds of them\!). It would be intrusive, increase developer friction, and be unlikely to be accepted into vLLM in the first place. Instead, by implementing the optimization in torch.compile, it is contained to just 2 custom passes and can be turned on using CLI flags, providing better performance for all the models supported by vLLM.
+ If we were to implement this kind of optimization in model definitions, we would have to touch every model vLLM supports (there are hundreds of them\!). It would be intrusive, break abstractions, increase developer friction, and be unlikely to be accepted into vLLM in the first place. Instead, by implementing the optimization in torch.compile, it is contained to just 2 custom passes and can be turned on using CLI flags, providing better performance for all models supported by vLLM.
> [!NOTE]
> This optimization was implemented in full by a community member [@cascade812](https://github.com/cascade812), whom we thank for the incredible contribution. More information on Async TP can be found on the [PyTorch blog](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487).
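As a rough idea of what turning this on can look like from user code, here is a sketch under the assumption that recent vLLM versions expose `enable_async_tp` and `enable_sequence_parallelism` in the compilation pass config; the model name is a placeholder and the exact keys may differ in your version:

```python
# Rough sketch of enabling the fusion passes from Python. The config keys
# ("enable_sequence_parallelism", "enable_async_tp") and the model name are
# assumptions based on recent vLLM versions and may differ in yours.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,
    compilation_config={
        "pass_config": {
            "enable_sequence_parallelism": True,  # reshard activations around collectives
            "enable_async_tp": True,              # fuse GEMM + collective kernels
        },
    },
)
```

The same settings can also be supplied on the command line through vLLM's compilation config option; see the vLLM documentation for the authoritative flag names in your version.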
@@ -208,4 +208,4 @@ The goal of vLLM’s torch.compile integration is provide good baseline performa
torch.compile provides a powerful and accessible way to accelerate PyTorch models. In vLLM, it’s a core part of the inference pipeline. Combined with caching, dynamic shape support, CUDA Graphs, and custom passes, it enables efficient, scalable LLM serving across any environment.
As the compiler stack matures and support for new hardware expands, torch.compile and vLLM will continue to push the boundaries of inference performance—while keeping model development clean and modular.
- Read more about torch.compile in the [PyTorch documentation](https://docs.pytorch.org/docs/stable/generated/torch.compile.html) and the [vLLM documentation](https://docs.vllm.ai/en/latest/design/v1/torch_compile.html), and join the [#sig-torch-compile channel](https://vllm-dev.slack.com/archives/C08K1FAHFPH) on [vLLM Slack](http://slack.vllm.ai) to ask questions, share feedback, and contribute your own custom passes!
+ Read more about torch.compile in the [PyTorch documentation](https://docs.pytorch.org/docs/stable/generated/torch.compile.html) and the [vLLM documentation](https://docs.vllm.ai/en/latest/design/v1/torch_compile.html), and join the [#sig-torch-compile channel](https://vllm-dev.slack.com/archives/C08K1FAHFPH) on [vLLM Slack](http://slack.vllm.ai) to ask questions, share feedback, and contribute your own custom passes!