Commit a2c093d

update moe

Signed-off-by: heheda <[email protected]>
1 parent 1897600

1 file changed: +2 −2 lines


_posts/2025-09-11-qwen3-next.md

@@ -52,9 +52,9 @@ In order to manage state for hybrid models like Qwen3-Next, vLLM automatically t
 
 In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios
 
-**High-Sparsity MoE: Extreme Efficiency**
+## **High-Sparsity MoE: Extreme Efficiency**
 
-Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**, which can have great throughput and latency.
+Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM can have great throughput and latency with the built-in efficient MoE implementation.
 
 
 ## **Multi-Token Prediction (MTP)**
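The sparsity figures in the edited paragraph can be sanity-checked with a line of arithmetic. A minimal sketch (the helper name is ours; per the post, 1:50 is the MoE expert activation ratio, while 3B-of-80B counts all active parameters, including always-on attention and shared weights):

```python
def active_fraction(activated: float, total: float) -> float:
    """Fraction of parameters that run for each token."""
    return activated / total

# A 1:50 expert activation ratio means ~2% of expert parameters are live per token.
print(f"{active_fraction(1, 50):.0%}")  # 2%

# 80B-A3B: ~3B of 80B total parameters active per token. This sits above 1:50
# because attention and other shared weights are active for every token.
print(f"{active_fraction(3, 80):.2%}")  # 3.75%
```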
