Commit a2c093d

update moe

Signed-off-by: heheda <[email protected]>
1 parent 1897600

1 file changed: +2 −2 lines


_posts/2025-09-11-qwen3-next.md

@@ -52,9 +52,9 @@ In order to manage state for hybrid models like Qwen3-Next, vLLM automatically t
 
 In addition, Flash Linear Attention is based on Triton. Launching Triton kernels can incur significant CPU overheads that disproportionately affect decode-only batches. To overcome this, vLLM enables full CUDA graph mode by default, ensuring good performance in low-latency scenarios
 
-**High-Sparsity MoE: Extreme Efficiency**
+## **High-Sparsity MoE: Extreme Efficiency**
 
-Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**, which can have great throughput and latency.
+Qwen3-Next pushes sparsity further with **MoE layers at 1:50 activation ratio**. In the flagship **80B-A3B model**, only **3B parameters are active per token**. vLLM can have great throughput and latency with the built-in efficient MoE implementation.
 
 
 ## **Multi-Token Prediction (MTP)**
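The sparsity figures in the edited paragraph can be sanity-checked with a line of arithmetic. A minimal sketch (the helper name is ours; per the post, 1:50 is the MoE expert activation ratio, while 3B-of-80B counts all active parameters, including always-on attention and shared weights):

```python
def active_fraction(activated: float, total: float) -> float:
    """Fraction of parameters that run for each token."""
    return activated / total

# A 1:50 expert activation ratio means ~2% of expert parameters are live per token.
print(f"{active_fraction(1, 50):.0%}")  # 2%

# 80B-A3B: ~3B of 80B total parameters active per token. This sits above 1:50
# because attention and other shared weights are active for every token.
print(f"{active_fraction(3, 80):.2%}")  # 3.75%
```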
