Add vLLM Semantic Router Blog #77
Merged (+115 −0)
Commits (9):

- 5fd1503 docs: add vllm semantic router blog (Xunzhuo)
- 77d7553 resolve reviews (Xunzhuo)
- 71ca42d update: resolve feedbacks (Xunzhuo)
- 6b879b5 resolve reviews (Xunzhuo)
- 2f757dd update: resolve feedbacks (Xunzhuo)
- 11717a3 resolve reviews (Xunzhuo)
- d46785e Merge branch 'main' into add-vsr-blog (simon-mo)
- 87c5b92 update (Xunzhuo)
- fa21de0 Merge branch 'main' into add-vsr-blog (simon-mo)

---
layout: post
title: "Revolution in Large Model Inference: From GPT-5 to vLLM Semantic Router"
author: "vLLM Semantic Router Team"
image: /assets/logos/vllm-logo-text-light.png
---

![](/assets/figures/vllm-semantic-router/vllm-sr.png)

## Industry Status: More Inference ≠ Better

Over the past year, **hybrid inference / automatic routing** has become one of the hottest topics in the large model industry.

Take **GPT-5** as an example. Its real breakthrough is not its parameter count but its **"automatic routing + thinking quota"** mechanism:

* **Light queries → Light models**: "Why is the sky blue?" does not need an expensive reasoning model.

* **Complex/High-value queries → Strong reasoning models**: Legal analysis, financial simulations, and similar tasks are routed to models with Chain-of-Thought capabilities.

The logic behind this mechanism is called **"per-token unit economics"**.

Every generated token is no longer mere "consumption"; it must create value.

Free-tier users receive answers from lightweight models, keeping costs under control.
When a query shows commercial intent (e.g., booking flights or finding legal services), it is routed to high-computation models and agent services that plug directly into transaction flows.

For use cases like this, companies such as OpenAI can participate in the value chain by taking a commission on completed transactions, turning free traffic from a cost center into a monetizable entry point.
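
As a toy illustration of this per-token economics idea, the routing rule can be pictured as comparing the cost of deep reasoning against the value a query is expected to create. The model names, prices, and threshold below are invented for illustration; this is not OpenAI's actual logic.

```python
# Toy sketch of "per-token unit economics": only pay for expensive reasoning
# tokens when the expected value of the answer clearly justifies the cost.
# All numbers and names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    cost_per_1k_tokens: float  # USD, made-up figures

LIGHT = Route("light-chat-model", 0.0005)
HEAVY = Route("deep-reasoning-model", 0.0150)

def choose_route(estimated_value_usd: float, expected_tokens: int) -> Route:
    """Pick the heavy route only when the query's estimated value
    (e.g., a likely booking or purchase) dwarfs the reasoning cost."""
    heavy_cost = HEAVY.cost_per_1k_tokens * expected_tokens / 1000
    return HEAVY if estimated_value_usd > 10 * heavy_cost else LIGHT

print(choose_route(estimated_value_usd=0.0, expected_tokens=300).model)    # casual Q&A -> light-chat-model
print(choose_route(estimated_value_usd=25.0, expected_tokens=2000).model)  # commercial intent -> deep-reasoning-model
```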

Meanwhile, other companies are rapidly following suit:

* **Anthropic Claude 3.7/4**: Fast thinking + slow thinking, with user-controlled switching.

* **Google Gemini 2.5**: Introduces a *thinking budget*, letting enterprises finely control reasoning costs.

* **Alibaba Qwen3**: Switches between thinking and non-thinking modes via instructions.

* **DeepSeek v3.1**: Uses a "single-model, dual-mode" approach that combines dialogue and reasoning.

In summary: the industry is entering a new era where **"not a single token should be wasted"**.

## Recent Research: vLLM Semantic Router

Amid the industry's push toward hybrid inference, we focus on the **open-source inference engine vLLM**.

vLLM has become the de facto standard for deploying large models in industry. However, it lacks fine-grained, semantic-level control: the ability to decide, based on what a query means rather than just its type, how much reasoning it deserves. As a result, developers either enable full reasoning for every request (wasting computation) or disable it entirely (losing accuracy).
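
Today that choice is typically made per request or per deployment, not per query meaning. Here is a hedged illustration of the all-or-nothing toggle, assuming a Qwen3 model served behind vLLM's OpenAI-compatible endpoint; the model name, URL, and the `chat_template_kwargs` flag are assumptions for this sketch.

```python
# Illustrative only: reasoning is switched on or off for the whole request,
# regardless of whether this particular question actually needs it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",  # assumed model name for this sketch
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    # Assumed flag: many hybrid models expose a template-level thinking switch.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```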

Thus, we propose the **vLLM Semantic Router**, bringing GPT-5's "smart routing" capabilities to the open-source ecosystem.

![](/assets/figures/vllm-semantic-router/vllm-sr-1.png)

🔹 **Architecture Design**

1. **Semantic Classification**: A fine-tuned **ModernBERT** intent classifier determines whether a user query requires reasoning.

2. **Smart Routing** (a minimal sketch of this classify-then-route flow follows the list):

    * Simple queries → Served in the fast, non-reasoning mode for quick responses.

    * Complex queries → Routed through Chain-of-Thought for accurate reasoning.

3. **High-Performance Rust Engine**: Built on the Hugging Face Candle framework for high-concurrency, zero-copy, efficient inference.

4. **Cloud-Native Integration**: Integrates easily with Kubernetes / API gateways via the Envoy ext_proc plugin, supporting enterprise-level deployments.
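
The classify-then-route flow of steps 1 and 2 can be pictured with a short conceptual sketch. This is not the project's actual API: the classifier checkpoint, model names, labels, and endpoint below are assumptions for illustration.

```python
# Conceptual sketch of semantic classification + smart routing.
# Assumptions: a hypothetical fine-tuned ModernBERT classifier checkpoint,
# two models served behind one OpenAI-compatible vLLM endpoint, and the
# labels "reasoning" / "simple". None of these names come from the project.
from openai import OpenAI
from transformers import pipeline

# Hypothetical intent classifier: labels a query "reasoning" or "simple".
classifier = pipeline("text-classification", model="example-org/modernbert-intent-router")

FAST_MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # light, non-reasoning model (assumed)
REASONING_MODEL = "deepseek-ai/DeepSeek-R1"       # Chain-of-Thought model (assumed)

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def route_and_answer(query: str) -> str:
    """Classify the query's intent, then send it to the cheaper model
    unless the classifier says it needs step-by-step reasoning."""
    label = classifier(query)[0]["label"]
    model = REASONING_MODEL if label == "reasoning" else FAST_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

print(route_and_answer("Why is the sky blue?"))                    # expected: fast path
print(route_and_answer("Assess the legal risks of this merger."))  # expected: reasoning path
```

In the project itself, this decision is made in front of vLLM through the Envoy ext_proc integration described in step 4, rather than in application code as in this sketch.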

Experimental data shows:

* **Accuracy**: improved by **10.2%**
* **Latency**: reduced by **47.1%**
* **Token Consumption**: decreased by **48.5%**

In knowledge-intensive areas such as business and economics, the accuracy improvement even exceeds **20%**.

## Background of the vLLM Semantic Router Project

The Semantic Router is not the isolated outcome of a single paper, but the result of collaboration and sustained effort across the open-source community:

* The project was initially proposed in early **2025** by **Dr. Huamin Chen**, Distinguished Engineer at **Red Hat**, across multiple open-source communities.

* It was then iterated on and evolved by **Xunzhuo Liu** from **Tencent** and contributed to the vLLM community, becoming part of the vLLM ecosystem.

* **Dr. Chen Wang** from **IBM Research** and **Huamin** will present the project at **KubeCon North America 2025**.

Its mission is to become the "inference accelerator" for open-source large models:

* Ensure accuracy while minimizing unnecessary token consumption.
* Let developers switch seamlessly between fast and slow thinking without fully enabling or disabling reasoning.
* Bring this capability into enterprise production environments through native Kubernetes / Envoy support.

The vLLM Semantic Router is therefore not just a research result but an **important bridge in open-source AI infrastructure**, carrying "academic innovation" directly into "industrial application."

You can start exploring it at the GitHub repository: [https://github.com/vllm-project/semantic-router](https://github.com/vllm-project/semantic-router).

## Future Trends: Cost-Effective, Just-in-Time Inference

The large model industry has shifted from asking "Can the model reason?" to "**When should it reason, and how?**"

* **GPT-5**: Ties computation allocation to commercial value through automatic routing and thinking quotas, driving consumer-side monetization.

* **vLLM Semantic Router**: Brings semantic routing to the open-source engine vLLM, enabling low-latency, low-energy inference scheduling.

The future competitive focus will no longer be "whose model is the largest," but:

* Who can reason at the right moment, at the lowest cost?
* Who can switch between fast and slow thinking more precisely?
* Who can guarantee user experience without wasting computational resources?

The next frontier is therefore **intelligent, self-adjusting reasoning**: no explicit user switches, no hardcoded rules; the model or system decides on its own when to think deeply and when to answer quickly.

## Summary in One Sentence

* **GPT-5**: Uses routing for business, pushing intelligence toward mass adoption.
* **vLLM Semantic Router**: Uses semantic routing for efficiency, driving green AI.
* The next competitive edge: **performing the most appropriate amount of reasoning, at the right time, with the least computation.**