Add vLLM Semantic Router Blog #77
Merged
---
layout: post
title: "vLLM Semantic Router: Next Phase in LLM inference"
author: "vLLM Semantic Router Team"
image: /assets/logos/vllm-logo-text-light.png
---

![](/assets/figures/semantic-router/vllm-semantic-router.png)

## Industry Status: More Inference ≠ Better Inference

Over the past year, **hybrid reasoning and automatic routing** have emerged as some of the most discussed topics in the large-model ecosystem.

Take **GPT-5** as an example. Its most significant breakthrough is not simply the number of parameters, but the introduction of **automatic routing and thinking quotas**:

* **Light queries → Lightweight models**: For example, "Why is the sky blue?" does not require an expensive reasoning model.
* **Complex/High-value queries → Advanced models**: Tasks such as legal analysis or financial simulations are routed to models with Chain-of-Thought capabilities.

The principle behind this is often described as **per-token unit economics**: every token generated must deliver value rather than being treated as a pure computational expense.

For example:

* Free-tier users receive answers from lightweight models, keeping costs under control.
* When a query indicates commercial intent (e.g., booking flights or finding legal services), it is routed to high-compute models or agent services directly integrated into transaction flows.

In these cases, companies like OpenAI can participate in the value chain by taking a commission on completed transactions, transforming free usage from a cost center into a monetizable entry point.

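To make the unit-economics argument concrete, here is a small sketch of how routing changes the blended cost of serving traffic. All numbers are hypothetical assumptions for illustration, not actual vendor pricing or traffic data:

```python
# Illustrative sketch of per-token unit economics under routing.
# Prices, traffic mix, and token counts are hypothetical assumptions.

def blended_cost(requests, light_share, light_price, heavy_price,
                 tokens_per_request=500):
    """Cost of serving `requests` when `light_share` of them go to a
    lightweight model and the rest to an expensive reasoning model.
    Prices are per 1M output tokens."""
    tokens = requests * tokens_per_request
    light = tokens * light_share * light_price / 1_000_000
    heavy = tokens * (1 - light_share) * heavy_price / 1_000_000
    return light + heavy

# Assume 1M requests/day, 80% simple queries, $0.30 vs $10 per 1M tokens.
routed = blended_cost(1_000_000, 0.8, 0.30, 10.0)
always_heavy = blended_cost(1_000_000, 0.0, 0.30, 10.0)
print(f"routed: ${routed:,.0f}/day vs always-heavy: ${always_heavy:,.0f}/day")
```

Under these assumed numbers, routing most traffic to the lightweight model cuts daily serving cost by several times, which is the economic pressure driving the designs below.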
Other companies are adopting similar strategies:

* **Anthropic Claude 3.7/4**: Combines "fast thinking" and "slow thinking" with user-controlled toggles.
* **Google Gemini 2.5**: Introduces a *thinking budget*, giving enterprises fine-grained control over inference costs.
* **Alibaba Qwen3**: Explores instruction-based switching between reasoning and non-reasoning modes.
* **DeepSeek v3.1**: Implements a "single-model, dual-mode" design that merges dialogue and reasoning.

In short: the industry is entering an era where **no token should be wasted**.

---

## Recent Research: vLLM Semantic Router

Amid this shift toward hybrid reasoning, we focus on the **open-source inference engine vLLM**.

While vLLM has become the de facto standard for deploying large models, it lacks fine-grained, semantic-level control: the ability to make routing decisions based on meaning rather than query type alone. Developers are often forced to either enable full reasoning for every request (wasting computation) or disable it entirely (sacrificing accuracy).

To address this, we propose the **vLLM Semantic Router**, which brings GPT-5-style "smart routing" to the open-source ecosystem.

 | ||
|
||
### Architecture Design | ||
|
||
Xunzhuo marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
1. **Semantic Classification**: Uses a fine-tuned **ModernBERT** intent classifier to determine whether a query requires deep reasoning.
2. **Smart Routing**:
   * Simple queries → fast inference mode.
   * Complex queries → Chain-of-Thought for accurate reasoning.
3. **High-Performance Engine**: Built with Rust and the Hugging Face Candle framework, enabling high concurrency and zero-copy efficiency.
4. **Cloud-Native Integration**: Integrates seamlessly with Kubernetes and API gateways via the Envoy `ext_proc` plugin for enterprise deployments.

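The classify-then-route loop in steps 1–2 can be sketched as follows. The keyword matcher is a toy stand-in for the fine-tuned ModernBERT classifier, and the endpoint names and `enable_thinking` flag are hypothetical placeholders, not the router's actual API:

```python
# Sketch of the semantic classify-then-route decision (steps 1-2 above).
# The keyword classifier is a toy stand-in for the ModernBERT intent
# classifier; endpoint names and the thinking flag are hypothetical.

from dataclasses import dataclass

@dataclass
class Route:
    endpoint: str           # which vLLM deployment serves the request
    enable_thinking: bool   # whether to request chain-of-thought reasoning

def classify(query: str) -> str:
    """Toy intent classifier: returns 'reasoning' or 'simple'.
    A real deployment would call the fine-tuned ModernBERT model."""
    reasoning_markers = ("prove", "analyze", "simulate", "derive", "contract")
    q = query.lower()
    return "reasoning" if any(m in q for m in reasoning_markers) else "simple"

def route(query: str) -> Route:
    if classify(query) == "reasoning":
        # Complex queries -> chain-of-thought on the stronger model
        return Route(endpoint="vllm-large", enable_thinking=True)
    # Simple queries -> fast mode on a lightweight model
    return Route(endpoint="vllm-light", enable_thinking=False)

print(route("Why is the sky blue?"))
print(route("Analyze the liability clauses in this contract."))
```

The key design point is that the routing decision is made *before* generation starts, from the query's meaning alone, so no reasoning tokens are spent deciding whether to reason.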
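For the cloud-native integration in step 4, the router sits behind Envoy's standard `ext_proc` filter, which streams request headers and bodies to an external gRPC service. A minimal sketch of such a filter entry is shown below; the cluster name and processing modes are illustrative assumptions, not the project's shipped configuration:

```yaml
# Hedged sketch: attach an external processor to Envoy's HTTP filter chain.
# `semantic_router` is an assumed cluster pointing at the router's gRPC port.
http_filters:
- name: envoy.filters.http.ext_proc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
    grpc_service:
      envoy_grpc:
        cluster_name: semantic_router
    processing_mode:
      request_header_mode: SEND      # let the router inspect headers
      request_body_mode: BUFFERED    # buffer the prompt for classification
```

Because `ext_proc` can mutate headers before upstream selection, the router can rewrite the target model or inject a reasoning flag without any change to the client or to vLLM itself.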
Experimental results show:

* **Accuracy**: +10.2%
* **Latency**: –47.1%
* **Token Consumption**: –48.5%

In knowledge-intensive areas such as business and economics, accuracy improvements can exceed **20%**.

---

## Project Background

The Semantic Router is not the isolated result of a single paper but a collaborative outcome of sustained community contributions:

* Originally proposed by **Dr. Chen Huamin**, Distinguished Engineer at **Red Hat**, in early **2025** across multiple open-source communities.
* Iterated and further developed by **Xunzhuo Liu** at **Tencent**, and later contributed to the vLLM community.
* **Dr. Wang Chen** of **IBM Research** and **Dr. Chen Huamin** will present the project at **KubeCon North America 2025**.

The mission is clear: to serve as an **inference accelerator** for open-source large models:

* Preserve accuracy while minimizing unnecessary token usage.
* Enable seamless switching between "fast" and "slow" thinking modes without fully enabling or disabling reasoning.
* Deliver production-ready enterprise integration through native Kubernetes and Envoy support.

The vLLM Semantic Router is therefore not just a research milestone but an **essential bridge for open-source AI infrastructure**, translating **academic innovation into industrial application**.

You can start exploring the project here: [https://github.com/vllm-project/semantic-router](https://github.com/vllm-project/semantic-router).

---

## Future Trends: Cost-Effective, Just-in-Time Inference

The central industry question has shifted from *"Can the model reason?"* to *"When and how should it reason?"*

* **GPT-5**: Uses automatic routing and thinking quotas to align computation with commercial value, enabling monetization.
* **vLLM Semantic Router**: Brings semantic routing to the open-source vLLM engine, enabling low-latency, energy-efficient inference scheduling.

The new competitive focus will be less about model scale and more about:

* **Performing inference at the right moment, at the lowest cost.**
* **Switching between fast and slow reasoning with precision.**
* **Preserving user experience without wasting compute.**

The next frontier is **intelligent, self-adjusting inference mechanisms**: systems that autonomously determine when to "think deeply" and when to respond directly, without explicit user toggles or hardcoded rules.

---

## One-Sentence Summary

* **GPT-5**: Business-driven routing → broad intelligence.
* **vLLM Semantic Router**: Efficiency-driven routing → sustainable AI.
* **Future edge**: Performing the right inference at the right time, with minimal computation.