
Commit bc6c1d5

tanmayv25 and nnshah1 authored
Add short-term strategy to enable P->D workflow in disagg (#22)
Co-authored-by: nnshah1 <[email protected]>
1 parent 633181f commit bc6c1d5

3 files changed

+143
-0
lines changed

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# Prefill->Decode Disaggregated Workflow for TensorRT-LLM

**Status**: Approved

**Authors**: Tanmay Verma

**Category**: Architecture

**Replaces**: N/A

**Replaced By**: N/A

**Sponsor**: @richardhuo-nv @nnshah1 @nvrohanv

**Required Reviewers**: @richardhuo-nv @nnshah1

**Review Date**: 07/15/2025

**Pull Request**: [#22](https://github.com/ai-dynamo/enhancements/pull/22)

**Implementation PR / Tracking Issue**: [#1884](https://github.com/ai-dynamo/dynamo/pull/1884)

# Summary

Currently, the disaggregated workflow examples require each request to hit the decode worker first and the prefill worker second. This DEP proposes an alternative that lets users run prefill on the first worker and decode on the second, giving them more flexibility in workflow orchestration.

# Motivation

Dynamo users have expressed strong interest in controlling request flow patterns. The current TensorRT-LLM disaggregated workflow routes each request first to a TensorRT-LLM decode worker, which then forwards it to a prefill worker for remote prefill execution. This constraint stems from a current limitation: only one worker per model can interact with the KV router (by calling `register_llm`), so KV routing can target either prefill workers or decode workers, depending on which receives the request first. In the existing implementation, users are restricted to KV routing on decode workers only. Letting users choose which worker type receives the initial request would address a critical user requirement for workflow customization.

In the current implementation, the TensorRT-LLM prefill worker transfers all KV cache blocks to the decode worker, regardless of whether the decode worker already holds a complete KV cache match. Even with a 100% KV cache hit on the decode side, the prefill worker still sends every block, resulting in redundant data transfer. Routing requests to the prefill worker first when KV routing is enabled lets us improve efficiency and reduce unnecessary work during the prefill stage.

## Goals

### Short-Term Goal

* Allow users to swap the order of prefill and decode workers in TensorRT-LLM disaggregated workflows.

* Maintain feature parity between both orchestration methods (prefill-first and decode-first workflows).

### Long-Term Goal

* Propose a long-term goal of maintaining a unified disaggregated workflow.

## Requirements

### REQ 1: Option to control prefill-first or decode-first

Users **MUST** be able to control whether a request flows Prefill->Decode or Decode->Prefill.

### REQ 2: Both workflows should be properly tested

Both workflows **SHOULD** be properly tested.

# Proposal

The current disaggregated request workflow looks like this:

![Current Disaggregated Workflow](0002_images/current.png)

To unblock users in the short term, we propose adding an option (`--disaggregation-strategy={prefill_first, decode_first}`) to the worker script. The two workers are:
- First stage worker
- Second stage worker

Based on the CLI option, the first stage worker either runs prefill locally and then pushes the request to the second stage for decode, or forwards the request to the second stage for remote prefill and then runs decode locally.

![Proposed Disaggregated Workflow](0002_images/proposed_design.png)

This should address the requirement in the short term.

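The `--disaggregation-strategy` option described above could be parsed roughly as follows. This is a minimal sketch, not the worker script's actual code: the `parse_strategy` helper, the enum's string values, and the `decode_first` default (chosen here only because it matches the current behavior) are assumptions made for illustration.

```python
import argparse
from enum import Enum


class DisaggregationStrategy(Enum):
    """Which stage receives the request first (string values are assumed)."""

    PREFILL_FIRST = "prefill_first"
    DECODE_FIRST = "decode_first"


def parse_strategy(argv=None) -> DisaggregationStrategy:
    """Parse the strategy flag from argv (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Illustrative worker launcher")
    parser.add_argument(
        "--disaggregation-strategy",
        choices=[s.value for s in DisaggregationStrategy],
        # Assumed default: keep today's decode-first flow unless asked otherwise.
        default=DisaggregationStrategy.DECODE_FIRST.value,
        help="Which worker type handles the request first.",
    )
    args = parser.parse_args(argv)
    # Convert the validated string into the enum member.
    return DisaggregationStrategy(args.disaggregation_strategy)
```

Restricting `choices` to the enum values means an unsupported strategy is rejected at startup rather than when the first request arrives.
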
## Pseudo code

### Prefill Worker

```python
import copy


async def generate(self, request: dict):
    # Run prefill locally on a copy so the original request stays untouched.
    prefill_request = copy.deepcopy(request)
    prefill_response = None
    response_count = 0
    async for res in self.generate_locally(prefill_request):
        prefill_response = res
        response_count += 1
        # Prefill must produce exactly one response.
        if response_count > 1:
            raise ValueError("Prefill response should be generated only once.")

    if (
        self.disaggregation_strategy == DisaggregationStrategy.PREFILL_FIRST
        and not self.check_error(prefill_response)
    ):
        # If operating under prefill_first strategy, the prefill handler needs to trigger
        # the decode handler.
        if prefill_response is not None:
            request["disaggregated_params"] = prefill_response[
                "disaggregated_params"
            ]
        # Stream the decode worker's output back to the caller.
        async for res in self.remote_decode(request):
            yield res
    else:
        # Return response to the decode handler.
        yield prefill_response
```

### Decode Worker

```python
async def generate(self, request: dict):
    if self.disaggregation_strategy == DisaggregationStrategy.DECODE_FIRST:
        prefill_response = None
        # If operating under decode_first strategy, the decode handler needs to trigger
        # the prefill handler.
        response_count = 0
        async for res in self.remote_prefill(request):
            prefill_response = res
            response_count += 1
            # Prefill must produce exactly one response.
            if response_count > 1:
                raise ValueError("Prefill response should be generated only once.")

        response_data = (
            prefill_response.data() if prefill_response is not None else None
        )
        # Surface a prefill error to the caller and stop.
        if prefill_response is not None and self.check_error(response_data):
            yield response_data
            return
        # Carry the prefill hand-off into the local decode request.
        if prefill_response is not None and response_data is not None:
            request["disaggregated_params"] = response_data["disaggregated_params"]

    # Under prefill_first, the request already carries disaggregated_params.
    async for res in self.generate_locally(request):
        yield res
```

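To see how the two handlers fit together, the following self-contained sketch wires stub versions of them in a single process and checks that both strategies produce the same token stream. Everything here is illustrative: the `PrefillWorker` and `DecodeWorker` stubs, the direct peer references that stand in for `remote_decode` and `remote_prefill`, the `prompt_len` payload, and the token format are assumptions, and error handling is omitted.

```python
import asyncio
import copy
from enum import Enum


class DisaggregationStrategy(Enum):
    PREFILL_FIRST = "prefill_first"
    DECODE_FIRST = "decode_first"


class PrefillWorker:
    """Stub prefill handler; `decode_worker` stands in for remote_decode."""

    def __init__(self, strategy):
        self.strategy = strategy
        self.decode_worker = None

    async def generate_locally(self, request):
        # Fake prefill: emit exactly one response carrying the hand-off data.
        yield {"disaggregated_params": {"prompt_len": len(request["prompt"])}}

    async def generate(self, request):
        prefill_response = None
        async for res in self.generate_locally(copy.deepcopy(request)):
            prefill_response = res
        if self.strategy is DisaggregationStrategy.PREFILL_FIRST:
            # Prefill received the request first, so it triggers decode.
            request["disaggregated_params"] = prefill_response["disaggregated_params"]
            async for res in self.decode_worker.generate(request):
                yield res
        else:
            # Decode called us; hand the prefill result back to it.
            yield prefill_response


class DecodeWorker:
    """Stub decode handler; `prefill_worker` stands in for remote_prefill."""

    def __init__(self, strategy):
        self.strategy = strategy
        self.prefill_worker = None

    async def generate(self, request):
        if self.strategy is DisaggregationStrategy.DECODE_FIRST:
            # Decode received the request first, so it triggers prefill.
            async for res in self.prefill_worker.generate(request):
                request["disaggregated_params"] = res["disaggregated_params"]
        # Fake decode: fails with KeyError if the hand-off never happened.
        params = request["disaggregated_params"]
        for i in range(request["max_tokens"]):
            yield f"{params['prompt_len']}:{i}"


async def run(strategy):
    prefill, decode = PrefillWorker(strategy), DecodeWorker(strategy)
    prefill.decode_worker, decode.prefill_worker = decode, prefill
    # The strategy decides which worker is the first stage.
    first = prefill if strategy is DisaggregationStrategy.PREFILL_FIRST else decode
    request = {"prompt": "hello", "max_tokens": 3}
    return [tok async for tok in first.generate(request)]


def compare():
    """Run the same request under both strategies and collect the outputs."""
    return {s: asyncio.run(run(s)) for s in DisaggregationStrategy}
```

Calling `compare()` returns one token list per strategy. Equal lists are the kind of parity check that REQ 2 asks for, here reduced to stubs.
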
## Long-Term

We strongly believe that it should not be the worker's job to determine what to execute and where. The KV router should be smart enough to route requests to prefill and decode workers.
The long-term proposal is a work in progress.

# Alternate Solutions

N/A

deps/0002_images/current.png

50.3 KB
90.7 KB
