# Prefill->Decode Disaggregated Workflow for TensorRT-LLM

**Status**: Approved

**Authors**: Tanmay Verma

**Category**: Architecture

**Replaces**: N/A

**Replaced By**: N/A

**Sponsor**: @richardhuo-nv @nnshah1 @nvrohanv

**Required Reviewers**: @richardhuo-nv @nnshah1

**Review Date**: 07/15/2025

**Pull Request**: [#22](https://github.com/ai-dynamo/enhancements/pull/22)

**Implementation PR / Tracking Issue**: [#1884](https://github.com/ai-dynamo/dynamo/pull/1884)

# Summary

Currently, the disaggregated workflow examples require requests to first hit the decode worker and then the prefill worker. This DEP proposes an alternative approach that allows users to perform prefill operations on the first worker followed by decode operations on the second worker, providing more flexibility in workflow orchestration.

# Motivation

Dynamo users have expressed strong interest in gaining control over request flow patterns. The current TensorRT-LLM disaggregated workflow routes requests first to a TensorRT-LLM worker, which then forwards them to a prefill worker for remote prefill execution. This design constraint stems from the limitation that only one worker per model can interact with the KV router (i.e., call `register_llm`), so KV routing can target either prefill workers or decode workers depending on which receives the request first. In the existing implementation, users are restricted to KV routing on decode workers only. Providing the flexibility to choose which worker type handles the initial request would address a critical user requirement for workflow customization.

In the current implementation, the TensorRT-LLM prefill worker transfers all KV cache blocks to the decode workers, regardless of whether the decode workers already have a complete KV cache match. This means that even with a 100% KV cache hit on the decode side, the prefill worker still sends all blocks, resulting in redundant data transfer. By routing requests to the prefill worker first when using KV routing, we can improve efficiency and avoid unnecessary work during the prefill stage.

## Goals

### Short-Term Goals

* Allow users to swap the order of prefill and decode workers in TensorRT-LLM disaggregated workflows.

* Maintain feature parity between the two orchestration methods (prefill-first and decode-first workflows).

### Long-Term Goal

* Converge on a single, unified disaggregated workflow rather than maintaining two separate orchestration paths.

## Requirements

### REQ 1: Option to control prefill-first or decode-first ordering
Users **MUST** be able to control whether a request flows Prefill->Decode or Decode->Prefill.

### REQ 2: Both workflows must be properly tested
Both workflow orderings **SHOULD** be covered by tests.

# Proposal

In the current disaggregated workflow, a request first reaches the decode worker, which forwards it to the prefill worker for remote prefill before running decode locally.

To unblock users in the short term, we propose adding a new option (`--disaggregation-strategy={prefill_first, decode_first}`) to the worker script. The two workers are:
- First-stage worker
- Second-stage worker

Based on this CLI option, the first-stage worker decides whether to run prefill locally and then push the request to the second stage for decode, or to forward the request to the second stage for remote prefill and run decode locally. A sketch of how the option could be wired up follows this paragraph.
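The worker-script wiring is not spelled out in this DEP, so the snippet below is only a minimal sketch of how the new flag could map onto the `DisaggregationStrategy` enum used in the pseudocode; the argparse plumbing and the `decode_first` default are assumptions, not the actual implementation.

```python
# Illustrative sketch only; the real worker script may parse and propagate this differently.
import argparse
from enum import Enum


class DisaggregationStrategy(str, Enum):
    PREFILL_FIRST = "prefill_first"
    DECODE_FIRST = "decode_first"


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="TensorRT-LLM disaggregated worker (sketch)")
    parser.add_argument(
        "--disaggregation-strategy",
        type=DisaggregationStrategy,
        choices=list(DisaggregationStrategy),
        # Assumed default: preserve the existing decode-first behavior.
        default=DisaggregationStrategy.DECODE_FIRST,
        help="prefill_first: first stage runs prefill locally and pushes to decode; "
        "decode_first: first stage forwards to remote prefill and decodes locally.",
    )
    return parser.parse_args()
```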

This should address the requirement in the short term.

## Pseudocode

### Prefill Worker
```python
    # Pseudocode for the prefill worker handler. `generate_locally`, `remote_decode`,
    # and `check_error` are handler methods, and `copy` is imported at module scope.
    async def generate(self, request: dict):
        # Run prefill locally; exactly one response is expected.
        prefill_request = copy.deepcopy(request)
        prefill_response = None
        response_count = 0
        async for res in self.generate_locally(prefill_request):
            prefill_response = res
            response_count += 1
            if response_count > 1:
                raise ValueError("Prefill response should be generated only once.")

        if (
            self.disaggregation_strategy == DisaggregationStrategy.PREFILL_FIRST
            and not self.check_error(prefill_response)
        ):
            # Under the prefill_first strategy, the prefill handler triggers the
            # decode handler itself, passing along the disaggregated params.
            if prefill_response is not None:
                request["disaggregated_params"] = prefill_response[
                    "disaggregated_params"
                ]
            async for res in self.remote_decode(request):
                yield res
        else:
            # Under decode_first, this handler was invoked remotely by the decode
            # worker; return the prefill response to it.
            yield prefill_response
```
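The pseudocode leaves `remote_decode` abstract. As a rough sketch only: assuming the decode worker sits behind an async client object (`self.decode_client` is a hypothetical name) whose `generate()` call streams responses, forwarding could look like the following; the actual Dynamo client API used in the implementation PR may differ.

```python
    async def remote_decode(self, request: dict):
        # Sketch only: `self.decode_client` is a hypothetical handle to the decode
        # worker endpoint. The request already carries the `disaggregated_params`
        # produced by the local prefill, so the decode worker can skip prefill and
        # stream generated tokens back to the caller.
        async for response in self.decode_client.generate(request):
            yield response
```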

### Decode Worker

```python
    # Pseudocode for the decode worker handler. `remote_prefill`, `generate_locally`,
    # and `check_error` are handler methods; remote responses expose their payload
    # through `.data()`.
    async def generate(self, request: dict):
        if self.disaggregation_strategy == DisaggregationStrategy.DECODE_FIRST:
            prefill_response = None
            # Under the decode_first strategy, the decode handler triggers the
            # prefill handler; exactly one prefill response is expected.
            response_count = 0
            async for res in self.remote_prefill(request):
                prefill_response = res
                response_count += 1
                if response_count > 1:
                    raise ValueError("Prefill response should be generated only once.")

            response_data = (
                prefill_response.data() if prefill_response is not None else None
            )
            if prefill_response is not None and self.check_error(response_data):
                yield response_data
                return
            if prefill_response is not None and response_data is not None:
                request["disaggregated_params"] = response_data["disaggregated_params"]

        # Under prefill_first, the incoming request already carries the
        # `disaggregated_params` attached by the prefill handler, so decode runs
        # directly on it.
        async for res in self.generate_locally(request):
            yield res
```
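To make the two orderings concrete, the sketch below shows which handler a frontend would invoke first under each strategy. The function and variable names are illustrative only, not the actual Dynamo frontend code; in practice the entry worker is also the one registered with the KV router via `register_llm`.

```python
# Illustrative dispatch only; names are assumptions.
async def handle_client_request(
    request: dict,
    strategy: DisaggregationStrategy,
    prefill_worker,
    decode_worker,
):
    if strategy == DisaggregationStrategy.PREFILL_FIRST:
        # Prefill worker runs prefill locally, then pushes the request to the
        # decode worker (see the prefill handler above).
        entry_worker = prefill_worker
    else:
        # Decode worker triggers remote prefill, then decodes locally.
        entry_worker = decode_worker

    async for response in entry_worker.generate(request):
        yield response
```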

## Long-Term

We strongly believe that it should not be the worker's job to decide what to execute and where. The KV router should be smart enough to route requests between prefill and decode workers directly.
The long-term proposal is a work in progress.

# Alternate Solutions

N/A