6 changes: 6 additions & 0 deletions docs/source/getting_started/quickstart.rst
@@ -46,6 +46,12 @@ Ensure that:
1. The `Service` name matches the `model.aibrix.ai/name` label value in the `Deployment`.
2. The `--served-model-name` argument value in the `Deployment` command is also consistent with the `Service` name and `model.aibrix.ai/name` label.
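
For example, a minimal consistent pair looks like the following sketch — the model name ``my-model``, the image, and the argument list are placeholders for illustration, not values from this project:

.. code-block:: yaml

   apiVersion: v1
   kind: Service
   metadata:
     name: my-model                    # matches the label below (rule 1)
   spec:
     selector:
       model.aibrix.ai/name: my-model
     ports:
       - port: 8000
         targetPort: 8000
   ---
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: my-model
     labels:
       model.aibrix.ai/name: my-model  # matches the Service name (rule 1)
   spec:
     replicas: 1
     selector:
       matchLabels:
         model.aibrix.ai/name: my-model
     template:
       metadata:
         labels:
           model.aibrix.ai/name: my-model
       spec:
         containers:
           - name: vllm
             image: vllm/vllm-openai:latest            # placeholder image
             args: ["--served-model-name", "my-model"]  # consistent with the Service name (rule 2)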

Deploy PD (Prefill-Decode) disaggregation model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





Invoke the model endpoint using gateway API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15 changes: 15 additions & 0 deletions test/regression/v0.4.0/dynamo/README.md
@@ -0,0 +1,15 @@
# Dynamo Installation Instruction


We follow the instructions in [dynamo](https://github.com/ai-dynamo/dynamo) to deploy Dynamo Cloud in Kubernetes. The detailed instructions can be found in Section `1. Installing Dynamo Cloud from Published Artifacts` of dynamo's [quickstart guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/quickstart.md). We use the most recent release images (version: 0.3.2) published by the Dynamo team.
> **Reviewer comment (medium):** There's a typo in "instrunction". It should be "instruction".



### Model Deployment

We use the sample deployment YAMLs from the dynamo repo at the v0.3.2 release for PD disaggregation testing: https://github.com/ai-dynamo/dynamo/blob/v0.3.2/examples/llm/deploy/agg.yaml and https://github.com/ai-dynamo/dynamo/blob/v0.3.2/examples/llm/deploy/agg-router.yaml.
> **Reviewer comment (medium):** There's a typo in "disaggration". It should be "disaggregation".



> Note: There are some configuration changes for image downloading and model downloading due to differences in the testing environment.

> 1. We download container images from the VKE docker registry aibrix-cn-beijing.cr.volces.com. The images are synced from dockerhub and nvidia ngc.
> 2. We download models from VKE object storage; they are synced from the Huggingface model hub.
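
The image rewrite described in note 1 looks roughly like the following — the upstream `nvcr.io` path is an assumption for illustration; only the VKE path below appears in the manifests in this directory:

```yaml
# upstream image reference (dockerhub / nvidia ngc), assumed for illustration:
#   image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.2
# rewritten to the synced VKE registry used in this test environment:
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
```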
146 changes: 146 additions & 0 deletions test/regression/v0.4.0/dynamo/disagg.yaml
@@ -0,0 +1,146 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-disagg
spec:
  envs:
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common":{"model":"models/Qwen3-8B","block-size":64,"max-model-len":16384,"kv-transfer-config":"{\"kv_connector\":\"DynamoNixlConnector\"}"},"Frontend":{"served_model_name":"Qwen3-8B","endpoint":"dynamo.Processor.chat/completions","port":8000},"Processor":{"router":"round-robin","common-configs":["model","block-size"]},"VllmWorker":{"remote-prefill":true,"conditional-disagg":true,"max-local-prefill-length":10,"max-prefill-queue-size":2,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"PrefillWorker":{"max-num-batched-tokens":16384,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"Planner":{"environment":"kubernetes","no-operation":true}}'
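      # For readability, the single-line JSON above corresponds to the following
      # structure (shown as a comment for illustration only; the deployment
      # consumes the single-line form):
      #
      #   Common:
      #     model: models/Qwen3-8B
      #     block-size: 64
      #     max-model-len: 16384
      #     kv-transfer-config: '{"kv_connector": "DynamoNixlConnector"}'
      #   Frontend:
      #     served_model_name: Qwen3-8B
      #     endpoint: dynamo.Processor.chat/completions
      #     port: 8000
      #   Processor:
      #     router: round-robin
      #     common-configs: [model, block-size]
      #   VllmWorker:
      #     remote-prefill: true
      #     conditional-disagg: true
      #     max-local-prefill-length: 10
      #     max-prefill-queue-size: 2
      #     ServiceArgs: {workers: 1, resources: {gpu: "1"}}
      #     common-configs: [model, block-size, max-model-len, kv-transfer-config]
      #   PrefillWorker:
      #     max-num-batched-tokens: 16384
      #     ServiceArgs: {workers: 1, resources: {gpu: "1"}}
      #     common-configs: [model, block-size, max-model-len, kv-transfer-config]
      #   Planner:
      #     environment: kubernetes
      #     no-operation: true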
  services:
    Frontend:
      dynamoNamespace: llm-disagg
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          args:
            - dynamo
            - serve
            - graphs.disagg:Frontend
            - --system-app-port
            - "5000"
            - --enable-system-app
            - --use-default-health-checks
            - --service-name
            - Frontend
    Processor:
      dynamoNamespace: llm-disagg
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
> **Reviewer comment (critical):** Storing secret placeholders like `<YOUR_ACCESS_KEY_ID>` and `<YOUR_SECRET_ACCESS_KEY>` directly in the YAML is a major security risk, as real credentials could be accidentally committed. These should be managed using Kubernetes Secrets and injected into the container as environment variables. For example, you can define environment variables in your container spec that pull from a secret:

    env:
      - name: YOUR_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: my-tos-secret
            key: accessKeyId
      - name: YOUR_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: my-tos-secret
            key: secretAccessKey

> The script line then becomes:

    ./tosutil config -i $YOUR_ACCESS_KEY_ID -k $YOUR_SECRET_ACCESS_KEY -e tos-cn-beijing.ivolces.com -re cn-beijing

> This issue is repeated for the VllmWorker and PrefillWorker services in this file, and in other files in this PR.
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32
              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:Processor --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name Processor
    VllmWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:VllmWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name VllmWorker
    PrefillWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:PrefillWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name PrefillWorker
179 changes: 179 additions & 0 deletions test/regression/v0.4.0/dynamo/disagg_router.yaml
@@ -0,0 +1,179 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: disagg-router
spec:
  envs:
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common":{"model":"models/Qwen3-8B","block-size":64,"max-model-len":16384,"router":"kv","kv-transfer-config":"{\"kv_connector\":\"DynamoNixlConnector\"}"},"Frontend":{"served_model_name":"Qwen3-8B","endpoint":"dynamo.Processor.chat/completions","port":8000},"Processor":{"common-configs":["model","block-size","max-model-len","router"]},"Router":{"min-workers":1,"common-configs":["model","block-size","router"]},"VllmWorker":{"max-num-batched-tokens":16384,"remote-prefill":true,"conditional-disagg":true,"max-local-prefill-length":10,"max-prefill-queue-size":2,"tensor-parallel-size":1,"enable-prefix-caching":true,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","router","kv-transfer-config"]},"PrefillWorker":{"max-num-batched-tokens":16384,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"Planner":{"environment":"kubernetes","no-operation":true}}'
  services:
    Frontend:
      dynamoNamespace: llm-disagg-router
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          args:
            - dynamo
            - serve
            - graphs.disagg_router:Frontend
            - --system-app-port
            - "5000"
            - --enable-system-app
            - --use-default-health-checks
            - --service-name
            - Frontend
    Processor:
      dynamoNamespace: llm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: ecs.g3il.xlarge
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg_router:Processor --system-app-port "5000" --enable-system-app --use-default-health-checks --service-name Processor
    Router:
      dynamoNamespace: llm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              # apt update && apt install wget -y
              # wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              # chmod +x tosutil
              # ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              # ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              # echo "model downloaded, start serving"
> **Reviewer comment (medium), on lines +104 to +110:** This block of code is commented out. If it's not needed, it should be removed to improve readability and reduce clutter.

              dynamo serve graphs.disagg_router:Router --system-app-port "5000" --enable-system-app --use-default-health-checks --service-name Router
    VllmWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg-router
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg_router:VllmWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name VllmWorker
    PrefillWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg-router
      replicas: 2
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:PrefillWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name PrefillWorker
14 changes: 14 additions & 0 deletions test/regression/v0.4.0/sglang/235b-service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: qwen3-235b-service
  namespace: default
spec:
  selector:
    model.aibrix.ai/name: qwen3-235b
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 30010
  type: NodePort
14 changes: 14 additions & 0 deletions test/regression/v0.4.0/sglang/32b-service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: qwen3-32b-service
  namespace: default
spec:
  selector:
    model.aibrix.ai/name: qwen3-32b
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 30009
  type: NodePort