6 changes: 6 additions & 0 deletions docs/source/getting_started/quickstart.rst
@@ -46,6 +46,12 @@ Ensure that:
1. The `Service` name matches the `model.aibrix.ai/name` label value in the `Deployment`.
2. The `--served-model-name` argument value in the `Deployment` command is also consistent with the `Service` name and `model.aibrix.ai/name` label.
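
For example, a minimal consistent pair looks like the following sketch — the model name ``my-model``, the image, and the argument list are placeholders for illustration, not values from this project:

.. code-block:: yaml

   apiVersion: v1
   kind: Service
   metadata:
     name: my-model                    # matches the label below (rule 1)
   spec:
     selector:
       model.aibrix.ai/name: my-model
     ports:
       - port: 8000
         targetPort: 8000
   ---
   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: my-model
     labels:
       model.aibrix.ai/name: my-model  # matches the Service name (rule 1)
   spec:
     replicas: 1
     selector:
       matchLabels:
         model.aibrix.ai/name: my-model
     template:
       metadata:
         labels:
           model.aibrix.ai/name: my-model
       spec:
         containers:
           - name: vllm
             image: vllm/vllm-openai:latest            # placeholder image
             args: ["--served-model-name", "my-model"]  # consistent with the Service name (rule 2)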

Deploy PD (Prefill-Decode) disaggregation model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





Invoke the model endpoint using gateway API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
15 changes: 15 additions & 0 deletions test/regression/v0.4.0/dynamo/README.md
@@ -0,0 +1,15 @@
# Dynamo Installation Instruction


We follow the instructions in [dynamo](https://github.com/ai-dynamo/dynamo) to deploy Dynamo Cloud in Kubernetes. The detailed instructions can be found in Section `1. Installing Dynamo Cloud from Published Artifacts` of dynamo's [quickstart guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/guides/dynamo_deploy/quickstart.md). We use the most recent release images (version: 0.3.2) published by the Dynamo team.
> **Reviewer comment (medium):** There's a typo in "instrunction". It should be "instruction".



### Model Deployment

We use the sample deployment YAMLs from the dynamo repo at the v0.3.2 release for PD disaggregation testing: https://github.com/ai-dynamo/dynamo/blob/v0.3.2/examples/llm/deploy/agg.yaml and https://github.com/ai-dynamo/dynamo/blob/v0.3.2/examples/llm/deploy/agg-router.yaml.
> **Reviewer comment (medium):** There's a typo in "disaggration". It should be "disaggregation".



> Note: There are some configuration changes for image downloading and model downloading due to differences in the testing environment.

> 1. We download container images from the VKE docker registry aibrix-cn-beijing.cr.volces.com. The images are synced from dockerhub and nvidia ngc.
> 2. We download models from VKE object storage; they are synced from the Huggingface model hub.
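
The image rewrite described in note 1 looks roughly like the following — the upstream `nvcr.io` path is an assumption for illustration; only the VKE path below appears in the manifests in this directory:

```yaml
# upstream image reference (dockerhub / nvidia ngc), assumed for illustration:
#   image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.3.2
# rewritten to the synced VKE registry used in this test environment:
image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
```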
146 changes: 146 additions & 0 deletions test/regression/v0.4.0/dynamo/disagg.yaml
@@ -0,0 +1,146 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-disagg
spec:
  envs:
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common":{"model":"models/Qwen3-8B","block-size":64,"max-model-len":16384,"kv-transfer-config":"{\"kv_connector\":\"DynamoNixlConnector\"}"},"Frontend":{"served_model_name":"Qwen3-8B","endpoint":"dynamo.Processor.chat/completions","port":8000},"Processor":{"router":"round-robin","common-configs":["model","block-size"]},"VllmWorker":{"remote-prefill":true,"conditional-disagg":true,"max-local-prefill-length":10,"max-prefill-queue-size":2,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"PrefillWorker":{"max-num-batched-tokens":16384,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"Planner":{"environment":"kubernetes","no-operation":true}}'
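      # For readability, the single-line JSON above corresponds to the following
      # structure (shown as a comment for illustration only; the deployment
      # consumes the single-line form):
      #
      #   Common:
      #     model: models/Qwen3-8B
      #     block-size: 64
      #     max-model-len: 16384
      #     kv-transfer-config: '{"kv_connector": "DynamoNixlConnector"}'
      #   Frontend:
      #     served_model_name: Qwen3-8B
      #     endpoint: dynamo.Processor.chat/completions
      #     port: 8000
      #   Processor:
      #     router: round-robin
      #     common-configs: [model, block-size]
      #   VllmWorker:
      #     remote-prefill: true
      #     conditional-disagg: true
      #     max-local-prefill-length: 10
      #     max-prefill-queue-size: 2
      #     ServiceArgs: {workers: 1, resources: {gpu: "1"}}
      #     common-configs: [model, block-size, max-model-len, kv-transfer-config]
      #   PrefillWorker:
      #     max-num-batched-tokens: 16384
      #     ServiceArgs: {workers: 1, resources: {gpu: "1"}}
      #     common-configs: [model, block-size, max-model-len, kv-transfer-config]
      #   Planner:
      #     environment: kubernetes
      #     no-operation: true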
  services:
    Frontend:
      dynamoNamespace: llm-disagg
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          args:
            - dynamo
            - serve
            - graphs.disagg:Frontend
            - --system-app-port
            - "5000"
            - --enable-system-app
            - --use-default-health-checks
            - --service-name
            - Frontend
    Processor:
      dynamoNamespace: llm-disagg
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
> **Reviewer comment (critical):** Storing secret placeholders like `<YOUR_ACCESS_KEY_ID>` and `<YOUR_SECRET_ACCESS_KEY>` directly in the YAML is a major security risk, as real credentials could be accidentally committed. These should be managed using Kubernetes Secrets and injected into the container as environment variables. For example, you can define environment variables in your container spec that pull from a secret:

    env:
      - name: YOUR_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: my-tos-secret
            key: accessKeyId
      - name: YOUR_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: my-tos-secret
            key: secretAccessKey

> The script line then becomes:

    ./tosutil config -i $YOUR_ACCESS_KEY_ID -k $YOUR_SECRET_ACCESS_KEY -e tos-cn-beijing.ivolces.com -re cn-beijing

> This issue is repeated for the VllmWorker and PrefillWorker services in this file, and in other files in this PR.
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32
              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:Processor --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name Processor
    VllmWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:VllmWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name VllmWorker
    PrefillWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        # nodeSelector:
        #   machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:PrefillWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name PrefillWorker
179 changes: 179 additions & 0 deletions test/regression/v0.4.0/dynamo/disagg_router.yaml
@@ -0,0 +1,179 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: disagg-router
spec:
  envs:
    - name: DYN_DEPLOYMENT_CONFIG
      value: '{"Common":{"model":"models/Qwen3-8B","block-size":64,"max-model-len":16384,"router":"kv","kv-transfer-config":"{\"kv_connector\":\"DynamoNixlConnector\"}"},"Frontend":{"served_model_name":"Qwen3-8B","endpoint":"dynamo.Processor.chat/completions","port":8000},"Processor":{"common-configs":["model","block-size","max-model-len","router"]},"Router":{"min-workers":1,"common-configs":["model","block-size","router"]},"VllmWorker":{"max-num-batched-tokens":16384,"remote-prefill":true,"conditional-disagg":true,"max-local-prefill-length":10,"max-prefill-queue-size":2,"tensor-parallel-size":1,"enable-prefix-caching":true,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","router","kv-transfer-config"]},"PrefillWorker":{"max-num-batched-tokens":16384,"ServiceArgs":{"workers":1,"resources":{"gpu":"1"}},"common-configs":["model","block-size","max-model-len","kv-transfer-config"]},"Planner":{"environment":"kubernetes","no-operation":true}}'
  services:
    Frontend:
      dynamoNamespace: llm-disagg-router
      componentType: main
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          args:
            - dynamo
            - serve
            - graphs.disagg_router:Frontend
            - --system-app-port
            - "5000"
            - --enable-system-app
            - --use-default-health-checks
            - --service-name
            - Frontend
    Processor:
      dynamoNamespace: llm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          node.kubernetes.io/instance-type: ecs.g3il.xlarge
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg_router:Processor --system-app-port "5000" --enable-system-app --use-default-health-checks --service-name Processor
    Router:
      dynamoNamespace: llm-disagg-router
      componentType: worker
      replicas: 1
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "1"
          memory: "2Gi"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              # apt update && apt install wget -y
              # wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              # chmod +x tosutil
              # ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              # ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              # echo "model downloaded, start serving"
> **Reviewer comment (medium), on lines +104 to +110:** This block of code is commented out. If it's not needed, it should be removed to improve readability and reduce clutter.

              dynamo serve graphs.disagg_router:Router --system-app-port "5000" --enable-system-app --use-default-health-checks --service-name Router
    VllmWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg-router
      replicas: 1
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg_router:VllmWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name VllmWorker
    PrefillWorker:
      # envFromSecret: hf-token-secret
      dynamoNamespace: llm-disagg-router
      replicas: 2
      resources:
        requests:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
        limits:
          cpu: "10"
          memory: "20Gi"
          gpu: "1"
      extraPodSpec:
        nodeSelector:
          machine.cluster.vke.volcengine.com/gpu-name: NVIDIA-L20
        mainContainer:
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/ai-dynamo/vllm-runtime:0.3.2
          workingDir: /workspace/examples/llm
          command:
            - /bin/sh
            - -c
            - |
              apt update && apt install wget -y
              wget https://tos-tools.tos-cn-beijing.volces.com/linux/amd64/tosutil
              chmod +x tosutil
              ./tosutil config -i <YOUR_ACCESS_KEY_ID> -k <YOUR_SECRET_ACCESS_KEY> -e tos-cn-beijing.ivolces.com -re cn-beijing
              ./tosutil cp tos://aibrix-artifact-testing/models/Qwen3-8B ./models -r -p 8 -j 32

              echo "model downloaded, start serving"

              dynamo serve graphs.disagg:PrefillWorker --system-app-port 5000 --enable-system-app --use-default-health-checks --service-name PrefillWorker
14 changes: 14 additions & 0 deletions test/regression/v0.4.0/sglang/235b-service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: qwen3-235b-service
  namespace: default
spec:
  selector:
    model.aibrix.ai/name: qwen3-235b
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 30010
  type: NodePort
14 changes: 14 additions & 0 deletions test/regression/v0.4.0/sglang/32b-service.yaml
@@ -0,0 +1,14 @@
apiVersion: v1
kind: Service
metadata:
  name: qwen3-32b-service
  namespace: default
spec:
  selector:
    model.aibrix.ai/name: qwen3-32b
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
      nodePort: 30009
  type: NodePort