# Simplify model deployment and auxiliary utilities: benchmarking

## Problems:
1. Missing UX around model deployment and auxiliary utilities (benchmarking)

- Use case 1: I want a simple quickstart reference to deploy a model and optionally run auxiliary utilities such as benchmarking, the inference gateway, model express, etc.

- Use case 2: I want to specify many configs for the trtllm backend without listing every argument in the k8s CR.
Listing all arguments in the k8s CR is not manageable for the trtllm backend, which has many configs.

- Use case 3: I want to deploy and reproduce `perf benchmarks` for a specific model.
This is hard to do today because of the tight coupling between the dynamo namespace, the SLA profiler code, the k8s CR, and the backend config.

## Objective:
- decouple config from the framework image: this simplifies model deployment and benchmarking
- easy quickstart for users: a reference quickstart to deploy a model with benchmarking in minimal steps
- well-tested recipes: deploy and tune a small set of models to generate known-good configs for benchmarking
- composable helm charts: use helmfile to deploy all the components in a composable way

## Principles:
- Use the k8s CRD DynamoGraphDeployment as the base
- Decouple the image, helm chart, and recipes
- Provide a quickstart helm chart and/or enhance the dynamo operator to deploy and benchmark

# Design

## High-level data model for deployment:

For example, the YAML below can be used to deploy a Qwen model with the vLLM backend, disaggregated mode, and KV routing in the `qwen-test` dynamo namespace.

Additionally, vLLM parameters are passed to the containers via configmap refs (`my-model-decode-config` and `my-model-prefill-config`).

```yaml
dynamoNamespace: qwen-test

model:
  # below parameters are used to generate the k8s DynamoGraphDeployment CR
  name: Qwen/Qwen-0.6b
  backend:
    # name of the backend: vllm, trtllm, sglang
    type: vllm
    image: dynamo-vllm:0.1.0
  deployment:
    # create the deployment if it does not exist
    create: true
    decode: # this is the name of the component
      extraConfigs:
        - name: model-config # name of the config
          # this path will be passed to the backend component as
          # an environment variable (default: `DYNAMO_EXTRA_CONFIG`) in the container
          env: DYNAMO_EXTRA_CONFIG
          # this is where the model config is available in the container
          # (default path: /opt/dynamo/model/config.yaml)
          path: /opt/dynamo/model/config.yaml
          # this is the configmap that contains the config
          configMapRef:
            name: my-model-decode-config
    prefill: # this is the name of the component
      extraConfigs:
        - name: prefill-config
          configMapRef:
            name: my-model-prefill-config
  ### Deployment mode ###
  # enables disaggregation
  disaggregation: true
  # routing policy: none, kv, random, round-robin
  routingPolicy: kv

# enables running the benchmark job
benchmark:
  enabled: true
  config:
    # benchmark configs

observability:
  enabled: true
```

This high-level data model can be used:
1. as a `values.yaml` for a parent helm chart that deploys a model and auxiliary utilities such as the benchmark job, the inference gateway, etc.
2. in the next phase, absorbed into the k8s DynamoGraphDeployment CR and consumed by the operator

A high-level (parent) helm chart can take the above data model as input and render the k8s DynamoGraphDeployment CR together with auxiliary utilities such as the benchmark job, the inference gateway, etc.
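
For illustration, the rendered output could look roughly like the fragment below; the apiVersion, service names, and field layout are assumptions rather than the operator's actual schema.

```yaml
# Hypothetical fragment of the CR a parent chart might render from the data model above.
# apiVersion, kind, and field names are assumptions for illustration only.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen-test-disagg
spec:
  services:
    decode:
      extraPodSpec:
        mainContainer:
          image: dynamo-vllm:0.1.0
          env:
            - name: DYNAMO_EXTRA_CONFIG
              value: /opt/dynamo/model/config.yaml
    prefill:
      extraPodSpec:
        mainContainer:
          image: dynamo-vllm:0.1.0
```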

### Passing configs to the container

Each component can reference a k8s configmap, which is mounted into the container at a given path; that path is exposed to the backend through the `DYNAMO_EXTRA_CONFIG` environment variable.

```yaml
extraConfig:
  # this is where the extra config is available in the container
  # this path will be passed to the backend component as
  # an environment variable `DYNAMO_EXTRA_CONFIG` in the container
  # default path is /opt/dynamo/model/config.yaml
  path: /opt/dynamo/model/config.yaml
  # this is the configmap that contains the model config
  # this will be mounted to the container as a volume
  # at the path specified in the path field above
  configMapRef:
    name: my-model-prefill-config
```

Configmap example (ConfigMap `data` values must be strings, so the vLLM flags are stored as a `config.yaml` file that gets mounted into the container):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-model-prefill-config
data:
  # example vllm config, mounted into the container as config.yaml
  config.yaml: |
    is-prefill-worker: true
    data-parallel-size: 2
    enable-kv-routing: true
    max-model-length: 10240
    gpu-memory-utilization: 0.8
    enforce-eager: true
```
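
For reference, the wiring a parent chart could render into the pod spec looks roughly like the fragment below; this is an illustrative k8s snippet, not the operator's actual output.

```yaml
# Illustrative pod spec fragment: the configmap is mounted as a file
# and its path is exported via DYNAMO_EXTRA_CONFIG.
containers:
  - name: prefill-worker
    image: dynamo-vllm:0.1.0
    env:
      - name: DYNAMO_EXTRA_CONFIG
        value: /opt/dynamo/model/config.yaml
    volumeMounts:
      - name: extra-config
        mountPath: /opt/dynamo/model
        readOnly: true
volumes:
  - name: extra-config
    configMap:
      name: my-model-prefill-config
```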

This loosely defines how configs are passed to the components.

The backend component will:
- read the `DYNAMO_EXTRA_CONFIG` environment variable
- read the config file during initialization
- update its arguments based on the config file
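
As a minimal sketch of that flow (assuming a Python worker entrypoint), the backend could merge the file's flags into its CLI arguments; the helper name `apply_extra_config` and the precedence rule are illustrative assumptions, not the actual backend implementation.

```python
import os
import sys

import yaml  # PyYAML, assumed available in the backend image


def apply_extra_config(argv):
    """Merge flags from the file referenced by DYNAMO_EXTRA_CONFIG into argv.

    Flags already present on the command line win over the config file.
    """
    config_path = os.environ.get("DYNAMO_EXTRA_CONFIG")
    if not config_path or not os.path.exists(config_path):
        return list(argv)

    with open(config_path) as f:
        extra = yaml.safe_load(f) or {}

    merged = list(argv)
    for key, value in extra.items():
        flag = f"--{key}"
        if flag in merged:
            continue  # explicit CLI args take precedence
        if isinstance(value, bool):
            if value:
                merged.append(flag)  # boolean flags treated as store_true style
        else:
            merged.extend([flag, str(value)])
    return merged


if __name__ == "__main__":
    # e.g. python worker.py --model Qwen/Qwen-0.6b
    print(apply_extra_config(sys.argv[1:]))
```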


## Alternative 1: Composable helm charts
We can leverage [Helmfile](https://github.com/helmfile/helmfile?tab=readme-ov-file#getting-started) to compose helm charts for different dynamo functionalities.

A single helmfile can be used to deploy the components below in a composable way (a sketch follows the list); a reference is the [wide-ep-lws example in llm-d](https://github.com/llm-d-incubation/llm-d-infra/tree/main/quickstart/examples/wide-ep-lws).

Helm charts:
- [Dynamo cloud platform](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud/helm) is the base helm chart and deploys the operator for managing the life cycle of the graph deployment, grove integration, etc.
  - current state: independent [Dynamo cloud platform](https://github.com/ai-dynamo/dynamo/tree/main/deploy/cloud/helm) helm chart
- Dynamo Inference Gateway helm chart
  - current state: independent [Dynamo Inference Gateway helm chart](https://github.com/ai-dynamo/dynamo/blob/f7e468c7e8ff0d1426db987564e60572167e8464/deploy/inference-gateway/helm/dynamo-gaie/values.yaml#L27)
- Metrics
  - current state: we don't have a helm chart but we have a few YAML files with env variables [Metrics](https://github.com/ai-dynamo/dynamo/tree/main/deploy/metrics/k8s)
- Benchmark
  - current state: a few hard-coded jobs in the `benchmark` [folder](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler/deploy)
- Model Express
  - current state: hard-coded YAML for a single config in agg mode [model-express](https://github.com/ai-dynamo/modelexpress/pull/31/files)
- Fault injection/Test
  - Doesn't exist
- Troubleshooting
  - Doesn't exist
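
A minimal helmfile sketch composing some of these charts could look like the following; the release names, chart paths, and values files are hypothetical placeholders rather than existing artifacts.

```yaml
# helmfile.yaml (hypothetical layout)
releases:
  # base platform: operator + CRDs
  - name: dynamo-platform
    namespace: dynamo-system
    chart: ./deploy/cloud/helm/platform   # assumed local chart path
    values:
      - platform-values.yaml

  # model deployment rendered from the high-level data model
  - name: qwen-test
    namespace: qwen-test
    chart: ./deploy/helm/dynamo-model     # hypothetical parent chart
    needs:
      - dynamo-system/dynamo-platform
    values:
      - qwen-values.yaml

  # optional benchmark job, enabled per environment
  - name: qwen-benchmark
    namespace: qwen-test
    chart: ./deploy/helm/dynamo-benchmark # hypothetical chart
    needs:
      - qwen-test/qwen-test
    values:
      - benchmark-values.yaml
```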


### Other alternative considerations
1. Use environment variables (current approach)
   - We are reinventing template rendering with `envsubst` in a non-sustainable way
   - This is not an ideal, extensible quickstart for users

2. Use `kustomize` to render the template
   - This is a sane way to customize and is better than env vars + `envsubst`
   - The input can be derived from helm charts (use `helm template` to get the k8s YAML)
   - Helmfile supports `kustomize` to render the template. [Reference](https://helmfile.readthedocs.io/en/latest/advanced-features/#deploy-kustomizations-with-helmfile)
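
As an illustration of option 2, a kustomization could layer environment-specific patches on top of `helm template` output; the file and resource names below are hypothetical.

```yaml
# kustomization.yaml (hypothetical): customize helm-rendered manifests
resources:
  # output of: helm template qwen ./deploy/helm/dynamo-model -f qwen-values.yaml
  - rendered/dynamo-graph-deployment.yaml
patches:
  # environment-specific tweak applied on top of the rendered CR
  - path: patches/gpu-memory-utilization.yaml
    target:
      kind: DynamoGraphDeployment
      name: qwen-test-disagg
```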


## Phase 1: Quickly iterate on helm chart and publish for public usage/feedback

Based on the inputs in `values.yaml`, the helm chart renders the appropriate k8s DynamoGraphDeployment CR and Service fragments.

A similar reference structure is already used in the [Dynamo Inference Gateway helm chart](https://github.com/ai-dynamo/dynamo/blob/f7e468c7e8ff0d1426db987564e60572167e8464/deploy/inference-gateway/helm/dynamo-gaie/values.yaml#L27).
```yaml
# This is the Dynamo namespace where the dynamo model is deployed
dynamoNamespace: "vllm-agg"

# This is the port on which the model is exposed
model:
  # This is the model name that will be used to route traffic to the dynamo model
  identifier: "Qwen/Qwen3-0.6B"
  # This is the short name of the model that will be used to generate the resource names
  # for example, if the model name is Qwen/Qwen3-0.6B, then the modelShortName should be qwen
  shortName: "qwen"
  # Criticality level for the inference model
  criticality: "Critical"

inferencePool:
  ...

# HTTPRoute configuration
httpRoute:
  enabled: true
  ...

extension:
  # the GAIE extension
  image: ...
```
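
To make the rendering concrete, the fragments below sketch what such a chart might emit for the gateway path; the API group/version and field names follow the upstream Gateway API Inference Extension and are assumptions here, not the chart's verified output.

```yaml
# Hypothetical rendered fragment (API versions and fields may differ
# from the actual dynamo-gaie chart output).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: qwen-inference-pool
spec:
  targetPortNumber: 8000
  selector:
    app: qwen-decode-worker
  extensionRef:
    name: qwen-epp
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: qwen-inference-pool
```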


## Phase 2: Stabilize and update operator

Incorporate the logic into the operator after the UX above has been validated and finalized through quick iteration.
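
As a rough illustration of that direction, the high-level fields could eventually be absorbed into the CR itself; the `extraConfigs` and `benchmark` fields below are speculative and not part of the current CRD.

```yaml
# Hypothetical future DynamoGraphDeployment fragment once the operator absorbs this UX.
# Field names here are speculative, for discussion only.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: qwen-test
spec:
  services:
    decode:
      extraConfigs:
        - name: model-config
          env: DYNAMO_EXTRA_CONFIG
          path: /opt/dynamo/model/config.yaml
          configMapRef:
            name: my-model-decode-config
  benchmark:
    enabled: true
```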