From eb135776548021764f1bd734ead4572935bb0edd Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 12:51:07 -0700 Subject: [PATCH 01/19] Add document on ETCD alternatives for Dynamo in K8s Signed-off-by: mohammedabdulwahhab --- etcd-k8s.md | 310 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 310 insertions(+) create mode 100644 etcd-k8s.md diff --git a/etcd-k8s.md b/etcd-k8s.md new file mode 100644 index 0000000..911a85c --- /dev/null +++ b/etcd-k8s.md @@ -0,0 +1,310 @@ +# Alternative Approach to ETCD Dependency in Kubernetes + +## Problem + +Customers are hesitant to stand up and maintain dedicated etcd clusters to deploy Dynamo. ETCD, however, is a hard dependency of the Dynamo Runtime. + +It enables the following functionalities within DRT: + +- Heartbeats/leases +- Component Registry/Service Discovery +- Cleanup on Shutdown +- General Key-value storage for various purposes (KVCache Metadata, Model Metadata, etc.) + +## ETCD and Kubernetes + +Kubernetes stores its own state in ETCD. With a few tradeoffs, we can use native Kubernetes APIs and Resources to achieve the same functionality. Under the hood, these operations are still backed by etcd. This amounts to an alternative implementation of the Dynamo Runtime interface that can run without needing ETCD in K8s environments. + +This document explores a few approaches to achieve this. + +Note: This document is only to explore alternative approaches to eliminating the ETCD dependency. Decoupling from NATS (the transport layer) is a separate concern. + +## DRT Component Registry Primer + +Here is a primer on the core workflow that ETCD is used for in DRT: + +```python +# server.py - Server side workflow +from dynamo.runtime import DistributedRuntime, dynamo_worker + +@dynamo_worker(static=False) +async def worker(runtime: DistributedRuntime): + component = runtime.namespace(ns).component("backend") + await component.create_service() + endpoint = component.endpoint("generate") + + await endpoint.serve_endpoint(RequestHandler().generate) +``` + +```python +# client.py - Client side workflow +from dynamo.runtime import DistributedRuntime, dynamo_worker + +@dynamo_worker(static=False) +async def worker(runtime: DistributedRuntime): + await init(runtime, "dynamo") + + endpoint = runtime.namespace(ns).component("backend").endpoint("generate") + + # Create client and wait for an endpoint to be ready + client = await endpoint.client() + await client.wait_for_instances() + + stream = await client.generate("hello world") + async for char in stream: + print(char) +``` + +### Server Side: +1. Server registers with DRT, receiving a primary lease (unique instance identifier) from ETCD +2. Server creates an entry in ETCD advertising its endpoint. The entry is tied to the lease of the instance. +3. Server continues to renew the lease. This keeps its associated endpoint entries alive. If a lease fails to renew, associated endpoint entries are automatically removed by ETCD. + +### Client Side: +1. Client asks ETCD for a watch of endpoints. +2. Client maintains a cache of such entries, updating its cache as it receives updates from ETCD. +3. Client uses the transport to target one of the instances in its cache. + +```bash +# Example etcd keys showing endpoints +$ etcdctl get --prefix instances/ +instances/dynamo/worker/generate:5fc98e41ac8ce3b +{ + "endpoint":"generate", + "namespace":"dynamo", + "component":"worker", + "endpoint_id":"generate", + "instance_id":"worker-abc123", + "transport":"nats..." 
+} +``` + +Summary of entities: +- Leases: Unique identifier for an instance with TTL and auto-renewal +- Endpoints: Service registration entries associated with a lease +- Watches: Real-time subscription to key prefix changes + +Summary of operations: +- Creating/renewing leases: create_lease(), keep_alive background task +- Creating endpoints: kv_create() with lease attachment +- Watching endpoints: kv_get_and_watch_prefix() for real-time updates +- Automatic cleanup: Lease expiration removes associated keys + +Key ETCD APIs used in Dynamo Runtime (from lib/runtime/src/transports/etcd.rs): + +Lease Management: +- create_lease() - Create lease with TTL +- revoke_lease() - Revoke lease explicitly + +Key-Value Operations: +- kv_create() - Create key if it doesn't exist +- kv_create_or_validate() - Create or validate existing key +- kv_get_and_watch_prefix() - Get initial values and watch for changes + + +## Approach 1: Using Custom Resources and kubectl watch API + +We use the kubectl `watch` API and Custom Resources for Lease and DynamoEndpoint to achieve similar functionality. + +### Server side: +1. Server registers with DRT, creating a Lease resource in K8s +2. Server creates a DynamoEndpoint CR in K8s with the owner ref set to its Lease resource +3. Server renews its Lease, keeping the controller/operator from terminating it (and the associated DynamoEndpoint CR) + +### Client side: +1. Client asks K8s for a watch of DynamoEndpoints (using the kubectl `watch` API). +2. Client maintains a cache of such entries, updating its cache as it receives updates from K8s +3. Client uses the transport to target one of the instances in its cache. + +### Controller: +1. Dynamo controller is responsible for deleting leases that have not been renewed. +2. When a lease is deleted, all associated DynamoEndpoint CRs are also deleted. + +```mermaid +sequenceDiagram + participant Server + participant K8s API + participant Controller + participant Client + + Note over Server: Dynamo Instance Startup + Server->>K8s API: Create Lease resource + Server->>K8s API: Create DynamoEndpoint CR (ownerRef=Lease) + Server->>K8s API: Renew Lease periodically + + Note over Controller: Lease Management + Controller->>K8s API: Watch Lease resources + Controller->>Controller: Check lease expiry + Controller->>K8s API: Delete expired Leases + K8s API->>K8s API: Auto-delete DynamoEndpoints (ownerRef) + + Note over Client: Service Discovery + Client->>K8s API: Watch DynamoEndpoint resources + K8s API-->>Client: Stream endpoint changes + Client->>Client: Update instance cache + Client->>Server: Route requests via NATS +``` + +```yaml +# Example Lease resource +apiVersion: coordination.k8s.io/v1 +kind: Lease +metadata: + name: dynamo-lease + namespace: dynamo + ownerReferences: + - apiVersion: pods/v1 + kind: Pod + name: dynamo-pod-abc123 +spec: + holderIdentity: dynamo-pod-abc123 + leaseDurationSeconds: 30 + renewDeadlineSeconds: 20 +``` + +```yaml +# Example DynamoEndpoint CR +apiVersion: dynamo.nvidia.com/v1alpha1 +kind: DynamoEndpoint +metadata: + name: dynamo-endpoint + namespace: dynamo + ownerReferences: + - apiVersion: coordination.k8s.io/v1 + kind: Lease + name: dynamo-lease + labels: + dynamo-namespace: dynamo + dynamo-component: backend + dynamo-endpoint: generate +spec: + ... 
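+  # Illustrative only: the spec schema is left open ("...") in this sketch.
+  # A plausible spec would mirror the etcd instance entry shown earlier,
+  # e.g. namespace, component, endpoint name, owning instance id, and transport info.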
+``` + +Kubernetes concepts: +- Lease: Existing Kubernetes resource in the coordination.k8s.io API group +- Owner references: Metadata establishing parent-child relationships for automatic cleanup +- kubectl watch API: Real-time subscription to resource changes +- Custom resources: Extension mechanism for arbitrary data storage. We can introduce a new CRD for DynamoEndpoint and DynamoModel and similar to store state associated with a lese. + +Trade-offs: +- Pros: No external etcd cluster, leverages Kubernetes primitives, automatic cleanup +- Cons: API latency overhead, limited prefix watching, atomicity complexity, controller overhead + +Notes: +- Requires a controller to delete leases on expiry. This is not something k8s automatically handles for us. +- prefix-based watching for changes is not supported by the kubectl `watch` api. We can however watch on a set of labels that correspond to the endpoints we are interested in. +- Unavoidable: overhead of going through kube api as opposed to direct etcd calls. +- Need to work out atomicity of operations + + +## Approach 2: Watching EndpointSlices + +Disclaimer: This approach is exploratory and would require significant changes to the current Dynamo architecture. It eliminates explicit lease management by leveraging Kubernetes' built-in pod lifecycle and readiness probes. + +### Server side: +1. Dynamo operator creates server instances. Pods are labeled with `dynamo-namespace` and `dynamo-component`. +2. When a pod wants to serve an endpoint, it performs two actions: + - Creates a Service for that endpoint (if it doesn't exist) with selectors: + - `dynamo-namespace` + - `dynamo-component` + - `dynamo-endpoint-: true` + - Patches its own labels to add `dynamo-endpoint-: true` +3. Readiness probe status reflects the health of this specific endpoint. + +### Client side: +1. Client watches EndpointSlices associated with the target Kubernetes Service. +2. EndpointSlices maintain the current state of pods serving the endpoint and their readiness status. +3. Client maintains a cache of available instances, updating as EndpointSlice changes arrive. +4. Client routes requests to healthy instances via NATS transport. + +```mermaid +graph TD + A[Dynamo Pod] --> B{Kubernetes Service} + B --> C[EndpointSlice] + C --> D[Client Watch] + + A --> E[Readiness Probe] + E --> F{Endpoint Healthy?} + F -->|Yes| G[Pod Ready] + F -->|No| H[Pod Not Ready] + + D --> I[Instance Cache] + I --> J[Route to NATS] + + subgraph "Pod Lifecycle" + K[Pod Starts] --> L[Register Endpoint] + L --> M[Update Labels] + M --> N[Create Service] + end + + subgraph "Service Discovery" + O[Watch EndpointSlices] --> P[Receive Updates] + P --> Q[Update Cache] + Q --> R[Route Requests] + end +``` + +Kubernetes concepts: +- Services and EndpointSlices: Services define pod sets, EndpointSlices track pod addresses and readiness +- Pod labels and selectors: Key-value pairs for pod identification and Service targeting +- Readiness probes: Health checks that determine pod readiness for traffic + +```yaml +# Example Service and EndpointSlice +apiVersion: v1 +kind: Service +metadata: + name: dynamo-generate-service + namespace: dynamo + labels: + dynamo-namespace: dynamo + dynamo-component: backend + dynamo-endpoint: generate +spec: + selector: + dynamo-namespace: dynamo + dynamo-component: backend + dynamo-endpoint-generate: "true" + ports: # dummy port since transport isn't actually taking place through this + - ... 
+ type: ClusterIP + +--- +apiVersion: discovery.k8s.io/v1 +kind: EndpointSlice +metadata: + name: dynamo-generate-service-abc12 + namespace: dynamo + labels: + kubernetes.io/service-name: dynamo-generate-service +addressType: IPv4 +ports: +- ... # dummy port since transport isn't actually taking place through this +endpoints: +- addresses: + - "10.0.1.100" + conditions: + ready: true + targetRef: + apiVersion: pods/v1 + kind: Pod + name: dynamo-pod-abc123 +- addresses: + - "10.0.1.101" + conditions: + ready: false # Pod failed readiness probe + targetRef: + apiVersion: pods/v1 + kind: Pod + name: dynamo-pod-def456 +``` + +Trade-offs: +- Pros: No lease management, automatic cleanup, native health checks +- Cons: Dynamic Service creation, label management complexity, limited granularity + +Notes: +- Service creation could be moved to a controller +- Consider annotations instead of labels for endpoint advertisement +- Multiple readiness probes per pod for per-endpoint health status From 23fc0d5decbe571a642276c24849969e7d2ea8a4 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 12:54:31 -0700 Subject: [PATCH 02/19] Revise mermaid diagram for Kubernetes service flow Signed-off-by: mohammedabdulwahhab --- etcd-k8s.md => deps/etcd-k8s.md | 44 +++++++++++++++------------------ 1 file changed, 20 insertions(+), 24 deletions(-) rename etcd-k8s.md => deps/etcd-k8s.md (93%) diff --git a/etcd-k8s.md b/deps/etcd-k8s.md similarity index 93% rename from etcd-k8s.md rename to deps/etcd-k8s.md index 911a85c..598490c 100644 --- a/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -219,30 +219,26 @@ Disclaimer: This approach is exploratory and would require significant changes t 4. Client routes requests to healthy instances via NATS transport. ```mermaid -graph TD - A[Dynamo Pod] --> B{Kubernetes Service} - B --> C[EndpointSlice] - C --> D[Client Watch] - - A --> E[Readiness Probe] - E --> F{Endpoint Healthy?} - F -->|Yes| G[Pod Ready] - F -->|No| H[Pod Not Ready] - - D --> I[Instance Cache] - I --> J[Route to NATS] - - subgraph "Pod Lifecycle" - K[Pod Starts] --> L[Register Endpoint] - L --> M[Update Labels] - M --> N[Create Service] - end - - subgraph "Service Discovery" - O[Watch EndpointSlices] --> P[Receive Updates] - P --> Q[Update Cache] - Q --> R[Route Requests] - end +sequenceDiagram + participant Pod + participant K8s API + participant Client + + Note over Pod: Dynamo Pod Lifecycle + Pod->>K8s API: Create Service (if doesn't exist) + Pod->>K8s API: Update pod labels for endpoint + Pod->>Pod: Readiness probe health check + + Note over K8s API: Service Management + K8s API->>K8s API: Create EndpointSlice for Service + K8s API->>K8s API: Update EndpointSlice with pod readiness + K8s API->>K8s API: Remove failed pods from EndpointSlice + + Note over Client: Service Discovery + Client->>K8s API: Watch EndpointSlice for target Service + K8s API-->>Client: Stream readiness changes + Client->>Client: Update instance cache + Client->>Pod: Route requests via NATS ``` Kubernetes concepts: From 636197a65d968c0cc407daa0b2958bd9a3ade024 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 12:59:11 -0700 Subject: [PATCH 03/19] Revise trade-offs and notes in etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 598490c..f3e5f10 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -296,11 +296,8 @@ endpoints: name: dynamo-pod-def456 ``` -Trade-offs: -- 
Pros: No lease management, automatic cleanup, native health checks -- Cons: Dynamic Service creation, label management complexity, limited granularity - Notes: -- Service creation could be moved to a controller -- Consider annotations instead of labels for endpoint advertisement -- Multiple readiness probes per pod for per-endpoint health status +- Pro: We don't need a dedicated controller to delete leases on expiry. (No leases) +- We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. +- The service is not actually used for transport here. Only to enumerate the endpointslices which are doing book keeping for which pods are backing the endpoint. + From 4d1349cea26aaca13703ae2f695d83148cfd5488 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:06:28 -0700 Subject: [PATCH 04/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 77 ++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 75 insertions(+), 2 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index f3e5f10..6937584 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -102,7 +102,7 @@ Key-Value Operations: - kv_get_and_watch_prefix() - Get initial values and watch for changes -## Approach 1: Using Custom Resources and kubectl watch API +## Approach 1: Lease-Based Endpoint Registry We use the kubectl `watch` API and Custom Resources for Lease and DynamoEndpoint to achieve similar functionality. @@ -198,7 +198,7 @@ Notes: - Need to work out atomicity of operations -## Approach 2: Watching EndpointSlices +## Approach 2: Pod Lifecycle-Driven Discovery Disclaimer: This approach is exploratory and would require significant changes to the current Dynamo architecture. It eliminates explicit lease management by leveraging Kubernetes' built-in pod lifecycle and readiness probes. @@ -301,3 +301,76 @@ Notes: - We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. - The service is not actually used for transport here. Only to enumerate the endpointslices which are doing book keeping for which pods are backing the endpoint. +## Approach 3: Controller-managed DynamoEndpoint Resources + +This approach combines the best of both previous approaches: pods create DynamoEndpoint resources directly (like Approach 1), but a controller manages their status based on pod readiness (like Approach 2). + +### Server side: +1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint +2. Pods set ownerReferences to themselves on the DynamoEndpoint resources +3. DynamoEndpoint resources are automatically cleaned up when pods terminate + +### Controller: +1. Dynamo controller watches pod lifecycle events and readiness status +2. Controller updates the status field of DynamoEndpoint resources based on pod readiness +3. Controller maintains instance lists and transport information in DynamoEndpoint status + +### Client side: +1. Client watches DynamoEndpoint resources using kubectl watch API +2. Client maintains cache of available instances from DynamoEndpoint status +3. 
Client routes requests to healthy instances via NATS transport + +```mermaid +sequenceDiagram + participant Pod + participant K8s API + participant Controller + participant Client + + Note over Pod: Dynamo Pod Startup + Pod->>K8s API: Create DynamoEndpoint CR + Pod->>K8s API: Set ownerReference to pod + + Note over Controller: Status Management + Controller->>K8s API: Watch Pod lifecycle events + Controller->>K8s API: Watch Pod readiness changes + Controller->>K8s API: Update DynamoEndpoint status + K8s API->>K8s API: Auto-delete DynamoEndpoint (ownerRef) + + Note over Client: Service Discovery + Client->>K8s API: Watch DynamoEndpoint resources + K8s API-->>Client: Stream status changes + Client->>Client: Update instance cache + Client->>Pod: Route requests via NATS +``` + +```yaml +# Example DynamoEndpoint resource +apiVersion: dynamo.nvidia.com/v1alpha1 +kind: DynamoEndpoint +metadata: + name: dynamo-generate-endpoint + namespace: dynamo + labels: + dynamo-namespace: dynamo + dynamo-component: backend + dynamo-endpoint: generate + ownerReferences: + - apiVersion: v1 + kind: Pod + name: dynamo-pod-abc123 + uid: abc123-def456 +spec: + endpoint: generate + namespace: dynamo + component: backend +status: + ready: true # controller updates this on readiness change events +``` + +Kubernetes concepts: +- Custom Resource Definitions: Define the schema for DynamoEndpoint resources +- Owner references: Automatic cleanup when pods terminate + +Notes: +- Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. From 2f2d76f5dbbd234f742638f5593c23d93793452f Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:12:13 -0700 Subject: [PATCH 05/19] Refine etcd-k8s.md with updated notes and trade-offs Clarified trade-offs and Kubernetes concepts in etcd-k8s documentation. Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 6937584..f0d465e 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -187,17 +187,12 @@ Kubernetes concepts: - kubectl watch API: Real-time subscription to resource changes - Custom resources: Extension mechanism for arbitrary data storage. We can introduce a new CRD for DynamoEndpoint and DynamoModel and similar to store state associated with a lese. -Trade-offs: -- Pros: No external etcd cluster, leverages Kubernetes primitives, automatic cleanup -- Cons: API latency overhead, limited prefix watching, atomicity complexity, controller overhead - Notes: - Requires a controller to delete leases on expiry. This is not something k8s automatically handles for us. - prefix-based watching for changes is not supported by the kubectl `watch` api. We can however watch on a set of labels that correspond to the endpoints we are interested in. - Unavoidable: overhead of going through kube api as opposed to direct etcd calls. - Need to work out atomicity of operations - ## Approach 2: Pod Lifecycle-Driven Discovery Disclaimer: This approach is exploratory and would require significant changes to the current Dynamo architecture. It eliminates explicit lease management by leveraging Kubernetes' built-in pod lifecycle and readiness probes. 
@@ -243,7 +238,6 @@ sequenceDiagram Kubernetes concepts: - Services and EndpointSlices: Services define pod sets, EndpointSlices track pod addresses and readiness -- Pod labels and selectors: Key-value pairs for pod identification and Service targeting - Readiness probes: Health checks that determine pod readiness for traffic ```yaml @@ -299,7 +293,7 @@ endpoints: Notes: - Pro: We don't need a dedicated controller to delete leases on expiry. (No leases) - We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. -- The service is not actually used for transport here. Only to enumerate the endpointslices which are doing book keeping for which pods are backing the endpoint. +- The service is not actually used for transport here. Only to manage the EndpointSlices which are doing book keeping for which pods are backing the endpoint. ## Approach 3: Controller-managed DynamoEndpoint Resources From 8bf5d6c87b3b075d2306430bd20f3171eaa9648a Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:15:44 -0700 Subject: [PATCH 06/19] Revise DynamoEndpoint management approach This update introduces a new approach for managing DynamoEndpoint resources, combining pod lifecycle management with controller oversight. It details the server, controller, and client-side interactions for improved service discovery and resource management. Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 152 ++++++++++++++++++++++++----------------------- 1 file changed, 77 insertions(+), 75 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index f0d465e..475092f 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -193,7 +193,83 @@ Notes: - Unavoidable: overhead of going through kube api as opposed to direct etcd calls. - Need to work out atomicity of operations -## Approach 2: Pod Lifecycle-Driven Discovery + +## Approach 2: Controller-managed DynamoEndpoint Resources + +This approach combines the best of both previous approaches: pods create DynamoEndpoint resources directly (like Approach 1), but a controller manages their status based on pod readiness (like Approach 2). + +### Server side: +1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint +2. Pods set ownerReferences to themselves on the DynamoEndpoint resources +3. DynamoEndpoint resources are automatically cleaned up when pods terminate + +### Controller: +1. Dynamo controller watches pod lifecycle events and readiness status +2. Controller updates the status field of DynamoEndpoint resources based on pod readiness +3. Controller maintains instance lists and transport information in DynamoEndpoint status + +### Client side: +1. Client watches DynamoEndpoint resources using kubectl watch API +2. Client maintains cache of available instances from DynamoEndpoint status +3. 
Client routes requests to healthy instances via NATS transport + +```mermaid +sequenceDiagram + participant Pod + participant K8s API + participant Controller + participant Client + + Note over Pod: Dynamo Pod Startup + Pod->>K8s API: Create DynamoEndpoint CR + Pod->>K8s API: Set ownerReference to pod + + Note over Controller: Status Management + Controller->>K8s API: Watch Pod lifecycle events + Controller->>K8s API: Watch Pod readiness changes + Controller->>K8s API: Update DynamoEndpoint status + K8s API->>K8s API: Auto-delete DynamoEndpoint (ownerRef) + + Note over Client: Service Discovery + Client->>K8s API: Watch DynamoEndpoint resources + K8s API-->>Client: Stream status changes + Client->>Client: Update instance cache + Client->>Pod: Route requests via NATS +``` + +```yaml +# Example DynamoEndpoint resource +apiVersion: dynamo.nvidia.com/v1alpha1 +kind: DynamoEndpoint +metadata: + name: dynamo-generate-endpoint + namespace: dynamo + labels: + dynamo-namespace: dynamo + dynamo-component: backend + dynamo-endpoint: generate + ownerReferences: + - apiVersion: v1 + kind: Pod + name: dynamo-pod-abc123 + uid: abc123-def456 +spec: + endpoint: generate + namespace: dynamo + component: backend +status: + ready: true # controller updates this on readiness change events +``` + +Kubernetes concepts: +- Custom Resource Definitions: Define the schema for DynamoEndpoint resources +- Owner references: Automatic cleanup when pods terminate + +Notes: +- Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. + + +## Approach 3: Pod Lifecycle-Driven Discovery Disclaimer: This approach is exploratory and would require significant changes to the current Dynamo architecture. It eliminates explicit lease management by leveraging Kubernetes' built-in pod lifecycle and readiness probes. @@ -294,77 +370,3 @@ Notes: - Pro: We don't need a dedicated controller to delete leases on expiry. (No leases) - We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. - The service is not actually used for transport here. Only to manage the EndpointSlices which are doing book keeping for which pods are backing the endpoint. - -## Approach 3: Controller-managed DynamoEndpoint Resources - -This approach combines the best of both previous approaches: pods create DynamoEndpoint resources directly (like Approach 1), but a controller manages their status based on pod readiness (like Approach 2). - -### Server side: -1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint -2. Pods set ownerReferences to themselves on the DynamoEndpoint resources -3. DynamoEndpoint resources are automatically cleaned up when pods terminate - -### Controller: -1. Dynamo controller watches pod lifecycle events and readiness status -2. Controller updates the status field of DynamoEndpoint resources based on pod readiness -3. Controller maintains instance lists and transport information in DynamoEndpoint status - -### Client side: -1. Client watches DynamoEndpoint resources using kubectl watch API -2. Client maintains cache of available instances from DynamoEndpoint status -3. 
Client routes requests to healthy instances via NATS transport - -```mermaid -sequenceDiagram - participant Pod - participant K8s API - participant Controller - participant Client - - Note over Pod: Dynamo Pod Startup - Pod->>K8s API: Create DynamoEndpoint CR - Pod->>K8s API: Set ownerReference to pod - - Note over Controller: Status Management - Controller->>K8s API: Watch Pod lifecycle events - Controller->>K8s API: Watch Pod readiness changes - Controller->>K8s API: Update DynamoEndpoint status - K8s API->>K8s API: Auto-delete DynamoEndpoint (ownerRef) - - Note over Client: Service Discovery - Client->>K8s API: Watch DynamoEndpoint resources - K8s API-->>Client: Stream status changes - Client->>Client: Update instance cache - Client->>Pod: Route requests via NATS -``` - -```yaml -# Example DynamoEndpoint resource -apiVersion: dynamo.nvidia.com/v1alpha1 -kind: DynamoEndpoint -metadata: - name: dynamo-generate-endpoint - namespace: dynamo - labels: - dynamo-namespace: dynamo - dynamo-component: backend - dynamo-endpoint: generate - ownerReferences: - - apiVersion: v1 - kind: Pod - name: dynamo-pod-abc123 - uid: abc123-def456 -spec: - endpoint: generate - namespace: dynamo - component: backend -status: - ready: true # controller updates this on readiness change events -``` - -Kubernetes concepts: -- Custom Resource Definitions: Define the schema for DynamoEndpoint resources -- Owner references: Automatic cleanup when pods terminate - -Notes: -- Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. From 73902afef6c1738d2f2785bd78341835b7982444 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:19:54 -0700 Subject: [PATCH 07/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 475092f..8a13c0f 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -196,7 +196,7 @@ Notes: ## Approach 2: Controller-managed DynamoEndpoint Resources -This approach combines the best of both previous approaches: pods create DynamoEndpoint resources directly (like Approach 1), but a controller manages their status based on pod readiness (like Approach 2). +Pods create DynamoEndpoint resources directly, but a controller keeps status in sync with underlying pod readiness status. Instead of using leases, we sync with the readiness status of the pod (as advertised by the probe). ### Server side: 1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint @@ -269,9 +269,9 @@ Notes: - Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. -## Approach 3: Pod Lifecycle-Driven Discovery +## Approach 3: Pod Lifecycle-Driven Discovery using EndpointSlices -Disclaimer: This approach is exploratory and would require significant changes to the current Dynamo architecture. It eliminates explicit lease management by leveraging Kubernetes' built-in pod lifecycle and readiness probes. +Disclaimer: This idea is still WIP. It is similar to Approach 2, but eliminates the need for a controller relying on the Kubernetes Service controller to keep EndpointSlices up to date. ### Server side: 1. Dynamo operator creates server instances. Pods are labeled with `dynamo-namespace` and `dynamo-component`. 
From 372e600f7b99c0249e2b318fcf508743b7ec3362 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:20:58 -0700 Subject: [PATCH 08/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 8a13c0f..67e9161 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -269,7 +269,7 @@ Notes: - Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. -## Approach 3: Pod Lifecycle-Driven Discovery using EndpointSlices +## Approach 3: EndpointSlice based discovery Disclaimer: This idea is still WIP. It is similar to Approach 2, but eliminates the need for a controller relying on the Kubernetes Service controller to keep EndpointSlices up to date. From 074687b624a0c6aba2bc2e1e799faf6e1d52c987 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:21:23 -0700 Subject: [PATCH 09/19] Update title for etcd-k8s documentation Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 67e9161..e41af86 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -1,4 +1,4 @@ -# Alternative Approach to ETCD Dependency in Kubernetes +# ETCD-less Dynamo setup in Kubernetes ## Problem From 5c79080d9fc975fac49215821252a779f64bd5a8 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Thu, 4 Sep 2025 13:32:01 -0700 Subject: [PATCH 10/19] Refine EndpointSlice discovery documentation Clarified wording and improved structure in the EndpointSlice based discovery section. Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index e41af86..dc8ebf0 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -271,23 +271,22 @@ Notes: ## Approach 3: EndpointSlice based discovery -Disclaimer: This idea is still WIP. It is similar to Approach 2, but eliminates the need for a controller relying on the Kubernetes Service controller to keep EndpointSlices up to date. +Disclaimer: This idea is a WIP. It is similar to Approach 2, but eliminates the need for a custom controller relying instead on the Kubernetes Service controller to keep EndpointSlices up to date. ### Server side: -1. Dynamo operator creates server instances. Pods are labeled with `dynamo-namespace` and `dynamo-component`. +1. Pods for dynamo workers have labels for `dynamo-namespace` and `dynamo-component`. 2. When a pod wants to serve an endpoint, it performs two actions: - Creates a Service for that endpoint (if it doesn't exist) with selectors: - - `dynamo-namespace` - - `dynamo-component` + - `dynamo-namespace: NS_NAME` + - `dynamo-component: COMPONENT_NAME` - `dynamo-endpoint-: true` - Patches its own labels to add `dynamo-endpoint-: true` -3. Readiness probe status reflects the health of this specific endpoint. +3. Concurrently, the readiness probe is reporting the status of the worker and its endpoints. ### Client side: 1. Client watches EndpointSlices associated with the target Kubernetes Service. -2. EndpointSlices maintain the current state of pods serving the endpoint and their readiness status. -3. Client maintains a cache of available instances, updating as EndpointSlice changes arrive. -4. Client routes requests to healthy instances via NATS transport. +3. 
Client maintains a cache of available instances, updating as EndpointSlice changes arrive (in response to pods readiness status/addition/deletion). +4. Client routes requests to ready instances. ```mermaid sequenceDiagram @@ -313,7 +312,7 @@ sequenceDiagram ``` Kubernetes concepts: -- Services and EndpointSlices: Services define pod sets, EndpointSlices track pod addresses and readiness +- Services and EndpointSlices - Readiness probes: Health checks that determine pod readiness for traffic ```yaml @@ -370,3 +369,4 @@ Notes: - Pro: We don't need a dedicated controller to delete leases on expiry. (No leases) - We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. - The service is not actually used for transport here. Only to manage the EndpointSlices which are doing book keeping for which pods are backing the endpoint. + From f6ad4f3686569dfebcfbbdeb039b6d73b3239507 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Fri, 5 Sep 2025 14:35:15 -0700 Subject: [PATCH 11/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 454 +++++++++++++++++------------------------------ 1 file changed, 161 insertions(+), 293 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index dc8ebf0..1163193 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -1,72 +1,51 @@ -# ETCD-less Dynamo setup in Kubernetes +# ETCD-less Dynamo in Kubernetes: Endpoint Instance Discovery -## Problem +## Background -Customers are hesitant to stand up and maintain dedicated etcd clusters to deploy Dynamo. ETCD, however, is a hard dependency of the Dynamo Runtime. +This document is part of a series of deps attempting to eliminate the ETCD dependency in Dynamo. -It enables the following functionalities within DRT: +Source: https://github.com/ai-dynamo/enhancements/blob/neelays/runtime/XXXX-runtime-infrastructure.md -- Heartbeats/leases -- Component Registry/Service Discovery -- Cleanup on Shutdown -- General Key-value storage for various purposes (KVCache Metadata, Model Metadata, etc.) +ETCD usage in Dynamo can be categorized into the following: +- Endpoint Instance Discovery (Current focus) +- Model / Worker Discovery +- Worker Port Conflict Resolution +- Router State Sharing Synchronization -## ETCD and Kubernetes +Here we resolve how we can achieve endpoint instance discovery without ETCD in Kubernetes environments. -Kubernetes stores its own state in ETCD. With a few tradeoffs, we can use native Kubernetes APIs and Resources to achieve the same functionality. Under the hood, these operations are still backed by etcd. This amounts to an alternative implementation of the Dynamo Runtime interface that can run without needing ETCD in K8s environments. -This document explores a few approaches to achieve this. +## Current Implementation of Endpoint Instance Discovery -Note: This document is only to explore alternative approaches to eliminating the ETCD dependency. Decoupling from NATS (the transport layer) is a separate concern. 
- -## DRT Component Registry Primer - -Here is a primer on the core workflow that ETCD is used for in DRT: - -```python -# server.py - Server side workflow -from dynamo.runtime import DistributedRuntime, dynamo_worker - -@dynamo_worker(static=False) -async def worker(runtime: DistributedRuntime): - component = runtime.namespace(ns).component("backend") - await component.create_service() - endpoint = component.endpoint("generate") - - await endpoint.serve_endpoint(RequestHandler().generate) -``` +Here is a primer on the current implementation of endpoint instance discovery between a server and a client. +### APIs ```python -# client.py - Client side workflow -from dynamo.runtime import DistributedRuntime, dynamo_worker - -@dynamo_worker(static=False) -async def worker(runtime: DistributedRuntime): - await init(runtime, "dynamo") - - endpoint = runtime.namespace(ns).component("backend").endpoint("generate") - - # Create client and wait for an endpoint to be ready - client = await endpoint.client() - await client.wait_for_instances() - - stream = await client.generate("hello world") - async for char in stream: - print(char) +# server side +endpoint = runtime.namespace("dynamo").component("backend") +service = await component.create_service() +endpoint = service.endpoint("generate") +await endpoint.serve_endpoint(RequestHandler().generate) + +# client side +endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") +client = await endpoint.client() +await client.wait_for_instances() +stream = await client.generate("request") ``` ### Server Side: 1. Server registers with DRT, receiving a primary lease (unique instance identifier) from ETCD -2. Server creates an entry in ETCD advertising its endpoint. The entry is tied to the lease of the instance. +2. Server creates one or more entries in ETCD advertising its endpoint. The entry is tied to the lease of the instance. 3. Server continues to renew the lease. This keeps its associated endpoint entries alive. If a lease fails to renew, associated endpoint entries are automatically removed by ETCD. ### Client Side: -1. Client asks ETCD for a watch of endpoints. +1. Client asks ETCD for a watch of endpoints 2. Client maintains a cache of such entries, updating its cache as it receives updates from ETCD. -3. Client uses the transport to target one of the instances in its cache. +3. Client selects one of the endpoint instances in its cache and routes requests to it. 
```bash -# Example etcd keys showing endpoints +# Example of an etcd key showing an endpoint entry associated with an instance $ etcdctl get --prefix instances/ instances/dynamo/worker/generate:5fc98e41ac8ce3b { @@ -79,294 +58,183 @@ instances/dynamo/worker/generate:5fc98e41ac8ce3b } ``` -Summary of entities: -- Leases: Unique identifier for an instance with TTL and auto-renewal -- Endpoints: Service registration entries associated with a lease -- Watches: Real-time subscription to key prefix changes +```mermaid +sequenceDiagram + participant Server + participant ETCD + participant Client -Summary of operations: -- Creating/renewing leases: create_lease(), keep_alive background task -- Creating endpoints: kv_create() with lease attachment -- Watching endpoints: kv_get_and_watch_prefix() for real-time updates -- Automatic cleanup: Lease expiration removes associated keys + Server->>ETCD: Register instance, get lease + Server->>ETCD: Create endpoint entries tied to lease + Server->>ETCD: Renew lease periodically -Key ETCD APIs used in Dynamo Runtime (from lib/runtime/src/transports/etcd.rs): + Client->>ETCD: Watch for endpoints + ETCD-->>Client: Send endpoint updates + Client->>Client: Update local cache + Client->>Server: Route requests to selected instance -Lease Management: -- create_lease() - Create lease with TTL -- revoke_lease() - Revoke lease explicitly + Note over ETCD: Lease expires → entries auto-removed +``` -Key-Value Operations: -- kv_create() - Create key if it doesn't exist -- kv_create_or_validate() - Create or validate existing key -- kv_get_and_watch_prefix() - Get initial values and watch for changes +**How do we achieve these (or roughly similar) APIs in Kubernetes?** +## Approach 1: Ready pod <=> Ready component (No explicit endpoint entity) -## Approach 1: Lease-Based Endpoint Registry +High level ideas: +- Each pod is a unique instance of a component. The operator attaches labels for namespace, component to the pod. +- The readiness probe of the pod reflects the health of the instance. If it is ready, clients can route traffic to it. +- The readiness probe is only 200 when all endpoints are ready. +- The pod name is the unique instance identifier for internal use. When using NATS based transport, the topic is a function of the pod name. +- Clients simply set a watch for ready pods matching the labels for namespace, component. They update their cache of instances as events arrive (pod creation, pod deletion, pod readiness change). +- ***NOTE***: There is no explicit entity for endpoint. Clients know apriori which endpoints are available for a component. If the client made a mistake this will be known at request time. -We use the kubectl `watch` API and Custom Resources for Lease and DynamoEndpoint to achieve similar functionality. -### Server side: -1. Server registers with DRT, creating a Lease resource in K8s -2. Server creates a DynamoEndpoint CR in K8s with the owner ref set to its Lease resource -3. Server renews its Lease, keeping the controller/operator from terminating it (and the associated DynamoEndpoint CR) +### APIs -### Client side: -1. Client asks K8s for a watch of DynamoEndpoints (using the kubectl `watch` API). -2. Client maintains a cache of such entries, updating its cache as it receives updates from K8s -3. Client uses the transport to target one of the instances in its cache. +APIs stay the same -### Controller: -1. Dynamo controller is responsible for deleting leases that have not been renewed. -2. 
When a lease is deleted, all associated DynamoEndpoint CRs are also deleted. +```python +# SERVER SIDE + +# Operator creates a pod with labels for namespace, component. The namespace, component, and pod name are injected into the pod. +# get_component() returns a struct with info on the current instance's namespace, component, and pod name. +endpoint = get_component().endpoint("generate") +# subscribes to a NATS topic that is a function of (namespace, component, endpoint, pod name) +await endpoint.serve_endpoint(RequestHandler().generate) + +# CLIENT SIDE + +# setup a kube watch for pods matching the labels for namespace, component +# block until a ready pod is found that matches the labels for namespace, component +# NOTE: No validation is done that the pod actually serves the endpoint that we are interested in +endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") +client = await endpoint.client() +await client.wait_for_instances() + +# write to the NATS topic that is a function of (namespace, component, endpoint, pod name) +# errors out if the component doesn't actually have such an endpoint +stream = await client.generate("request") +``` ```mermaid sequenceDiagram - participant Server - participant K8s API - participant Controller + participant Pod + participant KubeAPI participant Client - Note over Server: Dynamo Instance Startup - Server->>K8s API: Create Lease resource - Server->>K8s API: Create DynamoEndpoint CR (ownerRef=Lease) - Server->>K8s API: Renew Lease periodically - - Note over Controller: Lease Management - Controller->>K8s API: Watch Lease resources - Controller->>Controller: Check lease expiry - Controller->>K8s API: Delete expired Leases - K8s API->>K8s API: Auto-delete DynamoEndpoints (ownerRef) - - Note over Client: Service Discovery - Client->>K8s API: Watch DynamoEndpoint resources - K8s API-->>Client: Stream endpoint changes - Client->>Client: Update instance cache - Client->>Server: Route requests via NATS -``` + Pod->>KubeAPI: Pod created with namespace/component labels + Pod->>Pod: Readiness probe becomes healthy + Pod->>KubeAPI: Pod status updated to Ready -```yaml -# Example Lease resource -apiVersion: coordination.k8s.io/v1 -kind: Lease -metadata: - name: dynamo-lease - namespace: dynamo - ownerReferences: - - apiVersion: pods/v1 - kind: Pod - name: dynamo-pod-abc123 -spec: - holderIdentity: dynamo-pod-abc123 - leaseDurationSeconds: 30 - renewDeadlineSeconds: 20 -``` + Client->>KubeAPI: Watch pods with matching labels + KubeAPI-->>Client: Send pod events (create/update/delete) + Client->>Client: Update local cache of ready pods + Client->>Pod: Route requests to selected pod instance -```yaml -# Example DynamoEndpoint CR -apiVersion: dynamo.nvidia.com/v1alpha1 -kind: DynamoEndpoint -metadata: - name: dynamo-endpoint - namespace: dynamo - ownerReferences: - - apiVersion: coordination.k8s.io/v1 - kind: Lease - name: dynamo-lease - labels: - dynamo-namespace: dynamo - dynamo-component: backend - dynamo-endpoint: generate -spec: - ... + Note over KubeAPI: No explicit endpoint validation ``` -Kubernetes concepts: -- Lease: Existing Kubernetes resource in the coordination.k8s.io API group -- Owner references: Metadata establishing parent-child relationships for automatic cleanup -- kubectl watch API: Real-time subscription to resource changes -- Custom resources: Extension mechanism for arbitrary data storage. We can introduce a new CRD for DynamoEndpoint and DynamoModel and similar to store state associated with a lese. 
- Notes: -- Requires a controller to delete leases on expiry. This is not something k8s automatically handles for us. -- prefix-based watching for changes is not supported by the kubectl `watch` api. We can however watch on a set of labels that correspond to the endpoints we are interested in. -- Unavoidable: overhead of going through kube api as opposed to direct etcd calls. -- Need to work out atomicity of operations +- Assumes that all components serve the same set of endpoints. +- If a client makes a mistake and requests a non-existent endpoint, it will error out at request time. +- Addressing relies on convention. I know the names of the endpoints that are available on a given component. +- Assumes each pod holds exactly 1 instance of a component (TODO: what setups is this not true for?) +## Approach 2: Endpoints are dynamically exposed and discovered on components -## Approach 2: Controller-managed DynamoEndpoint Resources +This approach is very similar to the first approach, but we components dynamically expose endpoints. Idea from Julien M. -Pods create DynamoEndpoint resources directly, but a controller keeps status in sync with underlying pod readiness status. Instead of using leases, we sync with the readiness status of the pod (as advertised by the probe). - -### Server side: -1. Dynamo pods create DynamoEndpoint CRs directly when they start serving an endpoint -2. Pods set ownerReferences to themselves on the DynamoEndpoint resources -3. DynamoEndpoint resources are automatically cleaned up when pods terminate - -### Controller: -1. Dynamo controller watches pod lifecycle events and readiness status -2. Controller updates the status field of DynamoEndpoint resources based on pod readiness -3. Controller maintains instance lists and transport information in DynamoEndpoint status - -### Client side: -1. Client watches DynamoEndpoint resources using kubectl watch API -2. Client maintains cache of available instances from DynamoEndpoint status -3. Client routes requests to healthy instances via NATS transport - -```mermaid -sequenceDiagram - participant Pod - participant K8s API - participant Controller - participant Client - - Note over Pod: Dynamo Pod Startup - Pod->>K8s API: Create DynamoEndpoint CR - Pod->>K8s API: Set ownerReference to pod - - Note over Controller: Status Management - Controller->>K8s API: Watch Pod lifecycle events - Controller->>K8s API: Watch Pod readiness changes - Controller->>K8s API: Update DynamoEndpoint status - K8s API->>K8s API: Auto-delete DynamoEndpoint (ownerRef) - - Note over Client: Service Discovery - Client->>K8s API: Watch DynamoEndpoint resources - K8s API-->>Client: Stream status changes - Client->>Client: Update instance cache - Client->>Pod: Route requests via NATS -``` +High level ideas: +- Pod adds an annotation on itself to list out the endpoints it exposes. The labels for namespace and component are injected as before. +- Clients watch for pod events on the tuple (namespace, component) and do a client side filter to check if there is an annotation on the pod matching the endpoint they are interested in. 
```yaml -# Example DynamoEndpoint resource -apiVersion: dynamo.nvidia.com/v1alpha1 -kind: DynamoEndpoint +# Example pod annotation +apiVersion: v1 +kind: Pod metadata: - name: dynamo-generate-endpoint - namespace: dynamo + name: dynamo-pod-abc123 labels: - dynamo-namespace: dynamo - dynamo-component: backend - dynamo-endpoint: generate - ownerReferences: - - apiVersion: v1 - kind: Pod - name: dynamo-pod-abc123 - uid: abc123-def456 -spec: - endpoint: generate - namespace: dynamo - component: backend -status: - ready: true # controller updates this on readiness change events -``` - -Kubernetes concepts: -- Custom Resource Definitions: Define the schema for DynamoEndpoint resources -- Owner references: Automatic cleanup when pods terminate - -Notes: -- Controller is in charge of updating the status of the DynamoEndpoint as underlying pod readiness changes. - + nvidia.com/dynamo-namespace: dynamo # set by pod + nvidia.com/dynamo-component: backend + annotations: + dynamo.nvidia.com/endpoint/generate="{...}" # pod dynamically adds and updates this annotatino + dynamo.nvidia.com/endpoint/generate_tokens="{...}" -## Approach 3: EndpointSlice based discovery +``` -Disclaimer: This idea is a WIP. It is similar to Approach 2, but eliminates the need for a custom controller relying instead on the Kubernetes Service controller to keep EndpointSlices up to date. +### APIs -### Server side: -1. Pods for dynamo workers have labels for `dynamo-namespace` and `dynamo-component`. -2. When a pod wants to serve an endpoint, it performs two actions: - - Creates a Service for that endpoint (if it doesn't exist) with selectors: - - `dynamo-namespace: NS_NAME` - - `dynamo-component: COMPONENT_NAME` - - `dynamo-endpoint-: true` - - Patches its own labels to add `dynamo-endpoint-: true` -3. Concurrently, the readiness probe is reporting the status of the worker and its endpoints. +APIs stay the same -### Client side: -1. Client watches EndpointSlices associated with the target Kubernetes Service. -3. Client maintains a cache of available instances, updating as EndpointSlice changes arrive (in response to pods readiness status/addition/deletion). -4. Client routes requests to ready instances. +```python +# SERVER SIDE + +# Operator creates a pod with labels for namespace, component. The namespace, component, and pod name are injected into the pod. +# get_component() returns a struct with info on the current instance's namespace, component, and pod name. 
+endpoint = get_component().endpoint("generate") +# NEW: Modifies its own annotation to add the endpoint to the list of endpoints it serves +# subscribes to a NATS topic that is a function of (namespace, component, endpoint, pod name) +await endpoint.serve_endpoint(RequestHandler().generate) + +# CLIENT SIDE + +# setup a kube watch for pods matching the labels for namespace, component +# filter the pods to only include ones that have the endpoint in their annotation +# block until a READY pod is found that matches the labels for namespace, component and has the endpoint in their annotation +endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") +client = await endpoint.client() +await client.wait_for_instances() + +# write to the NATS topic that is a function of (namespace, component, endpoint, pod name) +stream = await client.generate("request") +``` ```mermaid sequenceDiagram participant Pod - participant K8s API + participant KubeAPI participant Client - Note over Pod: Dynamo Pod Lifecycle - Pod->>K8s API: Create Service (if doesn't exist) - Pod->>K8s API: Update pod labels for endpoint - Pod->>Pod: Readiness probe health check - - Note over K8s API: Service Management - K8s API->>K8s API: Create EndpointSlice for Service - K8s API->>K8s API: Update EndpointSlice with pod readiness - K8s API->>K8s API: Remove failed pods from EndpointSlice - - Note over Client: Service Discovery - Client->>K8s API: Watch EndpointSlice for target Service - K8s API-->>Client: Stream readiness changes - Client->>Client: Update instance cache - Client->>Pod: Route requests via NATS + Pod->>KubeAPI: Pod created with namespace/component labels + Pod->>KubeAPI: Add endpoint annotation dynamically + Pod->>Pod: Readiness probe becomes healthy + Pod->>KubeAPI: Pod status updated to Ready + + Client->>KubeAPI: Watch pods with matching labels + KubeAPI-->>Client: Send pod events with annotations + Client->>Client: Filter pods by endpoint annotation + Client->>Client: Update local cache of matching ready pods + Client->>Pod: Route requests to selected pod instance + + Note over Client: Client-side filtering by endpoint annotation ``` -Kubernetes concepts: -- Services and EndpointSlices -- Readiness probes: Health checks that determine pod readiness for traffic +Notes: +- Allows for storage of metadata associated with the endpoint in the annotation +- Allows for dynamic updates of endpoints +- Client side parsing and filtering of the annotation may add overhead +- Honors the API, blocks until we know for sure an endpoint is available on the given component + +## Approach 3: Endpoints associated with a component are statically defined on the DGD + +This approach is similar to the first approach, but the endpoints and any metadata about them are statically defined on the DGD. + +Instead of a pod dynamically adding and updating its own annotations, the operator can define a set of labels on pod start that enable it to be watched by clients. ```yaml -# Example Service and EndpointSlice +# Example DGD apiVersion: v1 -kind: Service +kind: DGD metadata: - name: dynamo-generate-service - namespace: dynamo + name: dynamo-dgd-abc123 labels: - dynamo-namespace: dynamo - dynamo-component: backend - dynamo-endpoint: generate -spec: - selector: - dynamo-namespace: dynamo - dynamo-component: backend - dynamo-endpoint-generate: "true" - ports: # dummy port since transport isn't actually taking place through this - - ... 
- type: ClusterIP - ---- -apiVersion: discovery.k8s.io/v1 -kind: EndpointSlice -metadata: - name: dynamo-generate-service-abc12 - namespace: dynamo - labels: - kubernetes.io/service-name: dynamo-generate-service -addressType: IPv4 -ports: -- ... # dummy port since transport isn't actually taking place through this -endpoints: -- addresses: - - "10.0.1.100" - conditions: - ready: true - targetRef: - apiVersion: pods/v1 - kind: Pod - name: dynamo-pod-abc123 -- addresses: - - "10.0.1.101" - conditions: - ready: false # Pod failed readiness probe - targetRef: - apiVersion: pods/v1 - kind: Pod - name: dynamo-pod-def456 + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: backend + nvidia.com/expose-dynamo-endpoint/generate: true + nvidia.com/expose-dynamo-endpoint/generate_tokens: true ``` Notes: -- Pro: We don't need a dedicated controller to delete leases on expiry. (No leases) -- We need to find a better pattern for a pod to influence the services it is part of than mutating its label set. Potentially a controller could be involved. -- The service is not actually used for transport here. Only to manage the EndpointSlices which are doing book keeping for which pods are backing the endpoint. - +- Endpoints of the component are codified in the spec itself +- TODO: enforce that the application logic exposes the endpoints that it is supposed to according to what is defined in the spec From 4a93fc7d638c763cbcd67a5fe95266caf0e5d591 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Mon, 8 Sep 2025 08:30:36 -0700 Subject: [PATCH 12/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 105 ++++++++++++++++++++++++++++++++++++----------- 1 file changed, 80 insertions(+), 25 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 1163193..fa3e6f6 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -84,7 +84,7 @@ High level ideas: - Each pod is a unique instance of a component. The operator attaches labels for namespace, component to the pod. - The readiness probe of the pod reflects the health of the instance. If it is ready, clients can route traffic to it. - The readiness probe is only 200 when all endpoints are ready. -- The pod name is the unique instance identifier for internal use. When using NATS based transport, the topic is a function of the pod name. +- The pod name is the unique instance identifier for internal use. When using NATS based transport, the topic is a function of the pod name. (Similar to the primary lease in etcd) - Clients simply set a watch for ready pods matching the labels for namespace, component. They update their cache of instances as events arrive (pod creation, pod deletion, pod readiness change). - ***NOTE***: There is no explicit entity for endpoint. Clients know apriori which endpoints are available for a component. If the client made a mistake this will be known at request time. @@ -99,7 +99,7 @@ APIs stay the same # Operator creates a pod with labels for namespace, component. The namespace, component, and pod name are injected into the pod. # get_component() returns a struct with info on the current instance's namespace, component, and pod name. 
endpoint = get_component().endpoint("generate") -# subscribes to a NATS topic that is a function of (namespace, component, endpoint, pod name) +# exposes the endpoint on the transport await endpoint.serve_endpoint(RequestHandler().generate) # CLIENT SIDE @@ -111,11 +111,21 @@ endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") client = await endpoint.client() await client.wait_for_instances() -# write to the NATS topic that is a function of (namespace, component, endpoint, pod name) +# Send the message via the transport # errors out if the component doesn't actually have such an endpoint stream = await client.generate("request") ``` +``` +apiVersion: v1 +kind: Pod +metadata: + name: dynamo-pod-abc123 + labels: + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: backend +``` + ```mermaid sequenceDiagram participant Pod @@ -140,7 +150,68 @@ Notes: - Addressing relies on convention. I know the names of the endpoints that are available on a given component. - Assumes each pod holds exactly 1 instance of a component (TODO: what setups is this not true for?) -## Approach 2: Endpoints are dynamically exposed and discovered on components + +## Approach 2: Endpoints associated with a component are statically defined on the DGD + +This approach is similar to the first approach, but the endpoints and any metadata about them are statically defined on the DGD. + +Instead of a pod dynamically adding and updating its own annotations, the operator can define a set of labels on pod start that enable it to be watched by clients. + +```yaml +# Example DGD +apiVersion: v1 +kind: DGD +metadata: + name: dynamo-dgd-abc123 + services: + decode: # the decode component defines the endpoints it is associated with + endpoints: + - generate + - generate_tokens + +# Pods are created with labels for namespace, component, and endpoints that can then be used to filter by kube watch +apiVersion: v1 +kind: Pod +metadata: + name: dynamo-pod-abc123 + labels: + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: backend + nvidia.com/expose-dynamo-endpoint/generate: true + nvidia.com/expose-dynamo-endpoint/generate_tokens: true +``` + +### APIs + +APIs stay the same + +```python +# SERVER SIDE + +# Operator creates a pod with labels for namespace, component. The namespace, component, endpoints, and pod name are injected into the pod. +endpoint = get_component().endpoint("generate") +# exposes the endpoint on the transport +# NOTE: This endpoint will be discoverable as soon as the pod marks itself ready. (No explicit registration necessary) +await endpoint.serve_endpoint(RequestHandler().generate) + +# CLIENT SIDE + +# setup a kube watch for pods matching the labels for namespace, component, endpoint +# block until a ready pod is found that matches the labels for namespace, component +endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") +client = await endpoint.client() +await client.wait_for_instances() + +# Send the message via the transport. 
(e.g if transport is NATS, the message is sent to the NATS topic that is a function of (namespace, component, endpoint, pod name)) +# errors out if the component doesn't actually have such an endpoint +stream = await client.generate("request") +``` + +Notes: +- Endpoints of the component are codified in the spec itself +- TODO: enforce that the application logic exposes the endpoints that it is supposed to according to what is defined in the spec + +## Approach 3: Endpoints are dynamically exposed and discovered on components This approach is very similar to the first approach, but we components dynamically expose endpoints. Idea from Julien M. @@ -216,25 +287,9 @@ Notes: - Client side parsing and filtering of the annotation may add overhead - Honors the API, blocks until we know for sure an endpoint is available on the given component -## Approach 3: Endpoints associated with a component are statically defined on the DGD -This approach is similar to the first approach, but the endpoints and any metadata about them are statically defined on the DGD. - -Instead of a pod dynamically adding and updating its own annotations, the operator can define a set of labels on pod start that enable it to be watched by clients. - -```yaml -# Example DGD -apiVersion: v1 -kind: DGD -metadata: - name: dynamo-dgd-abc123 - labels: - nvidia.com/dynamo-namespace: dynamo - nvidia.com/dynamo-component: backend - nvidia.com/expose-dynamo-endpoint/generate: true - nvidia.com/expose-dynamo-endpoint/generate_tokens: true -``` - -Notes: -- Endpoints of the component are codified in the spec itself -- TODO: enforce that the application logic exposes the endpoints that it is supposed to according to what is defined in the spec +feedback: +- add examples of labels +- remove mentions to nats transport +- reorder 3 to 2 in the doc +- workers have associated configmap that the worker can read/write to From 9e1004f3b7cb45af7fc4a69557f4ce40b1bd41a4 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Mon, 8 Sep 2025 23:58:24 -0700 Subject: [PATCH 13/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 459 ++++++++++++++++++++++++----------------------- 1 file changed, 239 insertions(+), 220 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index fa3e6f6..9ebd9aa 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -2,294 +2,313 @@ ## Background -This document is part of a series of deps attempting to eliminate the ETCD dependency in Dynamo. +This document is part of a series of docs attempting to replace the ETCD dependency in Dynamo using Kubernetes Source: https://github.com/ai-dynamo/enhancements/blob/neelays/runtime/XXXX-runtime-infrastructure.md ETCD usage in Dynamo can be categorized into the following: - Endpoint Instance Discovery (Current focus) -- Model / Worker Discovery +- Model / Worker Discovery (Current focus) - Worker Port Conflict Resolution - Router State Sharing Synchronization -Here we resolve how we can achieve endpoint instance discovery without ETCD in Kubernetes environments. +## ServiceDiscovery Interface +To de-couple Dynamo from ETCD, we define a minimal `ServiceDiscovery` interface that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. -## Current Implementation of Endpoint Instance Discovery +### Methods +- `create_instance`(namespace: str, component: str, metadata: dict) -> Instance + - Creates an instance and persists associated immutable metadata. 
Does not mark the instance as ready. +- `list_instances`(namespace: str, component: str) -> List[Instance] + - Lists all instances that match the given namespace and component. +- `get_metadata`(namespace: str, component: str, instance_id: str) -> Instance + - Returns the metadata for an instance. +- `subscribe`(namespace: str, component: str) -> EventStream + - Subscribes to events for the set of instances that match (namespace, component). Returns a stream of events giving signals for when instances that match the subscription are created, deleted, or readiness status changes. +- `set_instance_status`(instance: Instance, status: str) + - Marks the instance as ready or not ready for traffic. +- `get_instance_status`(instance: Instance) -> str + - Returns the status of the instance. + +- `write_kv`(key: str, value: str) + - Writes data at a given key. Data not associated with any instance. +- `read_kv`(key: str) + - Reads data at a given key. Data not associated with any instance. -Here is a primer on the current implementation of endpoint instance discovery between a server and a client. +### Overall Flow + +At a high level, these are the service discovery requirements in a given disaggregated inference setup: + +Credits: @itay + +1. The frontend needs to know how to reach the decode workers +2. The frontend needs to know what model key to route to the above workers +3. The frontend needs to know some model specifics (like tokenizer) for the specified model +4. The decode worker needs to know how to reach the prefill workers (for the disagg transfer) + +#### Frontend Discovers Decode Workers + +- Decode calls `create_instance("dynamo", "decode", metadata)` and `set_instance_status(decode_worker, "ready")` to register itself with the service discovery backend. Its metadata includes the model it serves. +- Frontend calls `list_instances("dynamo", "decode")` and `get_metadata` to get all decode workers and the associated model information to bootstrap its cache. +- Frontend sets up a watch on the decode workers using `subscribe("dynamo", "decode")` to keep its cache up to date. -### APIs ```python -# server side -endpoint = runtime.namespace("dynamo").component("backend") -service = await component.create_service() -endpoint = service.endpoint("generate") -await endpoint.serve_endpoint(RequestHandler().generate) - -# client side -endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") -client = await endpoint.client() -await client.wait_for_instances() -stream = await client.generate("request") +# Frontend start: +decode_workers = service_discovery.list_instances("dynamo", "decode") +for decode_worker in decode_workers: + # Fetch the associated metadata for the instance to register the model in-memory model registry + metadata = service_discovery.get_metadata("dynamo", "decode", decode_worker.instance_id) + model = metadata["Model"] + # map the instance to the model in the in-memory model registry ... ``` -### Server Side: -1. Server registers with DRT, receiving a primary lease (unique instance identifier) from ETCD -2. Server creates one or more entries in ETCD advertising its endpoint. The entry is tied to the lease of the instance. -3. Server continues to renew the lease. This keeps its associated endpoint entries alive. If a lease fails to renew, associated endpoint entries are automatically removed by ETCD. - -### Client Side: -1. Client asks ETCD for a watch of endpoints -2. 
Client maintains a cache of such entries, updating its cache as it receives updates from ETCD. -3. Client selects one of the endpoint instances in its cache and routes requests to it. - -```bash -# Example of an etcd key showing an endpoint entry associated with an instance -$ etcdctl get --prefix instances/ -instances/dynamo/worker/generate:5fc98e41ac8ce3b -{ - "endpoint":"generate", - "namespace":"dynamo", - "component":"worker", - "endpoint_id":"generate", - "instance_id":"worker-abc123", - "transport":"nats..." -} -``` +#### Frontend Needs to Know what Model Key -> Instance Mapping -```mermaid -sequenceDiagram - participant Server - participant ETCD - participant Client +Addressed above. - Server->>ETCD: Register instance, get lease - Server->>ETCD: Create endpoint entries tied to lease - Server->>ETCD: Renew lease periodically +#### Frontend Needs to Know Some Model Specifics (like Tokenizer) - Client->>ETCD: Watch for endpoints - ETCD-->>Client: Send endpoint updates - Client->>Client: Update local cache - Client->>Server: Route requests to selected instance +- Decode writes a Model Card using the `write_kv` method. Frontend can simply read this entry. - Note over ETCD: Lease expires → entries auto-removed -``` +#### Decode Worker Needs to Know how to Reach Prefill Workers + +- Decode worker does the exact same thing as the frontend, instead listing and subscribing to the "prefill" component. -**How do we achieve these (or roughly similar) APIs in Kubernetes?** -## Approach 1: Ready pod <=> Ready component (No explicit endpoint entity) + -High level ideas: -- Each pod is a unique instance of a component. The operator attaches labels for namespace, component to the pod. -- The readiness probe of the pod reflects the health of the instance. If it is ready, clients can route traffic to it. -- The readiness probe is only 200 when all endpoints are ready. -- The pod name is the unique instance identifier for internal use. When using NATS based transport, the topic is a function of the pod name. (Similar to the primary lease in etcd) -- Clients simply set a watch for ready pods matching the labels for namespace, component. They update their cache of instances as events arrive (pod creation, pod deletion, pod readiness change). -- ***NOTE***: There is no explicit entity for endpoint. Clients know apriori which endpoints are available for a component. If the client made a mistake this will be known at request time. +## API Reference +This table relates when each method is used in the context of endpoint instance discovery and model / worker management. We will also compare reference impls for each method in etcd and kubernetes. -### APIs +### create_instance -APIs stay the same +This method is used to create an instance and persist associated metadata tied to the life of the instance. Notably, this metadata could capture immutable information associated with the instance such as model information (e.g what is currently stored in ETCD as `models/{uuid}`). ```python -# SERVER SIDE - -# Operator creates a pod with labels for namespace, component. The namespace, component, and pod name are injected into the pod. -# get_component() returns a struct with info on the current instance's namespace, component, and pod name. 
-endpoint = get_component().endpoint("generate") -# exposes the endpoint on the transport -await endpoint.serve_endpoint(RequestHandler().generate) - -# CLIENT SIDE - -# setup a kube watch for pods matching the labels for namespace, component -# block until a ready pod is found that matches the labels for namespace, component -# NOTE: No validation is done that the pod actually serves the endpoint that we are interested in -endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") -client = await endpoint.client() -await client.wait_for_instances() - -# Send the message via the transport -# errors out if the component doesn't actually have such an endpoint -stream = await client.generate("request") +# Example: Registration of a decode worker with associated model information +# Currently, this is being stored as part of the register_llm function in ETCD as `models/{uuid}` +metadata = { + "Model": { + "name": "Qwen3-32B", + "type": "Completions", + "runtime_config": { + "total_kv_blocks": 24064, + "max_num_seqs": 256, + "max_num_batched_tokens": 2048 + } + } +} +decode_worker = service_discovery.create_instance(namespace, component, metadata) ``` -``` -apiVersion: v1 -kind: Pod -metadata: - name: dynamo-pod-abc123 - labels: - nvidia.com/dynamo-namespace: dynamo - nvidia.com/dynamo-component: backend -``` +#### Kubernetes Reference Impl +- Asserts the pod has namespace and component labels that match up with method args. Fails otherwise. +- Asserts there is a Kubernetes Service that will select on namespace and component labels. (More on this later) +- Writes a ConfigMap with the attached metadata. ConfigMap name is a function of the pod name. Returns an Instance object. -```mermaid -sequenceDiagram - participant Pod - participant KubeAPI - participant Client +Note: There are many ways to persist and access this immutable metadata. We can use ConfigMaps, a PVC with ReadWriteMany access, etc. - Pod->>KubeAPI: Pod created with namespace/component labels - Pod->>Pod: Readiness probe becomes healthy - Pod->>KubeAPI: Pod status updated to Ready +### get_metadata - Client->>KubeAPI: Watch pods with matching labels - KubeAPI-->>Client: Send pod events (create/update/delete) - Client->>Client: Update local cache of ready pods - Client->>Pod: Route requests to selected pod instance +This method is to fetch the metadata associated with an instance. For instance, the frontend can use this method to fetch the model information associated with an instance when it receives an event that a new worker is ready. - Note over KubeAPI: No explicit endpoint validation +```python +# Context: frontend has watch setup on decode workers +stream = service_discovery.subscribe("dynamo", "decode") +for event in stream: + switch event: + case NewReadyInstanceEvent: + # Fetch the associated metadata for the instance to register the model in-memory model registry + metadata = service_discovery.get_metadata("dynamo", "decode", instance_id) + model = metadata["Model"] ``` -Notes: -- Assumes that all components serve the same set of endpoints. -- If a client makes a mistake and requests a non-existent endpoint, it will error out at request time. -- Addressing relies on convention. I know the names of the endpoints that are available on a given component. -- Assumes each pod holds exactly 1 instance of a component (TODO: what setups is this not true for?) +#### Kubernetes Reference Impl +- Fetches the associated ConfigMap with the attached metadata. 
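+
+A minimal sketch of what this lookup could look like, assuming the official `kubernetes` Python client, a ConfigMap named `<pod-name>-metadata`, and a single `metadata` data key holding JSON (all three are illustrative choices, not decisions made here):
+
+```python
+import json
+
+from kubernetes import client, config
+
+
+def get_metadata(k8s_namespace: str, instance_id: str) -> dict:
+    """Fetch the metadata ConfigMap written when the instance was created.
+
+    instance_id is the pod name; the ConfigMap name is derived from it.
+    """
+    config.load_incluster_config()  # assumes we are running inside the cluster
+    core = client.CoreV1Api()
+    cm = core.read_namespaced_config_map(
+        name=f"{instance_id}-metadata",
+        namespace=k8s_namespace,
+    )
+    # The immutable metadata is stored as one JSON document under a single key
+    return json.loads(cm.data["metadata"])
+```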
+Note: There are many ways to persist and access this immutable metadata. We can use ConfigMaps, a PVC with ReadWriteMany access, etc. -## Approach 2: Endpoints associated with a component are statically defined on the DGD +### subscribe -This approach is similar to the first approach, but the endpoints and any metadata about them are statically defined on the DGD. +This method is to subscribe to events for the set of instances that match (namespace, component). Returns a stream of events giving signals for when instances that match the subscription are created, deleted, or readiness status changes. This can be used by the frontend to maintain its inventory of ready decode workers. It can be used by decode workers to maintain its inventory of ready prefill workers. -Instead of a pod dynamically adding and updating its own annotations, the operator can define a set of labels on pod start that enable it to be watched by clients. - -```yaml -# Example DGD -apiVersion: v1 -kind: DGD -metadata: - name: dynamo-dgd-abc123 - services: - decode: # the decode component defines the endpoints it is associated with - endpoints: - - generate - - generate_tokens - -# Pods are created with labels for namespace, component, and endpoints that can then be used to filter by kube watch -apiVersion: v1 -kind: Pod -metadata: - name: dynamo-pod-abc123 - labels: - nvidia.com/dynamo-namespace: dynamo - nvidia.com/dynamo-component: backend - nvidia.com/expose-dynamo-endpoint/generate: true - nvidia.com/expose-dynamo-endpoint/generate_tokens: true +```python +# Context: frontend has watch setup on decode workers +stream = service_discovery.subscribe("dynamo", "decode") +for event in stream: + switch event: + case NewReadyInstanceEvent: + metadata = service_discovery.get_metadata("dynamo", "decode", instance_id) + model = metadata["Model"] + # register the model in-memory model registry + # map the instance to the model ``` -### APIs +#### Kubernetes Reference Impl +- Sets up a kubectl watch for EndpointSlices corresponding to the Service that selects on the namespace and component labels. +- Returns a stream of events that inform us when a READY pod matching the namespace and component is up. + +### set_instance_status / get_instance_status -APIs stay the same +This method is to mark the instance as ready or not ready for traffic. ```python -# SERVER SIDE +# Context: decode worker has finished loading the model +metadata = { + "Model": { + "name": "Qwen3-32B", + "type": "Completions", + "runtime_config": { + "total_kv_blocks": 24064, + "max_num_seqs": 256, + "max_num_batched_tokens": 2048 + } + } +} +decode_worker = service_discovery.create_instance(namespace, component, metadata) +# Set up endpoints/transport. This can be http/grpc/nats. When the underlying transport is ready to serve, mark the instance as ready for discovery. +await start_http_server() +# When the server is ready to serve, mark the instance as ready for discovery. +service_discovery.set_instance_status(decode_worker, "ready") +``` -# Operator creates a pod with labels for namespace, component. The namespace, component, endpoints, and pod name are injected into the pod. -endpoint = get_component().endpoint("generate") -# exposes the endpoint on the transport -# NOTE: This endpoint will be discoverable as soon as the pod marks itself ready. (No explicit registration necessary) -await endpoint.serve_endpoint(RequestHandler().generate) +#### Kubernetes Reference Impl +- The readiness probe of the pod will proxy the result of `get_instance_status`. 
When the instance will be ready for traffic, the readiness probe will return 200 and EndpointSlices will be updated to include the instance. +- Components that have called `subscribe` on this component will be notified when the instance is ready for traffic. They can use this instance to route traffic to using their transport. -# CLIENT SIDE +### write_kv / read_kv -# setup a kube watch for pods matching the labels for namespace, component, endpoint -# block until a ready pod is found that matches the labels for namespace, component -endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") -client = await endpoint.client() -await client.wait_for_instances() +This method is to write and read data at a given key. This is a useful place to store the model card (e.g `mdc/{model-slug}`) and other model specific information. -# Send the message via the transport. (e.g if transport is NATS, the message is sent to the NATS topic that is a function of (namespace, component, endpoint, pod name)) -# errors out if the component doesn't actually have such an endpoint -stream = await client.generate("request") +```python +# Context: decode worker is registering the MDC before it marks itself as ready +service_discovery.write_kv("mdc/Qwen3-32B", mdc) ``` -Notes: -- Endpoints of the component are codified in the spec itself -- TODO: enforce that the application logic exposes the endpoints that it is supposed to according to what is defined in the spec +#### Kubernetes Reference Impl +- Writes a ConfigMap with the attached data. ConfigMap name is a function of the key. + +## Kubernetes EndpointSlice Discovery Mechanism + +This section provides a detailed explanation of how the ServiceDiscovery interface is implemented using Kubernetes EndpointSlices for service discovery. -## Approach 3: Endpoints are dynamically exposed and discovered on components +### Core Components -This approach is very similar to the first approach, but we components dynamically expose endpoints. Idea from Julien M. +The Kubernetes implementation relies on three main Kubernetes resources: -High level ideas: -- Pod adds an annotation on itself to list out the endpoints it exposes. The labels for namespace and component are injected as before. -- Clients watch for pod events on the tuple (namespace, component) and do a client side filter to check if there is an annotation on the pod matching the endpoint they are interested in. +1. **Pods**: Each worker instance runs as a pod with specific labels +2. **Service**: Selects pods based on namespace and component labels +3. 
**EndpointSlices**: Automatically managed by Kubernetes, tracks ready pod endpoints +### Resource Structure + +#### Pod Labels ```yaml -# Example pod annotation apiVersion: v1 kind: Pod metadata: - name: dynamo-pod-abc123 + name: dynamo-decode-pod-abc123 labels: - nvidia.com/dynamo-namespace: dynamo # set by pod - nvidia.com/dynamo-component: backend - annotations: - dynamo.nvidia.com/endpoint/generate="{...}" # pod dynamically adds and updates this annotatino - dynamo.nvidia.com/endpoint/generate_tokens="{...}" + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: decode +spec: + readinessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 10 +``` + +#### Service Selector +```yaml +apiVersion: v1 +kind: Service +metadata: + name: dynamo-decode-service +spec: + selector: + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: decode + ports: + - port: 8080 # The port itself is a dummy port. It is not used for traffic (although it could be). + targetPort: 8080 + name: http +``` +#### EndpointSlice (Auto-generated) +```yaml +apiVersion: discovery.k8s.io/v1 +kind: EndpointSlice +metadata: + name: dynamo-decode-service-abc123 + labels: + kubernetes.io/service-name: dynamo-decode-service + nvidia.com/dynamo-namespace: dynamo + nvidia.com/dynamo-component: decode +addressType: IPv4 +endpoints: +- addresses: ["10.1.2.3"] + conditions: + ready: true # Based on pod readiness probe + targetRef: + kind: Pod + name: dynamo-decode-pod-abc123 + namespace: default +ports: +- name: http + port: 8080 + protocol: TCP ``` -### APIs +### Discovery Flow -APIs stay the same +1. **Instance Registration**: When `create_instance()` is called, the pod creates a ConfigMap with instance metadata +2. **Readiness Signaling**: When `set_instance_status(instance, "ready")` is called, the pod's readiness probe starts returning 200 +3. **EndpointSlice Update**: Kubernetes automatically updates the EndpointSlice to mark the endpoint as ready +4. **Event Propagation**: Clients watching EndpointSlices receive events about the ready instance -```python -# SERVER SIDE - -# Operator creates a pod with labels for namespace, component. The namespace, component, and pod name are injected into the pod. -# get_component() returns a struct with info on the current instance's namespace, component, and pod name. 
-endpoint = get_component().endpoint("generate") -# NEW: Modifies its own annotation to add the endpoint to the list of endpoints it serves -# subscribes to a NATS topic that is a function of (namespace, component, endpoint, pod name) -await endpoint.serve_endpoint(RequestHandler().generate) - -# CLIENT SIDE - -# setup a kube watch for pods matching the labels for namespace, component -# filter the pods to only include ones that have the endpoint in their annotation -# block until a READY pod is found that matches the labels for namespace, component and has the endpoint in their annotation -endpoint = runtime.namespace("dynamo").component("backend").endpoint("generate") -client = await endpoint.client() -await client.wait_for_instances() - -# write to the NATS topic that is a function of (namespace, component, endpoint, pod name) -stream = await client.generate("request") -``` +### Event Mapping -```mermaid -sequenceDiagram - participant Pod - participant KubeAPI - participant Client +The `subscribe()` method maps EndpointSlice events to ServiceDiscovery events: - Pod->>KubeAPI: Pod created with namespace/component labels - Pod->>KubeAPI: Add endpoint annotation dynamically - Pod->>Pod: Readiness probe becomes healthy - Pod->>KubeAPI: Pod status updated to Ready +| EndpointSlice Event | ServiceDiscovery Event | Description | +|---------------------|------------------------|-------------| +| Endpoint added with `ready: true` | `NewReadyInstanceEvent` | New instance is ready for traffic | +| Endpoint condition changes to `ready: true` | `InstanceReadyEvent` | Existing instance becomes ready | +| Endpoint condition changes to `ready: false` | `InstanceNotReadyEvent` | Instance becomes unavailable | +| Endpoint removed from slice | `InstanceDeletedEvent` | Instance is terminated | - Client->>KubeAPI: Watch pods with matching labels - KubeAPI-->>Client: Send pod events with annotations - Client->>Client: Filter pods by endpoint annotation - Client->>Client: Update local cache of matching ready pods - Client->>Pod: Route requests to selected pod instance +### Implementation Details - Note over Client: Client-side filtering by endpoint annotation -``` +#### Instance ID Resolution +- **Instance ID**: Pod name (e.g., `dynamo-decode-pod-abc123`) +- **EndpointSlice Reference**: `targetRef.name` points to the pod name +- **Metadata Lookup**: ConfigMap name derived from pod name for `get_metadata()` -Notes: -- Allows for storage of metadata associated with the endpoint in the annotation -- Allows for dynamic updates of endpoints -- Client side parsing and filtering of the annotation may add overhead -- Honors the API, blocks until we know for sure an endpoint is available on the given component +#### Watch Implementation +```python +# Kubernetes watch setup for subscribe() +watch_filter = { + "labelSelector": f"kubernetes.io/service-name=dynamo-{component}-service" +} +endpoint_slice_watch = k8s_client.watch_endpoint_slices(filter=watch_filter) + +for event in endpoint_slice_watch: + if event.type == "MODIFIED": + for endpoint in event.object.endpoints: + if endpoint.conditions.ready: + emit_event(NewReadyInstanceEvent( + instance_id=endpoint.targetRef.name, + address=endpoint.addresses[0] + )) +``` +### Benefits of EndpointSlice Approach -feedback: -- add examples of labels -- remove mentions to nats transport -- reorder 3 to 2 in the doc -- workers have associated configmap that the worker can read/write to +1. **Native Kubernetes Integration**: Leverages built-in service discovery +2. 
**Automatic Cleanup**: Pods and EndpointSlices are cleaned up when instances terminate +3. **Scalability**: EndpointSlices handle large numbers of endpoints efficiently +4. **Consistency**: Kubernetes ensures eventual consistency across the cluster +5. **Health Integration**: Readiness probes directly control traffic eligibility From 63f391a79d03f9644dfbe1d7e70e87ba8d964964 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Tue, 9 Sep 2025 09:08:24 -0700 Subject: [PATCH 14/19] Update deps/etcd-k8s.md Co-authored-by: Neelay Shah Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 9ebd9aa..3206198 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -1,4 +1,4 @@ -# ETCD-less Dynamo in Kubernetes: Endpoint Instance Discovery +# Service and Model Discovery Interface ## Background From df9ddfac245e189bb5e01a4d3355b25da166a828 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Wed, 10 Sep 2025 08:02:44 -0700 Subject: [PATCH 15/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 266 ++++++++++++++++++++++++++++++----------------- 1 file changed, 168 insertions(+), 98 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 3206198..1f1d68d 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -1,4 +1,4 @@ -# Service and Model Discovery Interface +# ETCD-less Dynamo in Kubernetes: Endpoint Instance Discovery ## Background @@ -16,24 +16,55 @@ ETCD usage in Dynamo can be categorized into the following: To de-couple Dynamo from ETCD, we define a minimal `ServiceDiscovery` interface that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. -### Methods -- `create_instance`(namespace: str, component: str, metadata: dict) -> Instance - - Creates an instance and persists associated immutable metadata. Does not mark the instance as ready. -- `list_instances`(namespace: str, component: str) -> List[Instance] - - Lists all instances that match the given namespace and component. -- `get_metadata`(namespace: str, component: str, instance_id: str) -> Instance - - Returns the metadata for an instance. -- `subscribe`(namespace: str, component: str) -> EventStream - - Subscribes to events for the set of instances that match (namespace, component). Returns a stream of events giving signals for when instances that match the subscription are created, deleted, or readiness status changes. -- `set_instance_status`(instance: Instance, status: str) - - Marks the instance as ready or not ready for traffic. -- `get_instance_status`(instance: Instance) -> str - - Returns the status of the instance. - -- `write_kv`(key: str, value: str) - - Writes data at a given key. Data not associated with any instance. -- `read_kv`(key: str) - - Reads data at a given key. Data not associated with any instance. +### Server-Side (ServiceRegistry) + +#### Methods +- `register_instance(namespace: str, component: str) -> InstanceHandle` + - Registers a new instance of the given namespace and component. Returns an InstanceHandle. + +- `InstanceHandle.instance_id() -> str` + - Returns the instance id. Returns the unique identifier for the instance in the ServiceRegistry. + +- `InstanceHandle.set_metadata(metadata: dict) -> None` + - Write the metadata associated with the instance. 
+ +- `InstanceHandle.set_ready(status: InstanceStatus) -> None` + - Marks the instance as ready or not ready for traffic. + + +### Client-Side (ServiceDiscovery) + +#### Methods +- `list_instances(namespace: str, component: str) -> List[Instance]` + - Lists all instances that match the given namespace and component. Returns a list of Instance objects. + +- `watch(namespace: str, component: str) -> InstanceEventStream` + - Watches for events for the set of instances that match `(namespace, component)`. + - Returns a stream of events (InstanceAddedEvent, InstanceRemovedEvent) + +- `Instance.metadata() -> dict` + - Returns the metadata for a specific instance. + +## Where will these APIs be used? + +These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. + +Some examples of code that will be impacted: + +Frontend: +(How the frontend discovers workers and maintains inventory of model to worker mappings) +- run_watcher function at [lib/llm/src/entrypoint/input/http.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/entrypoint/input/http.rs) +- ModelWatcher at [lib/llm/src/discovery/watcher.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/discovery/watcher.rs) + +Model registration (register_llm function): +(Initial worker bootstrapping as part of register_llm) +- LocalModel.attach at [lib/llm/src/local_model.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/local_model.rs) + +Dynamo Runtime: +(Used for registering on the runtime (Server) and for getting clients to components (Client)) +- client.rs [lib/runtime/src/client.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/runtime/src/client.rs) +- endpoint.rs [lib/runtime/src/endpoint.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/runtime/src/endpoint.rs) + ### Overall Flow @@ -48,18 +79,39 @@ Credits: @itay #### Frontend Discovers Decode Workers -- Decode calls `create_instance("dynamo", "decode", metadata)` and `set_instance_status(decode_worker, "ready")` to register itself with the service discovery backend. Its metadata includes the model it serves. -- Frontend calls `list_instances("dynamo", "decode")` and `get_metadata` to get all decode workers and the associated model information to bootstrap its cache. -- Frontend sets up a watch on the decode workers using `subscribe("dynamo", "decode")` to keep its cache up to date. +##### Decode Worker Set Up + +```python +# Register the instance +decode_worker = service_registry.register_instance("dynamo", "decode") +instance_id = decode_worker.instance_id() # Unique identifier for the instance + +# Start up NATS listener for the endpoints of this component (or other transport). We may or may not need the instance id to setup the transport. 
+comp_transport_details = set_up_nats_listener(instance_id) + +# Write metadata associated with the instance +metadata = { + "model": {...}, # Runtime Info + "transport": comp_transport_details, # Transport details for this component + "mdc": {...}, # Model Deployment Card +} + +decode_worker.set_metadata(metadata) +decode_worker.set_ready("ready") +``` + +##### Frontend Discovers Decode Workers ```python # Frontend start: decode_workers = service_discovery.list_instances("dynamo", "decode") for decode_worker in decode_workers: # Fetch the associated metadata for the instance to register the model in-memory model registry - metadata = service_discovery.get_metadata("dynamo", "decode", decode_worker.instance_id) + metadata = decode_worker.metadata() model = metadata["Model"] # map the instance to the model in the in-memory model registry ... + +# Sets up watch to keep cache up to date ``` #### Frontend Needs to Know what Model Key -> Instance Mapping @@ -68,11 +120,11 @@ Addressed above. #### Frontend Needs to Know Some Model Specifics (like Tokenizer) -- Decode writes a Model Card using the `write_kv` method. Frontend can simply read this entry. +- Model card is duplicated on each decode worker in the metadata. #### Decode Worker Needs to Know how to Reach Prefill Workers -- Decode worker does the exact same thing as the frontend, instead listing and subscribing to the "prefill" component. +- This is done in the exact same way as the way the frontend reaches the decode workers. @@ -81,13 +133,30 @@ Addressed above. This table relates when each method is used in the context of endpoint instance discovery and model / worker management. We will also compare reference impls for each method in etcd and kubernetes. -### create_instance +### register_instance -This method is used to create an instance and persist associated metadata tied to the life of the instance. Notably, this metadata could capture immutable information associated with the instance such as model information (e.g what is currently stored in ETCD as `models/{uuid}`). +This method is used to register a new instance of the given namespace and component. Returns an InstanceHandle that can be used to manage the instance. ```python -# Example: Registration of a decode worker with associated model information -# Currently, this is being stored as part of the register_llm function in ETCD as `models/{uuid}` +# Example: Registration of a decode worker +# Server-side: Register the instance +decode_worker = service_registry.register_instance("dynamo", "decode") +instance_id = decode_worker.instance_id() # Unique identifier for the instance +``` + +#### Kubernetes Reference Impl +- Asserts the pod has namespace and component labels that match up with method args. Fails otherwise. +- Asserts there is a Kubernetes Service that will select on namespace and component labels. (More on this later) +- Returns an InstanceHandle object tied to the pod name. This can be used to fetch the unique identifier for the instance. + +Note: The instance registration is tied to the pod lifecycle. + +### set_metadata + +This method is used to write metadata associated with an instance. 
+ +```python +# Server-side: Set metadata for the instance metadata = { "Model": { "name": "Qwen3-32B", @@ -97,49 +166,69 @@ metadata = { "max_num_seqs": 256, "max_num_batched_tokens": 2048 } - } + }, + "Transport": {...}, # Transport details for this component + "MDC": {...}, # Model Deployment Card } -decode_worker = service_discovery.create_instance(namespace, component, metadata) +decode_worker.set_metadata(metadata) +``` + +#### Implementation Details +- Updates an in-memory struct within the component process +- Exposes the metadata via a `/metadata` HTTP endpoint on the component +- Metadata is available immediately after being set, no external storage required + +### set_ready + +This method is used to mark the instance as ready or not ready for traffic. + +```python +# Context: decode worker has finished loading the model +# Register the instance +decode_worker = service_registry.register_instance("dynamo", "decode") +await start_http_server() +decode_worker.set_metadata(metadata) +# When the server is ready to serve, mark the instance as ready for discovery. +decode_worker.set_ready("ready") ``` #### Kubernetes Reference Impl -- Asserts the pod has namespace and component labels that match up with method args. Fails otherwise. -- Asserts there is a Kubernetes Service that will select on namespace and component labels. (More on this later) -- Writes a ConfigMap with the attached metadata. ConfigMap name is a function of the pod name. Returns an Instance object. +- The readiness probe of the pod will proxy the result of the instance's readiness status. When the instance is ready for traffic, the readiness probe will return 200 and EndpointSlices will be updated to include the instance. +- Components that have called `watch` on this component will be notified when the instance is ready for traffic. They can use this instance to route traffic to using their transport. -Note: There are many ways to persist and access this immutable metadata. We can use ConfigMaps, a PVC with ReadWriteMany access, etc. -### get_metadata +### list_instances -This method is to fetch the metadata associated with an instance. For instance, the frontend can use this method to fetch the model information associated with an instance when it receives an event that a new worker is ready. +This method is used to list all instances that match the given namespace and component. Returns a list of Instance objects. ```python -# Context: frontend has watch setup on decode workers -stream = service_discovery.subscribe("dynamo", "decode") -for event in stream: - switch event: - case NewReadyInstanceEvent: - # Fetch the associated metadata for the instance to register the model in-memory model registry - metadata = service_discovery.get_metadata("dynamo", "decode", instance_id) - model = metadata["Model"] +# Context: frontend startup - discover existing decode workers +decode_workers = service_discovery.list_instances("dynamo", "decode") +for decode_worker in decode_workers: + # Fetch the associated metadata for the instance + metadata = decode_worker.metadata() + model = metadata["Model"] + # register the model in-memory model registry + # map the instance to the model ``` #### Kubernetes Reference Impl -- Fetches the associated ConfigMap with the attached metadata. +- Queries EndpointSlices for the Service that selects on the namespace and component labels. +- Returns a list of Instance objects for all ready endpoints. -Note: There are many ways to persist and access this immutable metadata. 
We can use ConfigMaps, a PVC with ReadWriteMany access, etc. -### subscribe +### watch -This method is to subscribe to events for the set of instances that match (namespace, component). Returns a stream of events giving signals for when instances that match the subscription are created, deleted, or readiness status changes. This can be used by the frontend to maintain its inventory of ready decode workers. It can be used by decode workers to maintain its inventory of ready prefill workers. +This method is used to watch for events for the set of instances that match (namespace, component). Returns a stream of events (InstanceAddedEvent, InstanceRemovedEvent) giving signals for when instances that match the subscription are created, deleted, or readiness status changes. ```python # Context: frontend has watch setup on decode workers -stream = service_discovery.subscribe("dynamo", "decode") +stream = service_discovery.watch("dynamo", "decode") for event in stream: switch event: - case NewReadyInstanceEvent: - metadata = service_discovery.get_metadata("dynamo", "decode", instance_id) + case InstanceAddedEvent: + # Fetch the associated metadata for the instance + metadata = event.instance.metadata() model = metadata["Model"] # register the model in-memory model registry # map the instance to the model @@ -149,45 +238,25 @@ for event in stream: - Sets up a kubectl watch for EndpointSlices corresponding to the Service that selects on the namespace and component labels. - Returns a stream of events that inform us when a READY pod matching the namespace and component is up. -### set_instance_status / get_instance_status +### Instance.metadata() -This method is to mark the instance as ready or not ready for traffic. +This method is used to fetch the metadata associated with a specific instance. It makes an HTTP request to the instance's `/metadata` endpoint. ```python -# Context: decode worker has finished loading the model -metadata = { - "Model": { - "name": "Qwen3-32B", - "type": "Completions", - "runtime_config": { - "total_kv_blocks": 24064, - "max_num_seqs": 256, - "max_num_batched_tokens": 2048 - } - } -} -decode_worker = service_discovery.create_instance(namespace, component, metadata) -# Set up endpoints/transport. This can be http/grpc/nats. When the underlying transport is ready to serve, mark the instance as ready for discovery. -await start_http_server() -# When the server is ready to serve, mark the instance as ready for discovery. -service_discovery.set_instance_status(decode_worker, "ready") -``` - -#### Kubernetes Reference Impl -- The readiness probe of the pod will proxy the result of `get_instance_status`. When the instance will be ready for traffic, the readiness probe will return 200 and EndpointSlices will be updated to include the instance. -- Components that have called `subscribe` on this component will be notified when the instance is ready for traffic. They can use this instance to route traffic to using their transport. - -### write_kv / read_kv - -This method is to write and read data at a given key. This is a useful place to store the model card (e.g `mdc/{model-slug}`) and other model specific information. 
- -```python -# Context: decode worker is registering the MDC before it marks itself as ready -service_discovery.write_kv("mdc/Qwen3-32B", mdc) +# Client-side: Get metadata from a specific instance +decode_workers = service_discovery.list_instances("dynamo", "decode") +for decode_worker in decode_workers: + # Fetch the associated metadata for the instance + metadata = decode_worker.metadata() + model = metadata["Model"] + # register the model in-memory model registry + # map the instance to the model ``` -#### Kubernetes Reference Impl -- Writes a ConfigMap with the attached data. ConfigMap name is a function of the key. +#### Implementation Details +- Makes an HTTP GET request to `/metadata` on the instance +- Returns the metadata payload stored in the component's in-memory struct +- No external storage lookup required - direct communication with the instance ## Kubernetes EndpointSlice Discovery Mechanism @@ -264,32 +333,33 @@ ports: ### Discovery Flow -1. **Instance Registration**: When `create_instance()` is called, the pod creates a ConfigMap with instance metadata -2. **Readiness Signaling**: When `set_instance_status(instance, "ready")` is called, the pod's readiness probe starts returning 200 -3. **EndpointSlice Update**: Kubernetes automatically updates the EndpointSlice to mark the endpoint as ready -4. **Event Propagation**: Clients watching EndpointSlices receive events about the ready instance +1. **Instance Registration**: When `register_instance()` is called, the InstanceHandle is created and tied to the pod +2. **Metadata Setup**: When `set_metadata()` is called, the metadata is stored in-memory and exposed via `/metadata` endpoint +3. **Readiness Signaling**: When `set_ready("ready")` is called, the pod's readiness probe starts returning 200 +4. **EndpointSlice Update**: Kubernetes automatically updates the EndpointSlice to mark the endpoint as ready +5. 
**Event Propagation**: Clients watching EndpointSlices receive events about the ready instance ### Event Mapping -The `subscribe()` method maps EndpointSlice events to ServiceDiscovery events: +The `watch()` method maps EndpointSlice events to ServiceDiscovery events: | EndpointSlice Event | ServiceDiscovery Event | Description | |---------------------|------------------------|-------------| -| Endpoint added with `ready: true` | `NewReadyInstanceEvent` | New instance is ready for traffic | -| Endpoint condition changes to `ready: true` | `InstanceReadyEvent` | Existing instance becomes ready | -| Endpoint condition changes to `ready: false` | `InstanceNotReadyEvent` | Instance becomes unavailable | -| Endpoint removed from slice | `InstanceDeletedEvent` | Instance is terminated | +| Endpoint added with `ready: true` | `InstanceAddedEvent` | New instance is ready for traffic | +| Endpoint condition changes to `ready: true` | `InstanceAddedEvent` | Existing instance becomes ready | +| Endpoint condition changes to `ready: false` | `InstanceRemovedEvent` | Instance becomes unavailable | +| Endpoint removed from slice | `InstanceRemovedEvent` | Instance is terminated | ### Implementation Details #### Instance ID Resolution - **Instance ID**: Pod name (e.g., `dynamo-decode-pod-abc123`) - **EndpointSlice Reference**: `targetRef.name` points to the pod name -- **Metadata Lookup**: ConfigMap name derived from pod name for `get_metadata()` +- **Metadata Lookup**: HTTP request to `{instance_address}/metadata` for `Instance.metadata()` #### Watch Implementation ```python -# Kubernetes watch setup for subscribe() +# Kubernetes watch setup for watch() watch_filter = { "labelSelector": f"kubernetes.io/service-name=dynamo-{component}-service" } @@ -299,7 +369,7 @@ for event in endpoint_slice_watch: if event.type == "MODIFIED": for endpoint in event.object.endpoints: if endpoint.conditions.ready: - emit_event(NewReadyInstanceEvent( + emit_event(InstanceAddedEvent( instance_id=endpoint.targetRef.name, address=endpoint.addresses[0] )) From 388dca1aaee9543c4415b29375ee0a7c44f7f15b Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Wed, 10 Sep 2025 08:28:55 -0700 Subject: [PATCH 16/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 16 ++++++++++++---- 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 1f1d68d..91b9806 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -12,9 +12,18 @@ ETCD usage in Dynamo can be categorized into the following: - Worker Port Conflict Resolution - Router State Sharing Synchronization +## Breakdown of the Implementation +- Remove ETCD* from top level APIs +- Define ServiceDiscovery/ServiceRegistry interfaces in Rust +- Implementations: + - Kubernetes + - Existing impl for ETCD needs to be moved behind interface + - TODO: Local/file system +- + ## ServiceDiscovery Interface -To de-couple Dynamo from ETCD, we define a minimal `ServiceDiscovery` interface that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. +To de-couple Dynamo from ETCD, we define `ServiceDiscovery` and `ServiceRegistry` interfaces that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. 
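+
+To preview the overall shape before the per-method detail below, here is an illustrative sketch of the two interfaces as Python protocols. The real definitions would be Rust traits in the runtime; Python is used only to match the examples in this document, and names such as `InstanceEvent` and `InstanceStatus` are placeholders rather than settled API:
+
+```python
+from dataclasses import dataclass
+from typing import Iterator, List, Protocol
+
+InstanceStatus = str  # simplified; could be an enum of "ready" / "not_ready"
+
+
+@dataclass
+class InstanceEvent:
+    """Stand-in for InstanceAddedEvent / InstanceRemovedEvent."""
+    kind: str  # "added" or "removed"
+    instance_id: str
+
+
+class InstanceHandle(Protocol):
+    def instance_id(self) -> str: ...
+    def set_metadata(self, metadata: dict) -> None: ...
+    def set_ready(self, status: InstanceStatus) -> None: ...
+
+
+class ServiceRegistry(Protocol):
+    def register_instance(self, namespace: str, component: str) -> InstanceHandle: ...
+
+
+class Instance(Protocol):
+    def metadata(self) -> dict: ...
+
+
+class ServiceDiscovery(Protocol):
+    def list_instances(self, namespace: str, component: str) -> List[Instance]: ...
+    def watch(self, namespace: str, component: str) -> Iterator[InstanceEvent]: ...
+```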
### Server-Side (ServiceRegistry) @@ -47,11 +56,11 @@ To de-couple Dynamo from ETCD, we define a minimal `ServiceDiscovery` interface ## Where will these APIs be used? -These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. +These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. We might have to adjust top level APIs to use Some examples of code that will be impacted: -Frontend: +Frontend Worker Discovery: (How the frontend discovers workers and maintains inventory of model to worker mappings) - run_watcher function at [lib/llm/src/entrypoint/input/http.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/entrypoint/input/http.rs) - ModelWatcher at [lib/llm/src/discovery/watcher.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/discovery/watcher.rs) @@ -65,7 +74,6 @@ Dynamo Runtime: - client.rs [lib/runtime/src/client.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/runtime/src/client.rs) - endpoint.rs [lib/runtime/src/endpoint.rs](https://github.com/ai-dynamo/dynamo/blob/main/lib/runtime/src/endpoint.rs) - ### Overall Flow At a high level, these are the service discovery requirements in a given disaggregated inference setup: From 5aef0b092a4e1939aac907dfee6a9b5a87270c66 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Wed, 10 Sep 2025 08:30:07 -0700 Subject: [PATCH 17/19] Rename document title to Service and Model Discovery Interface Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 91b9806..7bd94f3 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -1,4 +1,4 @@ -# ETCD-less Dynamo in Kubernetes: Endpoint Instance Discovery +# Service and Model Discovery Interface ## Background From bab299ab2fbb9c716ef43bcb992065db3f6a7b12 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Wed, 10 Sep 2025 08:43:44 -0700 Subject: [PATCH 18/19] Update etcd-k8s.md Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 53 ++++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 27 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 7bd94f3..7f8dea8 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -12,27 +12,18 @@ ETCD usage in Dynamo can be categorized into the following: - Worker Port Conflict Resolution - Router State Sharing Synchronization -## Breakdown of the Implementation -- Remove ETCD* from top level APIs -- Define ServiceDiscovery/ServiceRegistry interfaces in Rust -- Implementations: - - Kubernetes - - Existing impl for ETCD needs to be moved behind interface - - TODO: Local/file system -- - ## ServiceDiscovery Interface -To de-couple Dynamo from ETCD, we define `ServiceDiscovery` and `ServiceRegistry` interfaces that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. +To de-couple Dynamo from ETCD, we define a `ServiceDiscovery` interface that can be satisfied by different backends (etcd, kubernetes, etc). In Kubernetes environments, we will use the Kubernetes APIs to implement this interface. 
-### Server-Side (ServiceRegistry) +### Server-Side (Service Registry) #### Methods - `register_instance(namespace: str, component: str) -> InstanceHandle` - Registers a new instance of the given namespace and component. Returns an InstanceHandle. - `InstanceHandle.instance_id() -> str` - - Returns the instance id. Returns the unique identifier for the instance in the ServiceRegistry. + - Returns a unique identifier for the instance. (Similar to the primary lease id in ETCD) - `InstanceHandle.set_metadata(metadata: dict) -> None` - Write the metadata associated with the instance. @@ -41,7 +32,7 @@ To de-couple Dynamo from ETCD, we define `ServiceDiscovery` and `ServiceRegistry - Marks the instance as ready or not ready for traffic. -### Client-Side (ServiceDiscovery) +### Client-Side (Service Discovery) #### Methods - `list_instances(namespace: str, component: str) -> List[Instance]` @@ -54,9 +45,18 @@ To de-couple Dynamo from ETCD, we define `ServiceDiscovery` and `ServiceRegistry - `Instance.metadata() -> dict` - Returns the metadata for a specific instance. +## Breakdown of the Implementation +- Remove ETCD* from top level APIs +- Define ServiceDiscovery interfaces in Rust +- Implementations: + - Kubernetes + - Existing impl for ETCD needs to be moved behind interface + - TODO: Local/file system +- Re-factor existing code to use new interfaces ([See Where will these APIs be used?](#where-will-these-apis-be-used)) + ## Where will these APIs be used? -These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. We might have to adjust top level APIs to use +These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. Some examples of code that will be impacted: @@ -76,7 +76,7 @@ Dynamo Runtime: ### Overall Flow -At a high level, these are the service discovery requirements in a given disaggregated inference setup: +At a high level, these are the service discovery requirements in a given disaggregated inference setup. We note how the APIs defined above can be used to achieve this. Credits: @itay @@ -139,8 +139,6 @@ Addressed above. ## API Reference -This table relates when each method is used in the context of endpoint instance discovery and model / worker management. We will also compare reference impls for each method in etcd and kubernetes. - ### register_instance This method is used to register a new instance of the given namespace and component. Returns an InstanceHandle that can be used to manage the instance. @@ -155,18 +153,18 @@ instance_id = decode_worker.instance_id() # Unique identifier for the instance #### Kubernetes Reference Impl - Asserts the pod has namespace and component labels that match up with method args. Fails otherwise. - Asserts there is a Kubernetes Service that will select on namespace and component labels. (More on this later) -- Returns an InstanceHandle object tied to the pod name. This can be used to fetch the unique identifier for the instance. +- Returns an InstanceHandle object tied to the pod name. This can be used to fetch the unique identifier for the instance which can be used in setting up the transport. Note: The instance registration is tied to the pod lifecycle. ### set_metadata -This method is used to write metadata associated with an instance. +This method is used to set metadata associated with an instance. 
```python # Server-side: Set metadata for the instance metadata = { - "Model": { + "model": { "name": "Qwen3-32B", "type": "Completions", "runtime_config": { @@ -175,16 +173,16 @@ metadata = { "max_num_batched_tokens": 2048 } }, - "Transport": {...}, # Transport details for this component - "MDC": {...}, # Model Deployment Card + "transport": {...}, # Transport details for this component + "mdc": {...}, # Model Deployment Card } decode_worker.set_metadata(metadata) ``` #### Implementation Details -- Updates an in-memory struct within the component process +- Updates an in-memory struct within the instance process - Exposes the metadata via a `/metadata` HTTP endpoint on the component -- Metadata is available immediately after being set, no external storage required +- No external storage required. ### set_ready @@ -192,10 +190,12 @@ This method is used to mark the instance as ready or not ready for traffic. ```python # Context: decode worker has finished loading the model -# Register the instance + decode_worker = service_registry.register_instance("dynamo", "decode") -await start_http_server() + +# Setup handlers for the decode component, then setup transport. decode_worker.set_metadata(metadata) + # When the server is ready to serve, mark the instance as ready for discovery. decode_worker.set_ready("ready") ``` @@ -224,7 +224,6 @@ for decode_worker in decode_workers: - Queries EndpointSlices for the Service that selects on the namespace and component labels. - Returns a list of Instance objects for all ready endpoints. - ### watch This method is used to watch for events for the set of instances that match (namespace, component). Returns a stream of events (InstanceAddedEvent, InstanceRemovedEvent) giving signals for when instances that match the subscription are created, deleted, or readiness status changes. From 981844c8e1ca48552c0cc6a9ff34bb86b61c23c2 Mon Sep 17 00:00:00 2001 From: mohammedabdulwahhab Date: Wed, 10 Sep 2025 08:44:34 -0700 Subject: [PATCH 19/19] Update implementation breakdown in etcd-k8s.md Removed outdated implementation details and restructured the breakdown of the implementation section. Signed-off-by: mohammedabdulwahhab --- deps/etcd-k8s.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/deps/etcd-k8s.md b/deps/etcd-k8s.md index 7f8dea8..935a2df 100644 --- a/deps/etcd-k8s.md +++ b/deps/etcd-k8s.md @@ -45,15 +45,6 @@ To de-couple Dynamo from ETCD, we define a `ServiceDiscovery` interface that can - `Instance.metadata() -> dict` - Returns the metadata for a specific instance. -## Breakdown of the Implementation -- Remove ETCD* from top level APIs -- Define ServiceDiscovery interfaces in Rust -- Implementations: - - Kubernetes - - Existing impl for ETCD needs to be moved behind interface - - TODO: Local/file system -- Re-factor existing code to use new interfaces ([See Where will these APIs be used?](#where-will-these-apis-be-used)) - ## Where will these APIs be used? These APIs are intended to be used internally within the Rust codebase where there are currently calls to `etcd_client` for service discovery and model management. @@ -389,3 +380,12 @@ for event in endpoint_slice_watch: 3. **Scalability**: EndpointSlices handle large numbers of endpoints efficiently 4. **Consistency**: Kubernetes ensures eventual consistency across the cluster 5. 
**Health Integration**: Readiness probes directly control traffic eligibility
+
+## Breakdown of the Implementation
+- Remove `ETCD*` from top-level APIs
+- Define ServiceDiscovery interfaces in Rust
+- Implementations:
+  - Kubernetes
+  - Existing impl for ETCD needs to be moved behind interface
+  - TODO: Local/file system
+- Refactor existing code to use the new interfaces (see [Where will these APIs be used?](#where-will-these-apis-be-used))
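To tie the pieces together, here is a rough sketch of how the Kubernetes-backed `list_instances` and `watch` described above might query EndpointSlices using the official `kubernetes` Python client. The namespace and label keys (`dynamo-namespace`, `dynamo-component`) are assumptions for illustration; the actual implementation is planned in Rust behind the ServiceDiscovery interface.

```python
# Sketch only: Kubernetes-backed discovery via EndpointSlices.
# Label keys, namespace, and function names are illustrative assumptions.
from kubernetes import client, config, watch

config.load_incluster_config()   # or config.load_kube_config() outside the cluster
discovery = client.DiscoveryV1Api()

K8S_NAMESPACE = "dynamo"                                        # assumed namespace
SELECTOR = "dynamo-namespace=dynamo,dynamo-component=backend"   # assumed label keys


def list_ready_instances():
    """Return pod names behind ready endpoints of the matching Service."""
    instances = []
    slices = discovery.list_namespaced_endpoint_slice(
        K8S_NAMESPACE, label_selector=SELECTOR
    )
    for es in slices.items:
        for ep in es.endpoints or []:
            if ep.conditions and ep.conditions.ready and ep.target_ref:
                instances.append(ep.target_ref.name)
    return instances


def watch_instances():
    """Yield (event_type, ready_pod_names) as matching EndpointSlices change."""
    w = watch.Watch()
    for event in w.stream(
        discovery.list_namespaced_endpoint_slice,
        K8S_NAMESPACE,
        label_selector=SELECTOR,
    ):
        es = event["object"]
        ready = [
            ep.target_ref.name
            for ep in (es.endpoints or [])
            if ep.conditions and ep.conditions.ready and ep.target_ref
        ]
        yield event["type"], ready
```

Because readiness flows through EndpointSlice conditions, `set_ready` maps naturally onto the pod's readiness probe, which is the "Health Integration" benefit listed above.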