feat: replace polling with event-driven metrics updates #3207

keivenchang · 2025-09-24T19:36:53Z

Overview:

Replace periodic polling-based metrics collection with event-driven updates triggered by ModelWatcher. This eliminates unnecessary background polling and makes metrics updates more responsive to actual model lifecycle events.

Details:

Replace periodic polling task with event-driven metrics updates triggered by ModelWatcher
Add update_model_metrics function that updates metrics when models are added
Remove start_runtime_config_polling_task and related polling infrastructure
Update integration test to use ModelWatcher directly instead of run_watcher
Add etcd_client to HttpServiceConfig for health endpoint functionality

Where should the reviewer start?

lib/llm/src/entrypoint/input/http.rs - Main changes to run_watcher function and new update_model_metrics
lib/llm/src/http/service/metrics.rs - Removal of polling task and cleanup of unused functions
lib/llm/tests/http_metrics.rs - Updated integration test to use ModelWatcher directly

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to DIS-676

/coderabbit profile chill

Summary by CodeRabbit

New Features
- Real-time model metrics updates during model add/remove events (no background polling required).
Bug Fixes
- Ensures all HTTP endpoints update correctly for both added and removed models.
Refactor
- Simplified service initialization by removing embedded polling and unused client wiring.
- Streamlined metrics handling within the service for greater reliability.
Tests
- Reworked integration tests to use a stepwise model lifecycle with explicit watcher setup and clearer metrics assertions.
Documentation
- Added comments clarifying metrics updates and endpoint behavior.

- Replace periodic polling task with event-driven metrics updates triggered by ModelWatcher - Add update_model_metrics function that updates metrics when models are added - Remove start_runtime_config_polling_task and related polling infrastructure - Update integration test to use ModelWatcher directly instead of run_watcher - Add etcd_client to HttpServiceConfig for health endpoint functionality This change improves performance by eliminating unnecessary polling and makes metrics updates more responsive to actual model lifecycle events. Signed-off-by: Keiven Chang <[email protected]>

coderabbitai · 2025-09-24T19:42:36Z

Walkthrough

Consolidates HttpService construction with etcd, passes metrics into the model watcher, extends watcher to update metrics and endpoints on model changes, removes HttpService’s etcd-held metrics polling, inlines MDC metrics updates inside Metrics, and rewrites the integration test to exercise the new watcher-driven metrics flow.

Changes

Cohort / File(s)	Summary
Entrypoint + Watcher wiring `lib/llm/src/entrypoint/input/http.rs`	Chains HttpService build with etcd; passes `metrics` from service state into `run_watcher`; clones model manager for task use; expands endpoint updates for Added/Removed; adds `update_model_metrics` handling runtime-config and MDC via etcd when available.
HTTP metrics internals `lib/llm/src/http/service/metrics.rs`	Removes `update_mdc_metrics`, `update_metrics_from_model_entry`, and `start_runtime_config_polling_task`; inlines MDC updates within `update_metrics_from_model_entry_with_mdc`; keeps runtime-config updates in the same method.
HTTP service struct/build `lib/llm/src/http/service/service_v2.rs`	Drops `etcd_client` from `HttpService`; stops starting metrics polling in `run`; builder now initializes `State` with optional etcd but does not store etcd in `HttpService`.
Integration test (metrics) `lib/llm/tests/http_metrics.rs`	Reorders setup; introduces `ModelWatcher` and explicit etcd watching; simplifies MDC registration and metrics verification; adjusts timing; asserts endpoint and MDC-related metrics via watcher-driven updates.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester as Test/Runner
  participant Builder as HttpServiceBuilder
  participant Http as HttpService
  participant State as State (metrics)
  participant Watcher as ModelWatcher Task
  participant Etcd as Etcd
  participant Metrics as Metrics

  Tester->>Builder: with_etcd_client(...).build()
  Builder->>State: new_with_etcd(optional Etcd)
  Builder->>Http: construct (no etcd field)
  Builder-->>Tester: HttpService

  Tester->>Watcher: run_watcher(model_manager, etcd, Metrics)
  Note over Watcher: Watch etcd for model add/remove

  Etcd-->>Watcher: ModelAdded(ModelType)
  Watcher->>Http: update_http_endpoints(Added)
  Watcher->>Metrics: update_model_metrics(Added, model_manager, etcd?)
  Note right of Metrics: Inline MDC + runtime-config updates

  Etcd-->>Watcher: ModelRemoved(ModelType)
  Watcher->>Http: update_http_endpoints(Removed)
  Watcher->>Metrics: update_model_metrics(Removed)<br/>(preserve existing metrics)

  Note over Http: No background polling for metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat: add NIM FE (num_request_max) + runtime config metrics with periodic polling #3107 — Touches the same metrics/service paths; this PR removes the polling and helpers introduced there, moving updates into the watcher and inlining MDC.
feat: Dynamic Endpoint Exposure Based on Model Type #1447 — Also modifies model-watcher ↔ HTTP-service integration and dynamic endpoint updates, overlapping with watcher signature/flow changes here.
fix: frontend metrics to be renamed from nv_llm_http_service_* => dynamo_frontend_* #2176 — Alters the metrics module internals; this PR further restructures metrics behavior and interfaces in the same file.

Poem

A rabbit taps code with a gentle thrum,
Watchers awake as the metrics hum.
No polls tonight—just signals swift,
Endpoints bloom with a model’s shift.
MDC whispers through etcd air—
Hops of joy: the gauges bear.
Carrots count, and all’s repaired.

Pre-merge checks

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title concisely summarizes the core change of replacing polling with event-driven metrics updates and aligns directly with the primary modifications in the code, making it clear and specific for reviewers scanning the history.
Description Check	✅ Passed	The description follows the repository’s template by providing distinct Overview, Details, Where to start, and Related Issues sections with clear information about the changes, their motivation, and the files to review.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lib/llm/src/entrypoint/input/http.rs (1)

291-333: Ensure consistent model label casing when updating metrics

Use lowercase model labels to align with other metric emitters and avoid duplicate time series.

Apply:

 async fn update_model_metrics(
@@
         ModelUpdate::Added(model_type) => {
             tracing::debug!("Updating metrics for added model type: {:?}", model_type);
 
             // Get all model entries and update metrics for matching types
             let model_entries = model_manager.get_model_entries();
             for entry in model_entries {
                 if entry.model_type == model_type {
                     // Update runtime config metrics if available
                     if let Some(runtime_config) = &entry.runtime_config {
-                        metrics.update_runtime_config_metrics(&entry.name, runtime_config);
+                        let model = entry.name.to_lowercase();
+                        metrics.update_runtime_config_metrics(&model, runtime_config);
                     }
 
                     // Update MDC metrics if etcd is available
                     if let Some(ref etcd) = etcd_client
                         && let Err(e) = metrics
                             .update_metrics_from_model_entry_with_mdc(&entry, etcd)
                             .await
                     {

🧹 Nitpick comments (4)

lib/llm/src/http/service/metrics.rs (2)

497-509: Normalize model label casing and clamp MDC numeric conversions

Use the same lowercase model label you use elsewhere (e.g., request/inflight) to avoid split time series due to casing differences.
Prefer clamp_u64_to_i64 for MDC numeric fields (mirrors runtime-config handling and avoids accidental negative values if upstream widens types).

Apply:

@@
-        // Load and update MDC metrics
+        // Load and update MDC metrics
         let model_slug = Slug::from_string(&model_entry.name);
         let store: Box<dyn KeyValueStore> = Box::new(EtcdStorage::new(etcd_client.clone()));
         let card_store = Arc::new(KeyValueStoreManager::new(store));
 
         match card_store
             .load::<ModelDeploymentCard>(MDC_ROOT_PATH, &model_slug)
             .await
         {
             Ok(Some(mdc)) => {
-                // Inline MDC metrics update
-                self.model_context_length
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.context_length as i64);
-
-                self.model_kv_cache_block_size
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.kv_cache_block_size as i64);
-
-                self.model_migration_limit
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.migration_limit as i64);
+                // Inline MDC metrics update
+                let model = model_entry.name.to_lowercase();
+                self.model_context_length
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.context_length as u64));
+
+                self.model_kv_cache_block_size
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.kv_cache_block_size as u64));
+
+                self.model_migration_limit
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.migration_limit as u64));
             }

Also applies to: 482-486

159-166: Remove stale polling config doc in lib/llm/src/http/service/metrics.rs
Remove the “Runtime Config Polling Configuration” section and the DYN_HTTP_SVC_CONFIG_METRICS_POLL_INTERVAL_SECS entry—metrics now use event-driven updates.

lib/llm/src/http/service/service_v2.rs (1)

317-319: Remove stale comment about starting polling in run()

The comment refers to a metrics polling task that no longer exists.
-        // Note: Metrics polling task will be started in run() method to have access to cancellation token
+        // Metrics are updated via ModelWatcher-driven events; no background polling task.

lib/llm/src/entrypoint/input/http.rs (1)

242-258: Spawned updater task is fine; consider graceful shutdown linkage (optional)

Task drops are fine here; if you want graceful shutdown, optionally tie to a CancellationToken and select! to exit on cancel.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5323b61 and a907ae9.

📒 Files selected for processing (4)

lib/llm/src/entrypoint/input/http.rs (5 hunks)
lib/llm/src/http/service/metrics.rs (1 hunks)
lib/llm/src/http/service/service_v2.rs (1 hunks)
lib/llm/tests/http_metrics.rs (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.

🧬 Code graph analysis (3)

lib/llm/src/http/service/service_v2.rs (2)

lib/runtime/src/transports/etcd.rs (1)

etcd_client (131-133)

lib/llm/src/http/service.rs (1)

new (46-51)

lib/llm/src/entrypoint/input/http.rs (3)

lib/llm/src/http/service/service_v2.rs (3)

etcd_client (114-116)

model_manager (190-192)

spawn (194-197)

lib/llm/src/model_card.rs (2)

model_type (593-593)

model_type (796-798)

lib/llm/src/local_model.rs (2)

runtime_config (179-182)

runtime_config (376-378)

lib/llm/tests/http_metrics.rs (5)

lib/llm/src/engines.rs (1)

make_echo_engine (120-124)

lib/llm/tests/common/ports.rs (1)

get_random_port (5-15)

lib/llm/src/http/service/metrics.rs (5)

default (129-131)

new (166-350)

new (580-585)

new (596-617)

new (692-701)

lib/llm/src/http/service/service_v2.rs (3)

new (74-86)

etcd_client (114-116)

manager (106-108)

lib/llm/src/local_model.rs (1)

model_name (98-101)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)

GitHub Check: Build and Test - sglang
GitHub Check: Build and Test - vllm
GitHub Check: Build and Test - trtllm
GitHub Check: pre-merge-rust (lib/runtime/examples)
GitHub Check: pre-merge-rust (lib/bindings/python)
GitHub Check: pre-merge-rust (.)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (4)

lib/llm/src/http/service/service_v2.rs (1)

297-299: LGTM: centralizing etcd client in State

Passing etcd_client into State via new_with_etcd aligns with removing polling from HttpService and simplifies construction.

lib/llm/src/entrypoint/input/http.rs (2)

58-61: Builder chaining improvement

Inline with_etcd_client(...).build()? is cleaner and matches the new config flow.

83-84: Good: pass Metrics handle to watcher

This enables event-driven metric updates without polling.

lib/llm/tests/http_metrics.rs (1)

296-335: Test restructuring aligns with watcher-driven flow

Imports and setup for ModelWatcher and EngineConfig look correct and consistent with the new event-driven metrics path.

lib/llm/tests/http_metrics.rs

lib/llm/src/http/service/metrics.rs

lib/llm/tests/http_metrics.rs

Signed-off-by: Keiven Chang <[email protected]>

lib/llm/src/http/service/service_v2.rs

lib/llm/src/http/service/metrics.rs

…_not_polling2

keivenchang requested a review from a team as a code owner September 24, 2025 19:36

pull-request-size bot added the size/XL label Sep 24, 2025

github-actions bot added the feat label Sep 24, 2025

keivenchang self-assigned this Sep 24, 2025

keivenchang requested a review from KrishnanPrash September 24, 2025 19:37

coderabbitai bot reviewed Sep 24, 2025

View reviewed changes

lib/llm/tests/http_metrics.rs Outdated Show resolved Hide resolved

keivenchang commented Sep 24, 2025

View reviewed changes

lib/llm/src/http/service/metrics.rs Show resolved Hide resolved

lib/llm/tests/http_metrics.rs Show resolved Hide resolved

refactor: revert test organization changes

de16471

Signed-off-by: Keiven Chang <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB September 24, 2025 22:37 Inactive

copy-pr-bot bot temporarily deployed to GITLAB September 24, 2025 22:38 Inactive

keivenchang commented Sep 25, 2025

View reviewed changes

lib/llm/src/http/service/service_v2.rs Show resolved Hide resolved

keivenchang commented Sep 25, 2025

View reviewed changes

lib/llm/src/http/service/metrics.rs Show resolved Hide resolved

keivenchang requested review from nnshah1 and grahamking September 25, 2025 16:02

grahamking approved these changes Sep 25, 2025

View reviewed changes

Merge branch 'main' into keivenchang/DIS_676__NIM_metrics_use_Watcher…

3faca79

…_not_polling2

copy-pr-bot bot temporarily deployed to GITLAB September 26, 2025 15:57 Inactive

keivenchang merged commit 53f3d2a into main Sep 26, 2025
20 of 21 checks passed

keivenchang deleted the keivenchang/DIS_676__NIM_metrics_use_Watcher_not_polling2 branch September 26, 2025 16:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: replace polling with event-driven metrics updates #3207

feat: replace polling with event-driven metrics updates #3207

keivenchang commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 24, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

feat: replace polling with event-driven metrics updates #3207

feat: replace polling with event-driven metrics updates #3207

Conversation

keivenchang commented Sep 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 24, 2025

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

keivenchang commented Sep 24, 2025 •

edited by coderabbitai bot

Loading