Skip to content

Conversation

keivenchang
Copy link
Contributor

@keivenchang keivenchang commented Sep 24, 2025

Overview:

Replace periodic polling-based metrics collection with event-driven updates triggered by ModelWatcher. This eliminates unnecessary background polling and makes metrics updates more responsive to actual model lifecycle events.

Details:

  • Replace periodic polling task with event-driven metrics updates triggered by ModelWatcher
  • Add update_model_metrics function that updates metrics when models are added
  • Remove start_runtime_config_polling_task and related polling infrastructure
  • Update integration test to use ModelWatcher directly instead of run_watcher
  • Add etcd_client to HttpServiceConfig for health endpoint functionality

Where should the reviewer start?

  • lib/llm/src/entrypoint/input/http.rs - Main changes to run_watcher function and new update_model_metrics
  • lib/llm/src/http/service/metrics.rs - Removal of polling task and cleanup of unused functions
  • lib/llm/tests/http_metrics.rs - Updated integration test to use ModelWatcher directly

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Relates to DIS-676

/coderabbit profile chill

Summary by CodeRabbit

  • New Features
    • Real-time model metrics updates during model add/remove events (no background polling required).
  • Bug Fixes
    • Ensures all HTTP endpoints update correctly for both added and removed models.
  • Refactor
    • Simplified service initialization by removing embedded polling and unused client wiring.
    • Streamlined metrics handling within the service for greater reliability.
  • Tests
    • Reworked integration tests to use a stepwise model lifecycle with explicit watcher setup and clearer metrics assertions.
  • Documentation
    • Added comments clarifying metrics updates and endpoint behavior.

- Replace periodic polling task with event-driven metrics updates triggered by ModelWatcher
- Add update_model_metrics function that updates metrics when models are added
- Remove start_runtime_config_polling_task and related polling infrastructure
- Update integration test to use ModelWatcher directly instead of run_watcher
- Add etcd_client to HttpServiceConfig for health endpoint functionality

This change improves performance by eliminating unnecessary polling and
makes metrics updates more responsive to actual model lifecycle events.

Signed-off-by: Keiven Chang <[email protected]>
@keivenchang keivenchang requested a review from a team as a code owner September 24, 2025 19:36
@github-actions github-actions bot added the feat label Sep 24, 2025
@keivenchang keivenchang self-assigned this Sep 24, 2025
Copy link
Contributor

coderabbitai bot commented Sep 24, 2025

Walkthrough

Consolidates HttpService construction with etcd, passes metrics into the model watcher, extends watcher to update metrics and endpoints on model changes, removes HttpService’s etcd-held metrics polling, inlines MDC metrics updates inside Metrics, and rewrites the integration test to exercise the new watcher-driven metrics flow.

Changes

Cohort / File(s) Summary
Entrypoint + Watcher wiring
lib/llm/src/entrypoint/input/http.rs
Chains HttpService build with etcd; passes metrics from service state into run_watcher; clones model manager for task use; expands endpoint updates for Added/Removed; adds update_model_metrics handling runtime-config and MDC via etcd when available.
HTTP metrics internals
lib/llm/src/http/service/metrics.rs
Removes update_mdc_metrics, update_metrics_from_model_entry, and start_runtime_config_polling_task; inlines MDC updates within update_metrics_from_model_entry_with_mdc; keeps runtime-config updates in the same method.
HTTP service struct/build
lib/llm/src/http/service/service_v2.rs
Drops etcd_client from HttpService; stops starting metrics polling in run; builder now initializes State with optional etcd but does not store etcd in HttpService.
Integration test (metrics)
lib/llm/tests/http_metrics.rs
Reorders setup; introduces ModelWatcher and explicit etcd watching; simplifies MDC registration and metrics verification; adjusts timing; asserts endpoint and MDC-related metrics via watcher-driven updates.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester as Test/Runner
  participant Builder as HttpServiceBuilder
  participant Http as HttpService
  participant State as State (metrics)
  participant Watcher as ModelWatcher Task
  participant Etcd as Etcd
  participant Metrics as Metrics

  Tester->>Builder: with_etcd_client(...).build()
  Builder->>State: new_with_etcd(optional Etcd)
  Builder->>Http: construct (no etcd field)
  Builder-->>Tester: HttpService

  Tester->>Watcher: run_watcher(model_manager, etcd, Metrics)
  Note over Watcher: Watch etcd for model add/remove

  Etcd-->>Watcher: ModelAdded(ModelType)
  Watcher->>Http: update_http_endpoints(Added)
  Watcher->>Metrics: update_model_metrics(Added, model_manager, etcd?)
  Note right of Metrics: Inline MDC + runtime-config updates

  Etcd-->>Watcher: ModelRemoved(ModelType)
  Watcher->>Http: update_http_endpoints(Removed)
  Watcher->>Metrics: update_model_metrics(Removed)<br/>(preserve existing metrics)

  Note over Http: No background polling for metrics
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

A rabbit taps code with a gentle thrum,
Watchers awake as the metrics hum.
No polls tonight—just signals swift,
Endpoints bloom with a model’s shift.
MDC whispers through etcd air—
Hops of joy: the gauges bear.
Carrots count, and all’s repaired.

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The title concisely summarizes the core change of replacing polling with event-driven metrics updates and aligns directly with the primary modifications in the code, making it clear and specific for reviewers scanning the history.
Description Check ✅ Passed The description follows the repository’s template by providing distinct Overview, Details, Where to start, and Related Issues sections with clear information about the changes, their motivation, and the files to review.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/llm/src/entrypoint/input/http.rs (1)

291-333: Ensure consistent model label casing when updating metrics

Use lowercase model labels to align with other metric emitters and avoid duplicate time series.

Apply:

 async fn update_model_metrics(
@@
         ModelUpdate::Added(model_type) => {
             tracing::debug!("Updating metrics for added model type: {:?}", model_type);
 
             // Get all model entries and update metrics for matching types
             let model_entries = model_manager.get_model_entries();
             for entry in model_entries {
                 if entry.model_type == model_type {
                     // Update runtime config metrics if available
                     if let Some(runtime_config) = &entry.runtime_config {
-                        metrics.update_runtime_config_metrics(&entry.name, runtime_config);
+                        let model = entry.name.to_lowercase();
+                        metrics.update_runtime_config_metrics(&model, runtime_config);
                     }
 
                     // Update MDC metrics if etcd is available
                     if let Some(ref etcd) = etcd_client
                         && let Err(e) = metrics
                             .update_metrics_from_model_entry_with_mdc(&entry, etcd)
                             .await
                     {
🧹 Nitpick comments (4)
lib/llm/src/http/service/metrics.rs (2)

497-509: Normalize model label casing and clamp MDC numeric conversions

  • Use the same lowercase model label you use elsewhere (e.g., request/inflight) to avoid split time series due to casing differences.
  • Prefer clamp_u64_to_i64 for MDC numeric fields (mirrors runtime-config handling and avoids accidental negative values if upstream widens types).

Apply:

@@
-        // Load and update MDC metrics
+        // Load and update MDC metrics
         let model_slug = Slug::from_string(&model_entry.name);
         let store: Box<dyn KeyValueStore> = Box::new(EtcdStorage::new(etcd_client.clone()));
         let card_store = Arc::new(KeyValueStoreManager::new(store));
 
         match card_store
             .load::<ModelDeploymentCard>(MDC_ROOT_PATH, &model_slug)
             .await
         {
             Ok(Some(mdc)) => {
-                // Inline MDC metrics update
-                self.model_context_length
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.context_length as i64);
-
-                self.model_kv_cache_block_size
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.kv_cache_block_size as i64);
-
-                self.model_migration_limit
-                    .with_label_values(&[&model_entry.name])
-                    .set(mdc.migration_limit as i64);
+                // Inline MDC metrics update
+                let model = model_entry.name.to_lowercase();
+                self.model_context_length
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.context_length as u64));
+
+                self.model_kv_cache_block_size
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.kv_cache_block_size as u64));
+
+                self.model_migration_limit
+                    .with_label_values(&[&model])
+                    .set(clamp_u64_to_i64(mdc.migration_limit as u64));
             }

Also applies to: 482-486


159-166: Remove stale polling config doc in lib/llm/src/http/service/metrics.rs
Remove the “Runtime Config Polling Configuration” section and the DYN_HTTP_SVC_CONFIG_METRICS_POLL_INTERVAL_SECS entry—metrics now use event-driven updates.

lib/llm/src/http/service/service_v2.rs (1)

317-319: Remove stale comment about starting polling in run()

The comment refers to a metrics polling task that no longer exists.

-        // Note: Metrics polling task will be started in run() method to have access to cancellation token
+        // Metrics are updated via ModelWatcher-driven events; no background polling task.
lib/llm/src/entrypoint/input/http.rs (1)

242-258: Spawned updater task is fine; consider graceful shutdown linkage (optional)

Task drops are fine here; if you want graceful shutdown, optionally tie to a CancellationToken and select! to exit on cancel.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5323b61 and a907ae9.

📒 Files selected for processing (4)
  • lib/llm/src/entrypoint/input/http.rs (5 hunks)
  • lib/llm/src/http/service/metrics.rs (1 hunks)
  • lib/llm/src/http/service/service_v2.rs (1 hunks)
  • lib/llm/tests/http_metrics.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
🧬 Code graph analysis (3)
lib/llm/src/http/service/service_v2.rs (2)
lib/runtime/src/transports/etcd.rs (1)
  • etcd_client (131-133)
lib/llm/src/http/service.rs (1)
  • new (46-51)
lib/llm/src/entrypoint/input/http.rs (3)
lib/llm/src/http/service/service_v2.rs (3)
  • etcd_client (114-116)
  • model_manager (190-192)
  • spawn (194-197)
lib/llm/src/model_card.rs (2)
  • model_type (593-593)
  • model_type (796-798)
lib/llm/src/local_model.rs (2)
  • runtime_config (179-182)
  • runtime_config (376-378)
lib/llm/tests/http_metrics.rs (5)
lib/llm/src/engines.rs (1)
  • make_echo_engine (120-124)
lib/llm/tests/common/ports.rs (1)
  • get_random_port (5-15)
lib/llm/src/http/service/metrics.rs (5)
  • default (129-131)
  • new (166-350)
  • new (580-585)
  • new (596-617)
  • new (692-701)
lib/llm/src/http/service/service_v2.rs (3)
  • new (74-86)
  • etcd_client (114-116)
  • manager (106-108)
lib/llm/src/local_model.rs (1)
  • model_name (98-101)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build and Test - sglang
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - trtllm
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/llm/src/http/service/service_v2.rs (1)

297-299: LGTM: centralizing etcd client in State

Passing etcd_client into State via new_with_etcd aligns with removing polling from HttpService and simplifies construction.

lib/llm/src/entrypoint/input/http.rs (2)

58-61: Builder chaining improvement

Inline with_etcd_client(...).build()? is cleaner and matches the new config flow.


83-84: Good: pass Metrics handle to watcher

This enables event-driven metric updates without polling.

lib/llm/tests/http_metrics.rs (1)

296-335: Test restructuring aligns with watcher-driven flow

Imports and setup for ModelWatcher and EngineConfig look correct and consistent with the new event-driven metrics path.

@keivenchang keivenchang merged commit 53f3d2a into main Sep 26, 2025
20 of 21 checks passed
@keivenchang keivenchang deleted the keivenchang/DIS_676__NIM_metrics_use_Watcher_not_polling2 branch September 26, 2025 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants