-
Notifications
You must be signed in to change notification settings - Fork 614
feat: replace polling with event-driven metrics updates #3207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: replace polling with event-driven metrics updates #3207
Conversation
- Replace periodic polling task with event-driven metrics updates triggered by ModelWatcher - Add update_model_metrics function that updates metrics when models are added - Remove start_runtime_config_polling_task and related polling infrastructure - Update integration test to use ModelWatcher directly instead of run_watcher - Add etcd_client to HttpServiceConfig for health endpoint functionality This change improves performance by eliminating unnecessary polling and makes metrics updates more responsive to actual model lifecycle events. Signed-off-by: Keiven Chang <[email protected]>
WalkthroughConsolidates HttpService construction with etcd, passes metrics into the model watcher, extends watcher to update metrics and endpoints on model changes, removes HttpService’s etcd-held metrics polling, inlines MDC metrics updates inside Metrics, and rewrites the integration test to exercise the new watcher-driven metrics flow. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Tester as Test/Runner
participant Builder as HttpServiceBuilder
participant Http as HttpService
participant State as State (metrics)
participant Watcher as ModelWatcher Task
participant Etcd as Etcd
participant Metrics as Metrics
Tester->>Builder: with_etcd_client(...).build()
Builder->>State: new_with_etcd(optional Etcd)
Builder->>Http: construct (no etcd field)
Builder-->>Tester: HttpService
Tester->>Watcher: run_watcher(model_manager, etcd, Metrics)
Note over Watcher: Watch etcd for model add/remove
Etcd-->>Watcher: ModelAdded(ModelType)
Watcher->>Http: update_http_endpoints(Added)
Watcher->>Metrics: update_model_metrics(Added, model_manager, etcd?)
Note right of Metrics: Inline MDC + runtime-config updates
Etcd-->>Watcher: ModelRemoved(ModelType)
Watcher->>Http: update_http_endpoints(Removed)
Watcher->>Metrics: update_model_metrics(Removed)<br/>(preserve existing metrics)
Note over Http: No background polling for metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
Pre-merge checks✅ Passed checks (3 passed)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/llm/src/entrypoint/input/http.rs (1)
291-333
: Ensure consistent model label casing when updating metricsUse lowercase model labels to align with other metric emitters and avoid duplicate time series.
Apply:
async fn update_model_metrics( @@ ModelUpdate::Added(model_type) => { tracing::debug!("Updating metrics for added model type: {:?}", model_type); // Get all model entries and update metrics for matching types let model_entries = model_manager.get_model_entries(); for entry in model_entries { if entry.model_type == model_type { // Update runtime config metrics if available if let Some(runtime_config) = &entry.runtime_config { - metrics.update_runtime_config_metrics(&entry.name, runtime_config); + let model = entry.name.to_lowercase(); + metrics.update_runtime_config_metrics(&model, runtime_config); } // Update MDC metrics if etcd is available if let Some(ref etcd) = etcd_client && let Err(e) = metrics .update_metrics_from_model_entry_with_mdc(&entry, etcd) .await {
🧹 Nitpick comments (4)
lib/llm/src/http/service/metrics.rs (2)
497-509
: Normalize model label casing and clamp MDC numeric conversions
- Use the same lowercase model label you use elsewhere (e.g., request/inflight) to avoid split time series due to casing differences.
- Prefer clamp_u64_to_i64 for MDC numeric fields (mirrors runtime-config handling and avoids accidental negative values if upstream widens types).
Apply:
@@ - // Load and update MDC metrics + // Load and update MDC metrics let model_slug = Slug::from_string(&model_entry.name); let store: Box<dyn KeyValueStore> = Box::new(EtcdStorage::new(etcd_client.clone())); let card_store = Arc::new(KeyValueStoreManager::new(store)); match card_store .load::<ModelDeploymentCard>(MDC_ROOT_PATH, &model_slug) .await { Ok(Some(mdc)) => { - // Inline MDC metrics update - self.model_context_length - .with_label_values(&[&model_entry.name]) - .set(mdc.context_length as i64); - - self.model_kv_cache_block_size - .with_label_values(&[&model_entry.name]) - .set(mdc.kv_cache_block_size as i64); - - self.model_migration_limit - .with_label_values(&[&model_entry.name]) - .set(mdc.migration_limit as i64); + // Inline MDC metrics update + let model = model_entry.name.to_lowercase(); + self.model_context_length + .with_label_values(&[&model]) + .set(clamp_u64_to_i64(mdc.context_length as u64)); + + self.model_kv_cache_block_size + .with_label_values(&[&model]) + .set(clamp_u64_to_i64(mdc.kv_cache_block_size as u64)); + + self.model_migration_limit + .with_label_values(&[&model]) + .set(clamp_u64_to_i64(mdc.migration_limit as u64)); }Also applies to: 482-486
159-166
: Remove stale polling config doc in lib/llm/src/http/service/metrics.rs
Remove the “Runtime Config Polling Configuration” section and theDYN_HTTP_SVC_CONFIG_METRICS_POLL_INTERVAL_SECS
entry—metrics now use event-driven updates.lib/llm/src/http/service/service_v2.rs (1)
317-319
: Remove stale comment about starting polling in run()The comment refers to a metrics polling task that no longer exists.
- // Note: Metrics polling task will be started in run() method to have access to cancellation token + // Metrics are updated via ModelWatcher-driven events; no background polling task.lib/llm/src/entrypoint/input/http.rs (1)
242-258
: Spawned updater task is fine; consider graceful shutdown linkage (optional)Task drops are fine here; if you want graceful shutdown, optionally tie to a CancellationToken and select! to exit on cancel.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
lib/llm/src/entrypoint/input/http.rs
(5 hunks)lib/llm/src/http/service/metrics.rs
(1 hunks)lib/llm/src/http/service/service_v2.rs
(1 hunks)lib/llm/tests/http_metrics.rs
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: keivenchang
PR: ai-dynamo/dynamo#3035
File: lib/runtime/src/metrics/prometheus_names.rs:49-53
Timestamp: 2025-09-16T00:26:37.092Z
Learning: keivenchang prefers consistency in metric naming standardization over strict adherence to Prometheus conventions about gauge vs counter suffixes. When standardizing metrics naming, prioritize consistency across the codebase rather than technical pedantry about individual metric type conventions.
🧬 Code graph analysis (3)
lib/llm/src/http/service/service_v2.rs (2)
lib/runtime/src/transports/etcd.rs (1)
etcd_client
(131-133)lib/llm/src/http/service.rs (1)
new
(46-51)
lib/llm/src/entrypoint/input/http.rs (3)
lib/llm/src/http/service/service_v2.rs (3)
etcd_client
(114-116)model_manager
(190-192)spawn
(194-197)lib/llm/src/model_card.rs (2)
model_type
(593-593)model_type
(796-798)lib/llm/src/local_model.rs (2)
runtime_config
(179-182)runtime_config
(376-378)
lib/llm/tests/http_metrics.rs (5)
lib/llm/src/engines.rs (1)
make_echo_engine
(120-124)lib/llm/tests/common/ports.rs (1)
get_random_port
(5-15)lib/llm/src/http/service/metrics.rs (5)
default
(129-131)new
(166-350)new
(580-585)new
(596-617)new
(692-701)lib/llm/src/http/service/service_v2.rs (3)
new
(74-86)etcd_client
(114-116)manager
(106-108)lib/llm/src/local_model.rs (1)
model_name
(98-101)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - trtllm
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
lib/llm/src/http/service/service_v2.rs (1)
297-299
: LGTM: centralizing etcd client in StatePassing etcd_client into State via new_with_etcd aligns with removing polling from HttpService and simplifies construction.
lib/llm/src/entrypoint/input/http.rs (2)
58-61
: Builder chaining improvementInline with_etcd_client(...).build()? is cleaner and matches the new config flow.
83-84
: Good: pass Metrics handle to watcherThis enables event-driven metric updates without polling.
lib/llm/tests/http_metrics.rs (1)
296-335
: Test restructuring aligns with watcher-driven flowImports and setup for ModelWatcher and EngineConfig look correct and consistent with the new event-driven metrics path.
Signed-off-by: Keiven Chang <[email protected]>
Overview:
Replace periodic polling-based metrics collection with event-driven updates triggered by ModelWatcher. This eliminates unnecessary background polling and makes metrics updates more responsive to actual model lifecycle events.
Details:
Where should the reviewer start?
lib/llm/src/entrypoint/input/http.rs
- Main changes to run_watcher function and new update_model_metricslib/llm/src/http/service/metrics.rs
- Removal of polling task and cleanup of unused functionslib/llm/tests/http_metrics.rs
- Updated integration test to use ModelWatcher directlyRelated Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Relates to DIS-676
/coderabbit profile chill
Summary by CodeRabbit