deploy/metrics/README.md (24 additions & 1 deletion)
@@ -79,7 +79,30 @@ When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), th
 - `dynamo_frontend_requests_total`: Total LLM requests (counter)
 - `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)

-**Note**: The `dynamo_frontend_inflight_requests_total` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests_total` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
+##### Model Configuration Metrics
+
+The frontend also exposes model configuration metrics with the `dynamo_frontend_model_*` prefix. These metrics are populated from the worker backend registration service when workers register with the system:
+These metrics come from the runtime configuration provided by worker backends during registration.
+
+- `dynamo_frontend_model_total_kv_blocks`: Total KV blocks available for a worker serving the model (gauge)
+- `dynamo_frontend_model_max_num_seqs`: Maximum number of sequences for a worker serving the model (gauge)
+- `dynamo_frontend_model_max_num_batched_tokens`: Maximum number of batched tokens for a worker serving the model (gauge)
+
+**MDC Metrics (from ModelDeploymentCard):**
+These metrics come from the Model Deployment Card information provided by worker backends during registration.
+
+- `dynamo_frontend_model_context_length`: Maximum context length for a worker serving the model (gauge)
+- `dynamo_frontend_model_kv_cache_block_size`: KV cache block size for a worker serving the model (gauge)
+- `dynamo_frontend_model_migration_limit`: Request migration limit for a worker serving the model (gauge)
+
+
**Worker Management Metrics:**
101
+
-`dynamo_frontend_model_workers`: Number of worker instances currently serving the model (gauge)
102
+
103
+
**Important Notes:**
104
+
- The `dynamo_frontend_inflight_requests_total` metric tracks requests from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests_total` tracks requests from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is a subset of inflight time.
105
+
-**Model Name Deduplication**: When multiple worker instances register with the same model name, only the first instance's configuration metrics (runtime config and MDC metrics) will be populated. Subsequent instances with duplicate model names will be skipped for configuration metric updates, though the worker count metric will reflect all instances.
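
The Model Name Deduplication note in this change is easier to reason about with a small model of the bookkeeping. The following Python sketch is an illustration only, not the Dynamo implementation: `ModelMetricsRegistry`, `register_worker`, the model name, and the config values are all hypothetical names and numbers chosen for the example. It assumes the behavior described in the note: the first instance of a model name populates the configuration gauges, later instances are skipped, and the worker count still includes every instance.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ModelMetricsRegistry:
    """Hypothetical stand-in for the frontend's per-model metric bookkeeping."""

    # model name -> configuration gauge values (runtime config and MDC values)
    config_gauges: Dict[str, Dict[str, int]] = field(default_factory=dict)
    # model name -> number of worker instances registered under that name
    worker_counts: Dict[str, int] = field(default_factory=dict)

    def register_worker(self, model_name: str, config: Dict[str, int]) -> bool:
        """Record one worker; return True if its config populated the gauges."""
        # The worker count reflects every registered instance.
        self.worker_counts[model_name] = self.worker_counts.get(model_name, 0) + 1
        # Configuration gauges are populated only by the first instance.
        if model_name in self.config_gauges:
            return False
        self.config_gauges[model_name] = dict(config)
        return True


# Illustrative values only; these are not real deployment numbers.
registry = ModelMetricsRegistry()
first = registry.register_worker(
    "example-model", {"total_kv_blocks": 1024, "max_num_seqs": 256}
)
second = registry.register_worker(
    "example-model", {"total_kv_blocks": 2048, "max_num_seqs": 128}
)
```

Under these assumptions, the second worker's differing `total_kv_blocks` never reaches the gauges, which is why the note matters when workers serving the same model name run with different configurations.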
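
The relationship between `dynamo_frontend_queued_requests_total` and `dynamo_frontend_inflight_requests_total` stated in the diff can be checked with a toy timeline. This Python sketch is an assumption-laden illustration, not the frontend's code: `RequestTimeline`, `gauge_values`, the half-open interval boundaries, and the timestamps are all invented for the example. It models a request as queued from HTTP handler start until the first token, and inflight from HTTP handler start until the response completes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RequestTimeline:
    """Hypothetical timestamps (in seconds) for a single request."""

    handler_start: float
    first_token: float
    response_done: float

    def is_queued(self, now: float) -> bool:
        # Queued: from handler start until first token generation begins.
        return self.handler_start <= now < self.first_token

    def is_inflight(self, now: float) -> bool:
        # Inflight: from handler start until the complete response is finished.
        return self.handler_start <= now < self.response_done


def gauge_values(requests: List[RequestTimeline], now: float) -> Tuple[int, int]:
    """Return (queued, inflight) counts at time `now`."""
    queued = sum(r.is_queued(now) for r in requests)
    inflight = sum(r.is_inflight(now) for r in requests)
    return queued, inflight


# Two made-up requests that overlap in time.
requests = [
    RequestTimeline(handler_start=0.0, first_token=0.5, response_done=2.0),
    RequestTimeline(handler_start=1.0, first_token=1.5, response_done=3.0),
]
snapshot_early = gauge_values(requests, now=0.25)
snapshot_mid = gauge_values(requests, now=1.25)
snapshot_late = gauge_values(requests, now=2.5)
```

Because every queued window starts at the same instant as its inflight window and ends no later, the queued count in this model can never exceed the inflight count at any sampled time.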