-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
[Core] Add token-level KV cache metrics to V1 engine #29836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Core] Add token-level KV cache metrics to V1 engine #29836
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| @property | ||
| def total_tokens(self) -> int: | ||
| """Get the total KV cache capacity in tokens. | ||
| Returns: | ||
| Total number of tokens that can be stored in the KV cache. | ||
| Calculated as: num_total_blocks × block_size | ||
| """ | ||
| return self.block_pool.get_num_total_blocks() * self.block_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define block_size before computing token metrics
The new token metrics call self.block_size to compute totals/free/used tokens, but KVCacheManager.__init__ never sets a block_size attribute anywhere in the class. When scheduler stats are enabled (log_stats=True), Scheduler.make_stats now accesses kv_cache_manager.total_tokens/free_tokens/used_tokens, which will raise an AttributeError on the first call instead of returning stats. Please initialize or fetch the block size before using these properties.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this issue is already solved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces valuable token-level KV cache metrics, which will improve monitoring capabilities. The implementation is generally sound, but I've identified a critical issue in vllm/v1/core/kv_cache_manager.py where an uninitialized attribute self.block_size is used, which will lead to a runtime AttributeError. Additionally, the new logic lacks unit tests, which allowed this bug to be missed. I've provided detailed comments on these points. Addressing the critical bug and adding the necessary tests are essential before this PR can be merged.
| @property | ||
| def total_tokens(self) -> int: | ||
| """Get the total KV cache capacity in tokens. | ||
| Returns: | ||
| Total number of tokens that can be stored in the KV cache. | ||
| Calculated as: num_total_blocks × block_size | ||
| """ | ||
| return self.block_pool.get_num_total_blocks() * self.block_size | ||
|
|
||
| @property | ||
| def free_tokens(self) -> int: | ||
| """Get the number of available tokens in the KV cache. | ||
| Returns: | ||
| Number of free tokens available for allocation. | ||
| Calculated as: num_free_blocks × block_size | ||
| """ | ||
| return self.block_pool.get_num_free_blocks() * self.block_size | ||
|
|
||
| @property | ||
| def used_tokens(self) -> int: | ||
| """Get the number of currently used tokens in the KV cache. | ||
| Returns: | ||
| Number of tokens currently occupied in the KV cache. | ||
| Calculated as: total_tokens - free_tokens | ||
| Note: | ||
| This is a derived metric. The actual allocation is tracked | ||
| at the block level by BlockPool. | ||
| """ | ||
| return self.total_tokens - self.free_tokens |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new properties introduce metric calculation logic that is currently untested. While the PR description proposes unit tests, they should be implemented as part of this change to ensure correctness and prevent regressions. For instance, unit tests would have caught the AttributeError from the uninitialized self.block_size attribute. Please add tests for these new properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this issue is already solved
83cd248 to
3800caf
Compare
🤖 Bot Review Status UpdateThe automated bot reviews are now outdated - the critical bug has been fixed. ✅ Fixed:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| Returns: | ||
| Total number of tokens that can be stored in the KV cache. | ||
| Calculated as: num_total_blocks × block_size | ||
| """ | ||
| return self.block_pool.get_num_total_blocks() * self.block_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Token metrics ignore larger KV block sizes
The new token-level metrics compute totals as num_blocks × self.block_size, but self.block_size is set to the hash block size (lines 157-161) rather than the actual per-group block sizes. The BlockPool supports KV cache groups whose real block size is a multiple of the hash block size, so on such models the reported total/free/used token counts will be proportionally under-reported even though the block pool can store more tokens. That makes the new metrics inaccurate whenever groups use larger blocks than the hash granularity.
Useful? React with 👍 / 👎.
|
@markmc - can you be on point for review for this one? |
Use actual KV cache block size from kv_cache_config instead of hash_block_size. **Issue**: The previous implementation incorrectly used `hash_block_size` for token metrics calculation. The hash_block_size is used for hashing granularity, not for the actual KV cache block size used by BlockPool. **Fix**: Initialize `self.block_size` from `kv_cache_config.kv_cache_groups[].kv_cache_spec.block_size`, which represents the actual block size used for token storage. **Impact**: This ensures token-level metrics (total_tokens, used_tokens, free_tokens) accurately reflect the real KV cache capacity, especially for models using larger block sizes than the hash granularity. Addresses bot review feedback on PR vllm-project#29836. Signed-off-by: Minsung-commit <[email protected]>
Use coordinator.block_size instead of hash_block_size for token metrics. **Issue**: The initial implementation incorrectly used `hash_block_size` for token metrics calculation. The hash_block_size is used for hashing granularity, not the actual KV cache block size used by BlockPool. **Solution**: Initialize `self.block_size` from `self.coordinator.block_size`, which already handles: - Extracting block_size from kv_cache_config.kv_cache_groups - DCP/PCP world_size scaling - Validation against hash_block_size (for UnitaryKVCacheCoordinator) This is cleaner than duplicating the logic and ensures consistency. **Impact**: Token-level metrics (total_tokens, used_tokens, free_tokens) now accurately reflect real KV cache capacity, especially for models using larger block sizes than hash granularity. Addresses bot review feedback on PR vllm-project#29836. Signed-off-by: Minsung-commit <[email protected]>
789f1e8 to
ce5eddc
Compare
Add token-level KV cache metrics (total, used, free) to complement existing percentage-based metrics in the V1 engine. ## Motivation Current V1 engine only provides kv_cache_usage as percentage (0.0-1.0). Absolute token counts are critical for: - Capacity planning: "28k tokens left" vs "35% free" - Cost accounting: Token-based billing - Monitoring: Prometheus/Grafana dashboards - Debugging: Understanding exact cache state ## Changes 1. **vllm/v1/metrics/stats.py**: Add fields to SchedulerStats - kv_cache_total_tokens: Total capacity - kv_cache_used_tokens: Currently occupied - kv_cache_free_tokens: Available space 2. **vllm/v1/core/block_pool.py**: Add get_num_total_blocks() - Returns total GPU blocks (excludes 1 reserved block) 3. **vllm/v1/core/kv_cache_manager.py**: Add properties - total_tokens, free_tokens, used_tokens - Derives block_size from coordinator (handles DCP/PCP scaling) 4. **vllm/v1/core/sched/scheduler.py**: Populate metrics in make_stats() ## Example Output Before: kv_cache_usage: 0.65 After: kv_cache_usage: 0.65 kv_cache_total_tokens: 82448 kv_cache_used_tokens: 53591 kv_cache_free_tokens: 28857 Addresses vllm-project#12283, vllm-project#26850 Signed-off-by: Minsung-commit <[email protected]>
ce5eddc to
0aa1356
Compare
|
Hi @Minsung-commit, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
This commit adds token-level KV cache metrics to the V1 engine, enabling more granular monitoring of KV cache utilization beyond the existing percentage-based metrics.
This PR addresses the V1 metrics initiative mentioned in #14101.
Motivation
Currently, vLLM V1 engine only provides kv_cache_usage as a float (0.0-1.0) representing percentage. While useful, this doesn't give users absolute token counts, which are critical for:
Changes
1. vllm/v1/metrics/stats.py
Add three new fields to SchedulerStats dataclass:
kv_cache_total_tokens: int = 0kv_cache_used_tokens: int = 0kv_cache_free_tokens: int = 02. vllm/v1/core/block_pool.py
Add
get_num_total_blocks()method to BlockPool:3. vllm/v1/core/kv_cache_manager.py
Add three read-only properties to KVCacheManager:
total_tokens: Total capacity (num_total_blocks × block_size)free_tokens: Available space (num_free_blocks × block_size)used_tokens: Occupied space (total_tokens - free_tokens)4. vllm/v1/core/sched/scheduler.py
Update
make_stats()to populate new token metrics:kv_cache_total_tokensfromkv_cache_manager.total_tokenskv_cache_used_tokensfromkv_cache_manager.used_tokenskv_cache_free_tokensfromkv_cache_manager.free_tokensBenefits
Example Usage
Before (only percentage):
After (percentage + tokens):
Now operators can see: "We have ~29k tokens left before we need to scale"
Testing
Manual Verification
py_compile)Proposed Unit Tests
Due to environment constraints, unit tests are proposed but not yet implemented:
Test Plan:
Note: Willing to add comprehensive unit tests if maintainers provide guidance on test environment setup requirements.
Related Issues
Checklist
Additional Notes
This PR implements a community-requested feature that has been pending for 9+ months since #14101 was closed in favor of a "V1 metrics initiative". This implementation:
Signed-off-by: dlalstjd931203 [email protected]