
Copilot AI commented Nov 20, 2025

An erroneous vCPU kick occurs when a cancellation arrives after its target call has already completed and instead kicks the vCPU of the next call. That call re-enters the guest immediately, because no cancellation was requested for it. This can happen on both Linux and Windows.

Changes

  • Added METRIC_ERRONEOUS_VCPU_KICKS counter in metrics/mod.rs
  • Incremented when HyperlightExit::Cancelled() is received while neither flag is set (!cancel_requested && !debug_interrupted):
Ok(HyperlightExit::Cancelled()) => {
    if !cancel_requested && !debug_interrupted {
        metrics::counter!(METRIC_ERRONEOUS_VCPU_KICKS).increment(1);
        continue;  // Stale cancellation, retry
    }
    // ... handle actual cancellation
}
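
The retry behaviour above can be sketched in isolation. This is a minimal, self-contained model and not the code from this PR: `Exit`, `RunOutcome`, and `run_loop` are hypothetical names, the scripted list of exits stands in for real vCPU runs, and a plain `AtomicU64` stands in for the `metrics` counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Simplified exit reasons (hypothetical; the real exit type has more variants).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Exit {
    Cancelled,
    Halt,
}

/// How a guest call ended in this model.
#[derive(Debug, PartialEq)]
enum RunOutcome {
    Completed,
    Cancelled,
}

/// Replays a scripted sequence of vCPU exits.
/// `kicks` is a stand-in for the METRIC_ERRONEOUS_VCPU_KICKS counter.
fn run_loop(
    exits: Vec<Exit>,
    cancel_requested: bool,
    debug_interrupted: bool,
    kicks: &AtomicU64,
) -> RunOutcome {
    for exit in exits {
        match exit {
            // Nobody asked for this cancellation: count it and re-enter the guest.
            Exit::Cancelled if !cancel_requested && !debug_interrupted => {
                kicks.fetch_add(1, Ordering::Relaxed);
                continue;
            }
            // A requested cancellation (or debugger interrupt) ends the call.
            Exit::Cancelled => return RunOutcome::Cancelled,
            Exit::Halt => return RunOutcome::Completed,
        }
    }
    // Assumption of this sketch: a script that runs out counts as completed.
    RunOutcome::Completed
}
```

Calling `run_loop(vec![Exit::Cancelled, Exit::Halt], false, false, &kicks)` counts one erroneous kick and still completes the call, whereas the same script with `cancel_requested` set to `true` returns `RunOutcome::Cancelled` without touching the counter.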

Scenarios Tracked

This metric tracks erroneous vCPU kicks that can occur in two scenarios:

  1. Linux: A signal from a previous guest call arrives late and interrupts a new call
  2. Windows: WHvCancelRunVirtualProcessor is called right after the vCPU exits, while RUNNING_BIT is still true. The cancellation then lands on the next vCPU run, which is retried immediately because cancellation was not requested for it.
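
The Windows ordering in scenario 2 can be replayed deterministically with a small model. Everything here is an assumption made for illustration: `VcpuModel`, its fields, and `simulate_stale_cancel` are hypothetical names, no hypervisor API is called, and the model assumes (as the scenario describes) that a cancel issued after the vCPU has exited takes effect on the next run.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy model of one vCPU; field names are hypothetical.
#[derive(Default)]
struct VcpuModel {
    /// Stand-in for RUNNING_BIT.
    running_bit: AtomicBool,
    /// A cancel that was issued but not yet consumed by a run.
    latched_cancel: AtomicBool,
}

impl VcpuModel {
    /// Host side: issue a cancel only if the vCPU looks like it is running.
    /// Returns true if a cancel was issued.
    fn kill(&self) -> bool {
        if self.running_bit.load(Ordering::Acquire) {
            self.latched_cancel.store(true, Ordering::Release);
            true
        } else {
            false
        }
    }

    /// One vCPU run: returns true if a latched cancel ended it.
    fn run_once(&self) -> bool {
        self.latched_cancel.swap(false, Ordering::AcqRel)
    }
}

/// Replays the race and returns
/// (cancel_issued, first_run_cancelled, second_run_cancelled).
fn simulate_stale_cancel() -> (bool, bool, bool) {
    let vcpu = VcpuModel::default();

    // Call 1 runs to completion without being cancelled.
    vcpu.running_bit.store(true, Ordering::Release);
    let first_run_cancelled = vcpu.run_once();

    // The gap: the vCPU has exited, but the running flag is not cleared yet,
    // so a late kill() still issues a cancel.
    let cancel_issued = vcpu.kill();
    vcpu.running_bit.store(false, Ordering::Release);

    // Call 2 never requested cancellation, yet it consumes the stale cancel.
    vcpu.running_bit.store(true, Ordering::Release);
    let second_run_cancelled = vcpu.run_once();
    vcpu.running_bit.store(false, Ordering::Release);

    (cancel_issued, first_run_cancelled, second_run_cancelled)
}
```

In this model the second run is the one that gets kicked, which is the case the new counter is meant to make visible.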

This provides visibility into stale cancellation frequency as discussed in #1024.

Original prompt

This section details the original issue you should resolve

<issue_title>Add a metric for tracking erroneous vCPU kicks</issue_title>
<issue_description>If a signal is delivered very late as a result of calling kill(), it can interrupt a guest call that was not intended to be cancelled. This results in the vCPU getting "kicked", but it then re-enters the guest VM immediately because the new call was not intended to be cancelled. It would be good to track how often this happens.

See #1024 (comment) for original comment</issue_description>

Comments on the Issue (you are @copilot in this section)


@simongdavies simongdavies added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label Nov 20, 2025
Copilot AI changed the title [WIP] Add metric for tracking erroneous vCPU kicks Add metric for erroneous vCPU kicks from stale signals Nov 20, 2025
Copilot AI requested a review from simongdavies November 20, 2025 22:48
Copilot finished work on behalf of simongdavies November 20, 2025 22:48

ludfjig commented Nov 21, 2025

@copilot It can also happen on Windows in the small gap where WHvCancelRunVirtualProcessor is called right after the vCPU has already exited but RUNNING_BIT is still true. This will cancel the next vCPU run (but it will be retried immediately since cancellation was probably not requested). Please update the comments to reflect this scenario in addition to the already mentioned stale signals.


Copilot AI commented Nov 21, 2025

@copilot It can also happen on windows in the small gap where WHvCancelRunVirtualProcessor is called right after vcpu already exited but RUNNING_BIT is still true. This will cancel the next vcpu run (...

Updated the comments in both metrics/mod.rs and hypervisor/mod.rs to include the Windows scenario where WHvCancelRunVirtualProcessor is called right after vCPU exits but RUNNING_BIT is still true. Commit: 4cac386

Copilot AI changed the title Add metric for erroneous vCPU kicks from stale signals Add metric for erroneous vCPU kicks from stale cancellations Nov 21, 2025
Copilot AI requested a review from ludfjig November 21, 2025 00:20
Copilot finished work on behalf of ludfjig November 21, 2025 00:20
Development

Successfully merging this pull request may close these issues.

Add a metric for tracking erroneous vCPU kicks

3 participants