Skip to content

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Sep 16, 2025

This PR adds a new live test for Nexus handoff.

Depends on #9022 and #9023.

TODO:

  • Make sure the test turns off the autoplanner and/or verifies that it is off

@davepacheco
Copy link
Collaborator Author

davepacheco commented Sep 16, 2025

With a4x2 built from cb83530 and live test built from 8032377, it takes a while, but it works:

#      TMPDIR=/var/tmp ./cargo-nextest nextest run --profile=live-tests          --archive-file live-tests-archive/omicron-live-tests.tar.zst          --workspace-remap live-tests-archive
  Extracting 2 binaries, 1 build script output directory, and 5 linked paths to /var/tmp/nextest-archive-Fer3fa
   Extracted 79 files to /var/tmp/nextest-archive-Fer3fa in 3.66s
info: experimental features enabled: setup-scripts
------------
 Nextest run ID c4a011b3-5629-4c0a-8a84-e186530b0122 with nextest profile: live-tests
    Starting 2 tests across 2 binaries
        SLOW [> 60.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [>120.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [>180.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [>240.000s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        PASS [ 252.738s] omicron-live-tests::test_nexus_add_remove test_nexus_add_remove
        SLOW [> 60.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>120.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>180.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>240.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>300.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>360.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        SLOW [>420.000s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
        PASS [ 433.442s] omicron-live-tests::test_nexus_handoff test_nexus_handoff
------------
     Summary [ 686.212s] 2 tests run: 2 passed (2 slow), 0 skipped

@davepacheco davepacheco self-assigned this Sep 16, 2025
@davepacheco davepacheco marked this pull request as ready for review September 16, 2025 17:34
Base automatically changed from dap/write-notyet-records to main September 17, 2025 13:21
Comment on lines 131 to 135
/// Returns whether the given blueprint's sled configurations appear to be
/// propagated to all sleds.
///
/// Returns the inventory collection so that the caller can check additional
/// details if wanted.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns whether the given blueprint's sled configurations appear to be
/// propagated to all sleds.
///
/// Returns the inventory collection so that the caller can check additional
/// details if wanted.
/// Verifies that the given blueprint's sled configurations appear to be
/// propagated to all active sleds.
///
/// Returns the inventory collection so that the caller can check additional
/// details if wanted.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- updated the comment in 34aa89a.

Comment on lines 201 to 202
// Wait for the zones to be running.
// (This does not mean that their Nexus instances are running.)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Wait for the zones to be running.
// (This does not mean that their Nexus instances are running.)
// Wait for the zones to be running.
//
// We expect that the new Nexus zones will be blocked on the "not yet"
// DbMetadataNexusState, waiting for handoff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment in 34aa89a.


// Make sure we're starting from a known-normal state.
// First, we have an enabled target blueprint.
let blueprint1 = blueprint_load_target_enabled(log, nexus)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick; WDYT about giving these more descriptive names than the enumerated "blueprint1"? I just suffered through this in #8936 , where there was a test that went up to like "blueprint12" in a multi-hundred line long test. Refactoring it to add a blueprint in the middle sucks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry about that. Sometimes I think the numbers are really helpful for keeping track of the order. In this case, at least, the names "initial", "new_nexus", "handoff", and "cleanup" are clear enough about the order. Done in 34aa89a.

Comment on lines 401 to 410
// Wait for this to get propagated everywhere.
let _latest_collection = blueprint_wait_sled_configs_propagated(
opctx,
datastore,
&blueprint4,
new_nexus,
Duration::from_secs(120),
)
.await
.expect("waiting for blueprint4 sled configs");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we waiting for here? Should we be verifying that we can't access the "old Nexus quiesce" interface after this or something?

(If we have nothing else to verify, why wait?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just general test hygiene. I don't like to have tests kick off work that might not be finished when they finish running. The main reason is that this can make things flaky when you run multiple tests in a row because a second test winds up confused by the changing state left by the first test.

@davepacheco davepacheco enabled auto-merge (squash) September 17, 2025 19:57
@davepacheco davepacheco merged commit 0d82762 into main Sep 17, 2025
16 checks passed
@davepacheco davepacheco deleted the dap/handoff-live-test branch September 17, 2025 21:10
charliepark pushed a commit that referenced this pull request Sep 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants