Skip to content

Conversation

jgallagher
Copy link
Contributor

@jgallagher jgallagher commented Sep 22, 2025

(This is staged on top of #8936.)

Finishes the target-release reconfigurator-test, showing the simulate update walking through the process of starting new Nexus zones, waiting for handoff, then expunging the old Nexus zones.

Has two tweaks:

  • Fixes a planning report off-by-one bug where we'd claim a zone was both out of date and expunged (or updated) within the same plan.
  • Adds a set active-nexus-gen N command to reconfigurator-cli to control Nexus handoff instead of assuming it completes instantly.

Closes #8478

@jgallagher jgallagher requested a review from smklein September 22, 2025 22:08
@jgallagher jgallagher changed the title wip Complete target-release test simulating an entire update Sep 22, 2025
.and_modify(|zones| zones.push(zone_config.to_owned()))
.or_insert_with(|| vec![zone_config.to_owned()]);

// We check for out-of-date zones before expunging zones. If we just
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do want this change, but should it be in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so? The test changing in this PR is the only place (at least in terms of tests / CI) where this change is visible.


# We ought to update the inventory for the final sled and then step through
# the Nexus handoff process, but that work is still in progress.
# Start the Nexus handoff process: A planning step here should place three new
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I did add this test to #8936 in a recent update, but I'm totally fine with deleting my changes and using what you have here.

(I think your addition of set active-nexus-gen 2 will become necessary anyway)

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good to me. I haven't had a chance to look at the test output yet.

Seed { seed: String },
/// target number of Nexus instances (for planning)
NumNexus { num_nexus: u16 },
/// generation of the active Nexus zones (controlling handoff)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's worth calling out explicitly somewhere around here that this is controlling the simulated state about which (simulated) Nexus instances think they're active, as opposed to changing the generation in the blueprint. ("active generation" shouldn't imply a blueprint change, and obviously they're not doing blueprint-edit here, but I think it's still easy for someone to forget this and get confused.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair. I was trying to keep this short for --help but it's intricate enough that's probably the wrong thing to target. How about

/// specify the generation of Nexus zones that are considered
/// active when running the blueprint planner

?

Base automatically changed from nexus_gen_usage to main September 23, 2025 02:12
@david-crespo
Copy link
Contributor

david-crespo commented Sep 23, 2025

Nexus runs as part of this test, right? Is there a way I could hook into this (or just do my own similar thing) and hit the update status endpoint at various points throughout the process? See my test so far, very meh.

@jgallagher
Copy link
Contributor Author

Nexus runs as part of this test, right? Is there a way I could hook into this (or just do my own similar thing) and hit the update status endpoint at various points throughout the process? See my test so far, very meh.

Nexus is not running in this test, no. :(

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. One thing I'm worried is missing is serializing this state (in the save command), exporting it from a live system (i.e., putting it into UnstableReconfiguratorState), and loading that back in. I'm not sure how much of a problem that is? It does feel like kind of a problem when using either of those workflows (save + load or omdb reconfigurator export + load) once the active generation is not 1.

@jgallagher
Copy link
Contributor Author

This looks good. One thing I'm worried is missing is serializing this state (in the save command), exporting it from a live system (i.e., putting it into UnstableReconfiguratorState), and loading that back in. I'm not sure how much of a problem that is? It does feel like kind of a problem when using either of those workflows (save + load or omdb reconfigurator export + load) once the active generation is not 1.

Hmm, great point. Is "the set of active Nexuses" an inherent problem with saved state? I.e., there isn't anything in that state that tells us which set of Nexuses is active, if there are multiple generations.

If there's only one generation, we could assume that's the active set. Maybe load should check for this, and (a) set the active generation if there's only one, or (b) require the user to specify the active generation if there are two?

@davepacheco
Copy link
Collaborator

This looks good. One thing I'm worried is missing is serializing this state (in the save command), exporting it from a live system (i.e., putting it into UnstableReconfiguratorState), and loading that back in. I'm not sure how much of a problem that is? It does feel like kind of a problem when using either of those workflows (save + load or omdb reconfigurator export + load) once the active generation is not 1.

Hmm, great point. Is "the set of active Nexuses" an inherent problem with saved state? I.e., there isn't anything in that state that tells us which set of Nexuses is active, if there are multiple generations.

I think there's probably nothing in the UnstableReconfiguratorState today that tells us with certainty what it is, but I was assuming we'd just add it.

If there's only one generation, we could assume that's the active set. Maybe load should check for this, and (a) set the active generation if there's only one, or (b) require the user to specify the active generation if there are two?

This is also an option.

@jgallagher
Copy link
Contributor Author

I think there's probably nothing in the UnstableReconfiguratorState today that tells us with certainty what it is, but I was assuming we'd just add it.

Oh! I'm wrong; I think we do have this. UnstableReconfiguratorState includes the planning input, which includes the set of active_nexus_zones. We should be able to infer the generation from that.

@jgallagher
Copy link
Contributor Author

Oh! I'm wrong; I think we do have this. UnstableReconfiguratorState includes the planning input, which includes the set of active_nexus_zones. We should be able to infer the generation from that.

This is implemented in ddf1afb. I tested it by hand, although if it seems important I might be able to finagle an automated test for it too? At least confirming that it can load a generation greater than 1.

Comment on lines +866 to +867
warnings:
could not determine active Nexus generation from serialized state: no target blueprint set (using default of 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's this? Aren't we loaded something we just saved, and wouldn't the saved version have what we need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't use the example system; it's just showing a handful of commands. Earlier in this file we see no blueprints at all:

> blueprint-list
T ENA ID PARENT TIME_CREATED 

which is consistent with the warning here that there's no target blueprint.

Comment on lines +274 to +278
res.warnings.push(format!(
"could not determine active Nexus \
generation from serialized state: \
{message} (using default of 1)"
));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm basically wondering if this should be an error (similar to what happens if you try to load it without specifying an inventory collection and there's more than one of them). But it depends on why this could happen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never happen in a state blob pulled from a real system. But it can happen in some stripped down reconfigurator-cli contexts (like the one from your other comment).

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking approved because I'd like to figure that out (the thing I commented on), but I don't want to block this too much longer than that concern.

@jgallagher jgallagher merged commit 2648592 into main Sep 24, 2025
16 checks passed
@jgallagher jgallagher deleted the john/target-release-test-nexus-handoff branch September 24, 2025 14:13
leftwo pushed a commit that referenced this pull request Sep 26, 2025
Finishes the `target-release` `reconfigurator-test`, showing the
simulate update walking through the process of starting new Nexus zones,
waiting for handoff, then expunging the old Nexus zones.

Has two tweaks:

* Fixes a planning report off-by-one bug where we'd claim a zone was
both out of date and expunged (or updated) within the same plan.
* Adds a `set active-nexus-gen N` command to `reconfigurator-cli` to
control Nexus handoff instead of assuming it completes instantly.

Closes #8478

---------

Co-authored-by: Sean Klein <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconfigurator testing: Expand target-release.txt to walk through a full update
4 participants