
Conversation

@smklein smklein (Collaborator) commented Aug 28, 2025

  • Actually update nexus generation within the top-level blueprint and Nexus zones
  • Deploy new and old nexus zones concurrently

Blueprint Planner

  • Automatically determine nexus generation when provisioning new Nexus zones, based on existing deployed zones
  • Update the logic for provisioning nexus zones, to deploy old and new nexus images side-by-side
  • Update the logic for expunging nexus zones, to only do so when running from a "newer" nexus
  • Add a planning stage to bump the top-level "nexus generation", if appropriate, which would trigger the old Nexuses to quiesce.

Fixes #8843, #8854
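
For illustration only, a rough sketch of the sequencing described in the bullets above. All names are invented for this sketch and do not correspond to omicron types; the real logic lives in the blueprint planner.

#[derive(Debug, PartialEq)]
enum HandoffStep {
    AddNewNexusZones,     // deploy new-image Nexus zones alongside the old ones
    BumpNexusGeneration,  // top-level bump that tells the old Nexuses to quiesce
    ExpungeOldNexusZones, // only done once a newer-generation Nexus is in charge
    Done,
}

fn next_step(
    new_nexuses_running: usize,
    target_redundancy: usize,
    top_level_generation: u64,
    new_zone_generation: u64,
    old_nexuses_remaining: usize,
) -> HandoffStep {
    if new_nexuses_running < target_redundancy {
        HandoffStep::AddNewNexusZones
    } else if top_level_generation < new_zone_generation {
        HandoffStep::BumpNexusGeneration
    } else if old_nexuses_remaining > 0 {
        HandoffStep::ExpungeOldNexusZones
    } else {
        HandoffStep::Done
    }
}

fn main() {
    // New zones (generation 2) are running but the top-level generation is
    // still 1, so the next step is to bump it and trigger quiescing.
    assert_eq!(next_step(3, 3, 1, 2, 3), HandoffStep::BumpNexusGeneration);
    // Once the top-level generation reaches 2, the old zones get expunged
    // (by a Nexus running at the new generation).
    assert_eq!(next_step(3, 3, 2, 2, 3), HandoffStep::ExpungeOldNexusZones);
}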

@smklein smklein force-pushed the nexus_gen_usage branch 2 times, most recently from d1bd3fb to bf8f274 Compare August 28, 2025 22:37
@smklein smklein force-pushed the nexus_gen_usage branch 2 times, most recently from 62a6819 to 30ecc07 Compare August 28, 2025 23:03
@smklein smklein force-pushed the image_reporting branch 2 times, most recently from 0b2efdd to 9c09f60 Compare August 29, 2025 21:22
@smklein smklein force-pushed the nexus_gen_usage branch 2 times, most recently from 6da9f39 to c4c748f Compare August 30, 2025 00:51
smklein added a commit that referenced this pull request on Aug 30, 2025:
Adds schema for nexus generations, leaves the value at "1".

These schemas will be used more earnestly in #8936.

Fixes #8853
@smklein smklein mentioned this pull request Sep 17, 2025
match current_nexus_generation {
Some(current) => write!(
f,
"zone gen ({zone_generation}) >= currently-running \
Contributor

(I haven't reviewed the planner changes yet, so maybe this question will make more sense once I do.)

Why do we report that the generation is >= the currently-running Nexus's generation for "waiting to expunge"? I thought we'd be expunging zones with a generation < the currently-running Nexus.

Collaborator Author

Yeah, you're right - we only do want to expunge zones with a generation < the currently-running Nexus. That's the check we're making in the planner that would need to fail to generate this planning report.

This is called from do_plan_zone_updates in the caller, when we're iterating over out_of_date_zones. So basically:

  • We found a zone with an image that does not match what the TUF repo expects it to be (it's part of out_of_date_zones).
  • We tried to check can_zone_be_updated, to see if we can expunge this zone. When we did that, the generation for this observed zone looks "as new or newer" than our currently-running zone.

So, TL;DR: the context for this report is that "the target TUF repo changed, and this zone looks out-of-date".

I think this is probably most common when the "zone we're trying to expunge" has a generation exactly equal to the active generation, but I wanted to be precise about this check. I think if the "zone we're trying to expunge" has a generation greater than the active generation, that would imply we changed TUF repos mid-update, which we've previously discussed banning anyway.
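
For reference, a minimal sketch of the gate being described, with invented names (the real check lives in the planner and compares the zone's nexus generation against the currently-running Nexus generation):

fn nexus_zone_ready_to_expunge(zone_generation: u64, running_generation: u64) -> bool {
    // Only expunge a Nexus zone once the running generation has moved past it.
    zone_generation < running_generation
}

fn main() {
    // Mid-handoff: the zone is at the same generation as the running Nexus,
    // so the planner reports it as "waiting" instead of expunging it.
    assert!(!nexus_zone_ready_to_expunge(1, 1));
    // After handoff: the running Nexus is at generation 2, the old zone at 1.
    assert!(nexus_zone_ready_to_expunge(1, 2));
}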

Collaborator

This goes to the question I had about "what is a ZoneWaitingToExpunge". What is it we're trying to report here? It doesn't seem like there's anything wrong or blocking?

Collaborator Author

What is it we're trying to report here? It doesn't seem like there's anything wrong or blocking?

We are trying to report that there is a zone which has an image that does not match the target TUF repo. For almost all zones, this means "we want to expunge it". For Nexus, however, we only want to expunge it if it's not actively running. The logic for this is entirely contained within can_zone_be_updated in nexus/reconfigurator/planning/src/planner.rs.

It doesn't seem like there's anything wrong or blocking?

Well, it's not a "wrong" case for sure, but it is blocking, IMO in kinda the same way that "not enough redundancy for a service" could prevent a zone from being expunged. In this case, we want to expunge these Nexus zones; we just aren't ready to do so, because handoff has not finished.

Contributor

Well, it's not a "wrong" case for sure, but it is blocking, IMO in kinda the same way that "not enough redundancy for a service" could prevent a zone from being expunged.

I don't think I follow this. In the "not enough redundancy" case, we block the update from proceeding until redundancy is restored. But these messages are showing up even if everything is fine and normal, and they eventually go away once we're far enough along in the update. That doesn't seem blocking?

Collaborator Author

I tried to update the text in 4ea1405 to make this more precise (only matching "exactly" the current version).

If we feel this is still redundant with the "... remaining out-of-date zones" message, I can remove it, but I did want to communicate something in the report about why "even though you might want this zone to be updated, there is a good reason why we aren't doing it right now"

Contributor

I still don't really understand what it's trying to communicate. Maybe that's because I do think it's redundant? I think I would claim that at the point at which we've started to update zones, but before we're done with all of them, there's nothing to report about Nexus other than including it in the "remaining out-of-date zones" count. All of those zones are either "waiting to be expunged" or "waiting to be updated in place" (depending on their flavor).

Maybe there's something more specific to report once we get to the point of starting new Nexus zones and performing the handoff? But I'm not sure exactly what that would be, or how we'd see it. We only get to see the reports if a new blueprint is produced that makes some change (otherwise we throw it away), so I'd think we'd see reports for

  • the blueprint that added the three new Nexus zones once all the non-Nexus zones are updated
  • the blueprint that bumped the top-level Nexus generation once quiescing is done
  • the blueprint (created by new Nexus) that expunges the old Nexus zones

but at none of those points are we really waiting for the old Nexus zones to be expunged, I think? Are there more blueprints coming out during the handoff beyond those three?

Collaborator Author

I think I'd be most interested in this report if:

  • All zones are updated, except for Nexus
  • According to the planner input, the old Nexuses are still running
  • They generate a blueprint saying "we aren't modifying anything, but there are still out-of-date zones"

In this case, I think it's important to identify that "Even though the system has not converged to an updated state, we are waiting for something -- this is what, and this is why".

Otherwise, we'd see: "blueprint with no changes, even though we know some zones are out-of-date"

Would it be more clear if I limited the scope of this report to "only emit this record if Nexus is the only out-of-date zone"?
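
A hypothetical sketch of that narrower condition (invented helper, not the actual planner code): only emit the record when Nexus zones are the only out-of-date zones left.

fn should_report_waiting_on_handoff(out_of_date_zone_kinds: &[&str]) -> bool {
    !out_of_date_zone_kinds.is_empty()
        && out_of_date_zone_kinds.iter().all(|kind| *kind == "nexus")
}

fn main() {
    assert!(should_report_waiting_on_handoff(&["nexus", "nexus"]));
    assert!(!should_report_waiting_on_handoff(&["nexus", "crucible"]));
}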

Collaborator Author

We only get to see the reports if a new blueprint is produced that makes some change (otherwise we throw it away)

Do planning inputs only propagate if something else in the blueprint changes?

If that's true, then I think we could fail to communicate the ZoneUnsafeToShutdown reports too - these reports also identify why "we might not be making any changes to your blueprint, even though there is work to be done"

Collaborator Author

I refactored this a bit in 9843b4a:

  • This minimizes the window in which we report about Nexus zones: we only do so once they're the only remaining zones to be updated
  • This removes all the noise from the reconfigurator-cli tests
  • I explicitly added this case to the test suite, where we try to perform an update from an old Nexus, and see that we're blocked. We also see the report in this case, which will identify why we aren't proceeding.

.collect();

// Define image sources: A (same as existing Nexus) and B (new)
let image_source_a = BlueprintZoneImageSource::InstallDataset;
Contributor

This kinda strengthens my fear that deciding generation based on image sources is a little fraught; two Nexus zones on two different sleds both with the source InstallDataset might be completely unrelated. (Presumably they aren't in practice! And it'd be very surprising. But an image source of InstallDataset means we'd need to check the zone manifest in inventory to know whether the zone image sources actually match.)
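
To illustrate the ambiguity (simplified types, not the real BlueprintZoneImageSource): two InstallDataset sources can only be compared by consulting the sled's zone manifest in inventory, so a naive comparison has to treat that case as unknown.

#[derive(Clone, PartialEq)]
enum ImageSource {
    InstallDataset,
    Artifact { hash: [u8; 32] },
}

// Some(true/false) when the sources alone give a definitive answer; None when
// either side is an install dataset and the answer depends on inventory.
fn images_definitely_match(a: &ImageSource, b: &ImageSource) -> Option<bool> {
    match (a, b) {
        (ImageSource::Artifact { hash: ha }, ImageSource::Artifact { hash: hb }) => {
            Some(ha == hb)
        }
        _ => None,
    }
}

fn main() {
    let a = ImageSource::InstallDataset;
    let b = ImageSource::InstallDataset;
    // Unknown without the zone manifest, even though the variants are equal.
    assert_eq!(images_definitely_match(&a, &b), None);
}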

Collaborator Author

Just to confirm, are you suggesting I handle this case, or that the nexus_generation logic would be fraught regardless?

This is what we said about this in RFD 588:

When creating a new Nexus instance, the planner picks the nexus_generation as follows:

  • If the new instance’s image matches that of any existing Nexus instance, expunged or otherwise, use the same blueprint generation as that instance. (This should always match the current blueprint generation or the next one (during an upgrade).)
  • If the new instance’s image does not match that of any existing Nexus instance, choose one generation number higher than all existing instances. (This should always match the current blueprint generation plus one.)
The intuition here is that a new generation is defined by deploying a new Nexus image. If there’s a new Nexus image, we’ll need a handoff. If not, we don’t.
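
A simplified sketch of that rule (invented signature; images are opaque strings here, whereas the real code works on BlueprintZoneImageSource):

fn determine_nexus_generation(
    existing: &[(String, u64)], // (image identifier, nexus_generation)
    new_image: &str,
) -> Option<u64> {
    // No existing Nexus zones: nothing to anchor the choice on.
    if existing.is_empty() {
        return None;
    }
    // Same image as an existing Nexus (expunged or otherwise): reuse its generation.
    if let Some((_, generation)) =
        existing.iter().find(|(image, _)| image.as_str() == new_image)
    {
        return Some(*generation);
    }
    // New image: one higher than every existing generation.
    existing.iter().map(|(_, generation)| *generation).max().map(|g| g + 1)
}

fn main() {
    let existing = vec![("nexus-v1".to_string(), 1)];
    assert_eq!(determine_nexus_generation(&existing, "nexus-v1"), Some(1));
    assert_eq!(determine_nexus_generation(&existing, "nexus-v2"), Some(2));
}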

Collaborator

Yeah, I kind of feel like you shouldn't be able to add a new one while any of them has source InstallDataset? I feel like we need to wait for image resolution to finish.

Contributor

Yeah, I think my concern is that the RFD bit you quoted talks in terms of "matching" images. InstallDataset alone does not give enough information to know whether it matches another sled's InstallDataset image (even of the same zone type). I think given we block zone updates until all zone image sources are known (or will, once #8921 lands), we should also gate doing anything related to Nexus handoff on the same thing.

Collaborator Author

I updated the tests in this PR to avoid upgrading from the InstallDataset in b1ee973 - but I'm going to avoid including a gate within the logic of "nexus zone add", because #8921 should prevent the addition of any zone, nexus or not, if there are InstallDatasets present.

@davepacheco davepacheco (Collaborator) left a comment

I haven't yet looked at the builder or the planner but I see the PR is changing a lot so I wanted to get these comments in. Will circle back to those files next.

Comment on lines 2343 to 2344
* 0 zones waiting to be expunged:
* zone 466a9f29-62bf-4e63-924a-b9efdb86afec (nexus): zone gen (1) >= currently-running Nexus gen (1)
Collaborator

This seems confusing -- it sounds like we're trying to expunge these zones because they're at a newer generation?

It also seems like these messages come and go over the next several steps.

I think this is probably just a problem with the text, not the behavior. I'm not sure what it's trying to communicate though.

Collaborator Author

I tried to update the text in 4ea1405.

Collaborator Author

Also, made this report only appear when it's more relevant - see: 9843b4a

* skipping noop zone image source check on sled d81c6a84-79b8-4958-ae41-ea46c9b19763: all 6 zones are already from artifacts
* 1 pending MGS update:
* model0:serial0: RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })
* only placed 0/2 desired nexus zones
Collaborator

What's going on with these? I only skimmed the test so far but I think at this point in the test we're expecting to be doing an update, but we can't proceed because there are pending MGS updates. Why is it saying it only placed 0/2 desired Nexus zones?

Collaborator Author

Okay, after doing a little digging, I think this is kinda nasty, and probably should be considered related to #8921.

At this point in the test, there are three sleds:

  • One has nexus @ InstallDataset
  • One has nexus @ 1.0.0
  • One has nexus @ 2.0.0

This should normally not be possible, but I believe it is occurring because the three sleds have been manually edited. They also appear to all have been constructed with "nexus_generation = 1", which matches the blueprint-level "nexus_generation".

(This would be flagged by blippy as a corrupt blueprint - using the same generation for different images - but I don't think anyone is checking in this test)

This means we give really weird input to the planner here: we say that all three Nexuses are active, because they're all on in-service sleds running the desired version of Nexus.

Without the #8921 changes, we will try to proceed with placing discretionary zones, even though there is an InstallDataset present. As a part of this, we look up the "currently in-charge Nexus image", and that lookup happens to find InstallDataset first.

Then, the planner tries to ensure that the "currently in-charge Nexus image" has sufficient redundancy. It sees that one sled has (Nexus, InstallDataset), and wants "two more Nexuses at this image" to reach the target. This is where the "2" in "0/2" comes from.

Next, the planner calls add_discretionary_zones, but the place_zone logic prevents zones of the same type from being placed on the same sled. Basically, all sleds report "I have a nexus already (ignoring the image source), so you can't place a new nexus here". This means we place zero new Nexuses.

This triggers the report.out_of_eligible_sleds call, which creates this message in the output.

I think the "pending MGS updates" logic would prevent zones from being updated, but this logic is happening in the context of "trying to restore redundancy, when the planner thinks we're running at reduced capacity" -- this is entirely in the domain of "zone add", not "zone update".

Collaborator Author

To be super explicit: This "three-different-versions-at-once" behavior is also on main:

> blueprint-show latest
blueprint 8f2d1f39-7c88-4701-aa43-56bf281b28c1
parent: ce365dff-2cdb-4f35-a186-b15e20e1e700
sled: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c (active, config generation 5)
host phase 2 contents:
------------------------
slot boot image source
------------------------
A current contents
B current contents
physical disks:
------------------------------------------------------------------------------------
vendor model serial disposition
------------------------------------------------------------------------------------
fake-vendor fake-model serial-72c59873-31ff-4e36-8d76-ff834009349a in service
datasets:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset name dataset id disposition quota reservation compression
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crucible 8c4fa711-1d5d-4e93-85f0-d17bff47b063 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/clickhouse 3b66453b-7148-4c1b-84a9-499e43290ab4 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/external_dns 841d5648-05f0-47b0-b446-92f6b60fe9a6 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/internal_dns 3560dd69-3b23-4c69-807d-d673104cfc68 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone 4829f422-aa31-41a8-ab73-95684ff1ef48 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_clickhouse_353b3b65-20f7-48c3-88f7-495bd5d31545 318fae85-abcb-4259-b1b6-ac96d193f7b7 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_crucible_bd354eef-d8a6-4165-9124-283fb5e46d77 2ad1875a-92ac-472f-8c26-593309f0e4da in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_crucible_pantry_ad6a3a03-8d0f-4504-99a4-cbf73d69b973 c31623de-c19b-4615-9f1d-5e1daa5d3bda in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_external_dns_6c3ae381-04f7-41ea-b0ac-74db387dbc3a b46de15d-33e7-4cd0-aa7c-e7be2a61e71b in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_internal_dns_99e2f30b-3174-40bf-a78a-90da8abba8ca 09b9cc9b-3426-470b-a7bc-538f82dede03 in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_nexus_466a9f29-62bf-4e63-924a-b9efdb86afec 775f9207-c42d-4af2-9186-27ffef67735e in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_ntp_62620961-fc4a-481e-968b-f5acbac0dc63 2db6b7c1-0f46-4ced-a3ad-48872793360e in service none none off
oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/debug 93957ca0-9ed1-4e7b-8c34-2ce07a69541c in service 100 GiB none gzip-9
omicron zones:
---------------------------------------------------------------------------------------------------------------
zone type zone id image source disposition underlay IP
---------------------------------------------------------------------------------------------------------------
clickhouse 353b3b65-20f7-48c3-88f7-495bd5d31545 install dataset in service fd00:1122:3344:102::23
crucible bd354eef-d8a6-4165-9124-283fb5e46d77 install dataset in service fd00:1122:3344:102::26
crucible_pantry ad6a3a03-8d0f-4504-99a4-cbf73d69b973 install dataset in service fd00:1122:3344:102::25
external_dns 6c3ae381-04f7-41ea-b0ac-74db387dbc3a install dataset in service fd00:1122:3344:102::24
internal_dns 99e2f30b-3174-40bf-a78a-90da8abba8ca install dataset in service fd00:1122:3344:1::1
internal_ntp 62620961-fc4a-481e-968b-f5acbac0dc63 install dataset in service fd00:1122:3344:102::21
nexus 466a9f29-62bf-4e63-924a-b9efdb86afec install dataset in service fd00:1122:3344:102::22
sled: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6 (active, config generation 7)
host phase 2 contents:
------------------------
slot boot image source
------------------------
A current contents
B current contents
physical disks:
------------------------------------------------------------------------------------
vendor model serial disposition
------------------------------------------------------------------------------------
fake-vendor fake-model serial-c6d33b64-fb96-4129-bab1-7878a06a5f9b in service
datasets:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset name dataset id disposition quota reservation compression
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crucible 43931274-7fe8-4077-825d-dff2bc8efa58 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/external_dns a4c3032e-21fa-4d4a-b040-a7e3c572cf3c in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/internal_dns 4f60b534-eaa3-40a1-b60f-bfdf147af478 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone 4617d206-4330-4dfa-b9f3-f63a3db834f9 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_crucible_5199c033-4cf9-4ab6-8ae7-566bd7606363 ad41be71-6c15-4428-b510-20ceacde4fa6 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_crucible_pantry_ba4994a8-23f9-4b1a-a84f-a08d74591389 1bca7f71-5e42-4749-91ec-fa40793a3a9a in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_external_dns_803bfb63-c246-41db-b0da-d3b87ddfc63d 3ac089c9-9dec-465b-863a-188e80d71fb4 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_internal_dns_427ec88f-f467-42fa-9bbb-66a91a36103c 686c19cf-a0d7-45f6-866f-c564612b2664 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_nexus_0c71b3b2-6ceb-4e8f-b020-b08675e83038 793ac181-1b01-403c-850d-7f5c54bda6c9 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/zone/oxz_ntp_6444f8a5-6465-4f0b-a549-1993c113569c cdf3684f-a6cf-4449-b9ec-e696b2c663e2 in service none none off
oxp_c6d33b64-fb96-4129-bab1-7878a06a5f9b/crypt/debug 248c6c10-1ac6-45de-bb55-ede36ca56bbd in service 100 GiB none gzip-9
omicron zones:
-----------------------------------------------------------------------------------------------------------------------
zone type zone id image source disposition underlay IP
-----------------------------------------------------------------------------------------------------------------------
crucible 5199c033-4cf9-4ab6-8ae7-566bd7606363 artifact: version 1.0.0 in service fd00:1122:3344:101::25
crucible_pantry ba4994a8-23f9-4b1a-a84f-a08d74591389 artifact: version 1.0.0 in service fd00:1122:3344:101::24
external_dns 803bfb63-c246-41db-b0da-d3b87ddfc63d artifact: version 1.0.0 in service fd00:1122:3344:101::23
internal_dns 427ec88f-f467-42fa-9bbb-66a91a36103c artifact: version 1.0.0 in service fd00:1122:3344:2::1
internal_ntp 6444f8a5-6465-4f0b-a549-1993c113569c artifact: version 1.0.0 in service fd00:1122:3344:101::21
nexus 0c71b3b2-6ceb-4e8f-b020-b08675e83038 artifact: version 1.0.0 in service fd00:1122:3344:101::22
sled: d81c6a84-79b8-4958-ae41-ea46c9b19763 (active, config generation 6)
host phase 2 contents:
------------------------
slot boot image source
------------------------
A current contents
B current contents
physical disks:
------------------------------------------------------------------------------------
vendor model serial disposition
------------------------------------------------------------------------------------
fake-vendor fake-model serial-4930954e-9ac7-4453-b63f-5ab97c389a99 in service
datasets:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
dataset name dataset id disposition quota reservation compression
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crucible 090bd88d-0a43-4040-a832-b13ae721f74f in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/external_dns 4da74a5b-6911-4cca-b624-b90c65530117 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/internal_dns 252ac39f-b9e2-4697-8c07-3a833115d704 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone 45cd9687-20be-4247-b62a-dfdacf324929 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_crucible_f55647d4-5500-4ad3-893a-df45bd50d622 1cb0a47a-59ac-4892-8e92-cf87b4290f96 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_crucible_pantry_75b220ba-a0f4-4872-8202-dc7c87f062d0 b1deff4b-51df-4a37-9043-afbd7c70a1cb in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_external_dns_f6ec9c67-946a-4da3-98d5-581f72ce8bf0 c65a9c1c-36dc-4ddb-8aac-ec3be8dbb209 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_internal_dns_ea5b4030-b52f-44b2-8d70-45f15f987d01 21fd4f3a-ec31-469b-87b1-087c343a2422 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_nexus_3eeb8d49-eb1a-43f8-bb64-c2338421c2c6 e009d8b8-4695-4322-b53f-f03f2744aef7 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/zone/oxz_ntp_f10a4fb9-759f-4a65-b25e-5794ad2d07d8 41071985-1dfd-4ce5-8bc2-897161a8bce4 in service none none off
oxp_4930954e-9ac7-4453-b63f-5ab97c389a99/crypt/debug 7a6a2058-ea78-49de-9730-cce5e28b4cfb in service 100 GiB none gzip-9
omicron zones:
-----------------------------------------------------------------------------------------------------------------------
zone type zone id image source disposition underlay IP
-----------------------------------------------------------------------------------------------------------------------
crucible f55647d4-5500-4ad3-893a-df45bd50d622 artifact: version 2.0.0 in service fd00:1122:3344:103::25
crucible_pantry 75b220ba-a0f4-4872-8202-dc7c87f062d0 artifact: version 2.0.0 in service fd00:1122:3344:103::24
external_dns f6ec9c67-946a-4da3-98d5-581f72ce8bf0 artifact: version 2.0.0 in service fd00:1122:3344:103::23
internal_dns ea5b4030-b52f-44b2-8d70-45f15f987d01 artifact: version 2.0.0 in service fd00:1122:3344:3::1
internal_ntp f10a4fb9-759f-4a65-b25e-5794ad2d07d8 artifact: version 2.0.0 in service fd00:1122:3344:103::21
nexus 3eeb8d49-eb1a-43f8-bb64-c2338421c2c6 artifact: version 2.0.0 in service fd00:1122:3344:103::22

Collaborator Author

Here's my consolation, after chatting with @sunshowers:

This is only happening because of the following config option:

> # Set the add_zones_with_mupdate_override planner config to ensure that zone
> # adds happen despite zone image sources not being Artifact.
> set planner-config --add-zones-with-mupdate-override true
planner config updated:
*   add zones with mupdate override:   false -> true

If this switch wasn't set, and we mupdated into this situation, we would normally refuse to do any add/update operations at all, while in this forcefully corrupt state.

Collaborator

Hmm. @sunshowers, should we turn off this flag at this point in the test? (Is it weird that most of this test runs with that flag on?)

Contributor

+1 on this question; I know when I've had to update this test I haven't understood the changes I've made as well as I'd like. It's a pretty intricate test though so I'm not sure what to suggest for making it clearer.


.collect();
let second_sled_id = sled_ids[1];

// Use a different image source (artifact vs install dataset)
Collaborator

Actually, I wonder if this should produce an error if any Nexus zones have source "install dataset"? Because we don't know that the image is actually different.

Collaborator Author

I'm going to defer to the conversation in #8936 (comment) - I'm down to restrict the usage of the install dataset, but doing this requires patching tests that rain changed in #8921.

Once that merges, I'm willing to add more explicit checks, but none of the "zone add" logic in the planner should execute anyway if there are any InstallDataset zones in use.

hash: ArtifactHash([0x42; 32]),
};

// Add another Nexus zone with different image source - should increment generation
Collaborator

Since test_nexus_generation_assignment_new_generation() already tested that if you pass a particular generation to sled_add_zone_nexus() then you get a zone with that generation, I wonder if we should just have this test call determine_nexus_generation() with both cases: one with the same image and one with a new one. I don't think we really need to add the zones and check that they got what we expected.

Collaborator Author

I've removed this test out of necessity, having moved determine_nexus_generation out of the planner.

(Updated in 8d17221)

Comment on lines 1582 to 1584
/// - If any existing Nexus zone has the same image source, reuse its generation
/// - Otherwise, use the highest existing generation + 1
/// - If no existing zones exist, return an error
Collaborator

I do feel like this probably belongs in the planner. There are no callers within the builder.


let zones_currently_updating =
self.get_zones_not_yet_propagated_to_inventory();
if !zones_currently_updating.is_empty() {
info!(
Collaborator

It's a little surprising there's no update to report for this. But maybe that happens elsewhere?

Collaborator Author

I think this is a problem on main too. Filed #9047.

Comment on lines 2193 to 2194
// For Nexus, we need to confirm that the active generation has
// moved beyond this zone.
Collaborator

Calling this function for Nexus feels a little weird. For other zones, I think we're saying: "we're about to update this one zone, either by expunging it [and then subsequently adding one] or else replacing it. Can we do that now?" For Nexus, though, this will only return true after we've done the handoff to the new fleet.

I'm not sure it's worth reworking. Assuming not, I'd clarify this comment a bit:

Suggested change
// For Nexus, we need to confirm that the active generation has
// moved beyond this zone.
// For Nexus, we're only ready to "update" this zone once control has been handed off to a newer generation of Nexus zones. (Once that happens, we're not really going to update this zone, just expunge it.)

Collaborator Author

Refactored this a bit, but done in 4beb420.

@davepacheco davepacheco (Collaborator) left a comment

I had a few last suggestions but this is looking good to me! Given how tricky this is, I'd like to get @jgallagher's +1 too.

Comment on lines +1562 to +1571
let (nexus_updateable_zones, non_nexus_updateable_zones): (
Vec<_>,
Vec<_>,
) = out_of_date_zones
.into_iter()
.filter(|(_, zone, _)| {
self.are_zones_ready_for_updates(mgs_updates)
&& self.can_zone_be_shut_down_safely(&zone, &mut report)
})
.partition(|(_, zone, _)| zone.zone_type.is_nexus());
Collaborator

Take it or leave it: I'm wondering if we can make should_nexus_zone_be_expunged() more type-safe (so it can't panic on bad input) with something like (this is untested):

Suggested change
let (nexus_updateable_zones, non_nexus_updateable_zones): (
Vec<_>,
Vec<_>,
) = out_of_date_zones
.into_iter()
.filter(|(_, zone, _)| {
self.are_zones_ready_for_updates(mgs_updates)
&& self.can_zone_be_shut_down_safely(&zone, &mut report)
})
.partition(|(_, zone, _)| zone.zone_type.is_nexus());
let (nexus_updateable_zones, non_nexus_updateable_zones): (
Vec<_>,
Vec<_>,
) = out_of_date_zones
.into_iter()
.filter(|(_, zone, _)| {
self.are_zones_ready_for_updates(mgs_updates)
&& self.can_zone_be_shut_down_safely(&zone, &mut report)
})
.map(|(sled_id, zone, image_source)| {
let nexus_config = match &zone.z_type {
blueprint_zone_type::Nexus(nexus_config) => nexus_config,
_ => None,
};
(sled_id, zone, image_source, nexus_config)
})
.partition(|(_, _, _, nexus_config)| nexus_config.is_none());

Then we'll already have the nexus_config and can pass it to should_nexus_zone_be_expunged(). Not a big deal.

Collaborator Author

I would be more on board with this, but I think the type of nexus_config would be wrapped in an Option here, which would need to be unwrapped anyway?
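
A standalone toy example of the point being made here (dummy types, not the omicron ones): after the suggested map, the Nexus side of the partition still carries an Option that has to be unwrapped.

enum ZoneType {
    Nexus { generation: u64 },
    Crucible,
}

fn main() {
    let zones = vec![ZoneType::Nexus { generation: 3 }, ZoneType::Crucible];

    // Tag each zone with Option<nexus config>, as in the suggestion.
    let tagged: Vec<(&ZoneType, Option<u64>)> = zones
        .iter()
        .map(|zone| match zone {
            ZoneType::Nexus { generation } => (zone, Some(*generation)),
            ZoneType::Crucible => (zone, None),
        })
        .collect();

    // Partitioning on is_none() leaves Options on the Nexus side, so the
    // config still needs to be unwrapped before use.
    let (non_nexus, nexus): (Vec<_>, Vec<_>) =
        tagged.into_iter().partition(|(_, config)| config.is_none());
    for (_, config) in nexus {
        let generation = config.expect("partitioned on is_none, so Some here");
        println!("nexus generation: {generation}");
    }
    println!("{} non-nexus zones", non_nexus.len());
}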

// identify why it is not ready for update.
fn can_zone_be_updated(
//
// Precondition: zone must be a Nexus zone
Collaborator

Suggested change
// Precondition: zone must be a Nexus zone
// Precondition: zone must be a Nexus zone and be running an out-of-date image

(trying to communicate that this shouldn't be used in any case other than "update")

Collaborator Author

Updating phrasing in e1c9f06

* skipping noop zone image source check on sled d81c6a84-79b8-4958-ae41-ea46c9b19763: all 6 zones are already from artifacts
* 1 pending MGS update:
* model0:serial0: RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })
* only placed 0/2 desired nexus zones

Collaborator

This can be in a follow-on PR, but I feel like we should update this test to finish the update now. (If it's easy, it'd be nice to get in this PR -- it'd be a tidy confirmation that all is working.)

Collaborator Author

Updated in e1c9f06 - looks like it works!

(though it looks like it also was updated in #9059 by @jgallagher )

@jgallagher jgallagher (Contributor) left a comment

This LGTM too, thanks! Left a handful of minor nits and clarifying questions, but nothing blocking.

.unwrap();
let image_source = BlueprintZoneImageSource::InstallDataset;

// Add first Nexus zone - should get generation 1
Contributor

Two nitpicky questions:

  • Is it that it should get generation 1, or that it should get builder.parent_blueprint().nexus_generation since that's what we're passing in?
  • If the latter, do we still need this test now that the method isn't doing any logic to pick a generation and is just using whatever we pass in?

Collaborator Author

I first cleaned up this test to clarify "the output matches the input", but I agree with you - I think it's a bit silly to have a test asserting that the value is just a pass-through.

Removed in e1c9f06


// We may need to bump the top-level Nexus generation number
// to update Nexus zones.
let nexus_generation_bump = self.do_plan_nexus_generation_update()?;
Contributor

Should this be guarded by any of the "are prior steps still pending" checks like we have on updating zones above? My gut feeling is that in a normal update this wouldn't matter (we wouldn't be in a position for do_plan_nexus_generation_update() to do anything unless everything else is already done anyway). Maybe there are some weird cases where it might come up? Sled addition at a particularly unlucky time or something?

I think the answer is "no, don't guard it" - if we're in a position where we're ready to trigger a handoff, we should probably go ahead and do that even if the planner thinks there's now something else to do "earlier" in the update process, and the new Nexus can finish up whatever is left? But it seems like a weird enough case I wanted to ask.

Collaborator Author

I believe that inside the body of do_plan_nexus_generation_update, we're already guarding against the stuff we care about, and validating that these things have propagated to inventory (e.g., new Nexuses booting) when we care about it.

I just scanned through do_plan_nexus_generation_update - at any spot where we need to validate some property by iterating over zones, it seems to check both the blueprint and the pending changes in the sled_editors.

I believe that, in the case where we're adding a zone and then performing this check:

  • We'll be able to see the newly planned zone (e.g., when checking that all zones are on a newer image)
  • We won't see it in inventory, and could raise an error if we need to see it there (e.g., needing sufficient new Nexuses to actually be running)
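
A hedged sketch of the kind of guards being described (invented names and fields; the real checks live in do_plan_nexus_generation_update and consult both the in-progress blueprint and inventory):

struct HandoffReadiness {
    // Counted against the blueprint plus pending sled-editor changes.
    non_nexus_zones_out_of_date: usize,
    // Counted against inventory: the new Nexuses must actually be running,
    // not just planned.
    new_generation_nexuses_in_inventory: usize,
    target_redundancy: usize,
    top_level_generation: u64,
    newest_nexus_zone_generation: u64,
}

fn should_bump_nexus_generation(r: &HandoffReadiness) -> bool {
    r.non_nexus_zones_out_of_date == 0
        && r.new_generation_nexuses_in_inventory >= r.target_redundancy
        && r.top_level_generation < r.newest_nexus_zone_generation
}

fn main() {
    let ready = HandoffReadiness {
        non_nexus_zones_out_of_date: 0,
        new_generation_nexuses_in_inventory: 3,
        target_redundancy: 3,
        top_level_generation: 1,
        newest_nexus_zone_generation: 2,
    };
    assert!(should_bump_nexus_generation(&ready));
}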

//
// This presumably includes the currently-executing Nexus where
// this logic is being considered.
let Some(current_gen) = self.lookup_current_nexus_generation()? else {
Contributor

Should lookup_current_nexus_generation() return a Result<Generation, _> instead of a Result<Option<Generation>, _>? Or a different question: Can we get Ok(None) here from a well-formed parent blueprint?

Collaborator Author

Sure, this actually simplifies the space of possible reported states if we treat it like an error. Done in aef9586
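
A sketch of the simplified shape (invented error type): a missing running-Nexus generation in a well-formed parent blueprint is just an error, not a reportable state.

fn lookup_current_nexus_generation(current: Option<u64>) -> Result<u64, String> {
    current.ok_or_else(|| {
        "parent blueprint has no currently-running Nexus generation".to_string()
    })
}

fn main() {
    assert_eq!(lookup_current_nexus_generation(Some(2)), Ok(2));
    assert!(lookup_current_nexus_generation(None).is_err());
}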

panic!("did not converge after {MAX_PLANNING_ITERATIONS} iterations");
}

struct BlueprintGenerator {
Contributor

This work is already done so I wouldn't change it in this PR, but I don't love this - it feels like it's replicating what we built reconfigurator-cli to do for the tests it enables (set up a system, generate TUF repos, walk through various scenarios, etc., in a way that's easier to read and maintain than unit tests). Do you think we could replace these tests with reconfigurator-cli-based ones, or are they poking at things that would be hard to do there?

Collaborator Author

I'd be happy to take this on as a follow-up to this PR. Candidly I didn't realize reconfigurator-cli was flexible enough to support blueprint editing like we do in tests, but I can take a look at that now.


@smklein smklein merged commit d1bdb3c into main Sep 23, 2025
17 checks passed
@smklein smklein deleted the nexus_gen_usage branch September 23, 2025 02:12