-
Notifications
You must be signed in to change notification settings - Fork 58
(7/N) Use nexus_generation, update it #8936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 48 commits
Commits
Show all changes
57 commits
Select commit
Hold shift + click to select a range
38a265b
(4/N) db_metadata_nexus database queries (handoff)
smklein 3e61a97
(5/N) Read database access records on boot
smklein e1b43c3
(6/N) Add image to planning report for zones
smklein 6b7e9ed
(7/N) Use nexus_generation, update it
smklein db263c1
Merge with main
smklein bbb204c
drop keys for InlineErrorChain
smklein 67505ec
Merge
smklein e1f2fe1
Merge
smklein 529874d
update nexus internal API
smklein 961338c
Actually create NotYet records when appropriate
smklein fed7aa0
Merge with main
smklein bf79e74
Merge
smklein 51b3103
Merge
smklein 9514deb
Fix database_nexus_access_create to use active Nexus
smklein ce31570
Make clippy happy
smklein 70d4ab9
Simplify database_nexus_access_create, move tests
smklein 690ea16
Unify queries for all db_metadata_nexus states
smklein f15fa4b
comments, enums, set_..._nexuses
smklein 5525529
determine_nexus_generation no longer returns an option
smklein c8126af
maybe when we say the gen bumps, we should do that
smklein 056829f
merge
smklein e3615a9
merge
smklein a38a811
move some generation checks out of planner, into blippy
smklein b4bcca7
get_zones_not_yet_propagated_to_inventory comments
smklein a6548ca
clarifying planner comments
smklein 448a39d
proposed
smklein 07d79f3
Test cleanup, make lack of nexus zones more of an error, delay deploy…
smklein 730a3b5
Ensure nexus can always be 'shut down' safely, refactor can_zone_be_u…
smklein f34422f
Merge
smklein 4939771
clippy
smklein aa76e15
Change how simulated reconfigurator creates planning input
smklein b4cfa2d
plumb nexus_generation as arg
smklein ff13b71
stop using --no-zones
smklein 2e04448
set nexus_generation explicitly during reconfigurator-cli tests
smklein 8067ac9
Also update reconfigurator-cli output
smklein e9b2e17
remove XXX comment
smklein 2ab3af2
Merge with main (mostly sled_add_zone_nexus stuff)
smklein 87ecf81
merge with main
smklein 02d6aae
Review feedback (BTreeSets, using PlanningInputBuilder, dedup setters)
smklein e8d0d03
Converge with 9023, to make the incoming merge easier
smklein ef82b33
merge
smklein 0ac6f92
merge
smklein f82ad3b
Continuing to reduce the diff
smklein 59194c0
Continuing to reduce diff
smklein deb3139
Report about the right zones
smklein e7c58ef
Throw errors on corrupt db/bp active gens
smklein d42a109
iterate over all zones
smklein ff3d86b
Update nexus last
smklein 7b67eaf
review feedback
smklein 8d17221
Review feedback, moving determine_nexus_gen (and tests) to planner
smklein b1ee973
update nexus tests to only upgrade between non-installdataset versions
smklein 4ea1405
trying to improve planner report
smklein 9843b4a
Minimize window where planner report shows up, test it
smklein 4beb420
More feedback
smklein f61d3fd
Merge
smklein e1c9f06
Finish target-release test, remove defunct test, nits
smklein aef9586
Error to not know current nexus generation
smklein File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's going on with these? I only skimmed the test so far but I think at this point in the test we're expecting to be doing an update, but we can't proceed because there are pending MGS updates. Why is it saying it only placed 0/2 desired Nexus zones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, after doing a little digging, I think this is kinda nasty, and probably should be considered related to #8921.
At this point in the test, there are three sleds:
This should normally not be possible, but I believe is occurring because the three sleds have been manually edited. They also appear to all have been constructed with "nexus_generation = 1", which matches the blueprint-level "nexus_generation".
(This would be flagged by blippy as a corrupt blueprint - using the same generation for different images - but I don't think anyone is checking in this test)
This means we give really weird input to the planner here: We say that all three Nexuses are active, because they're all in-service sleds running the desired version of Nexus.
Without the #8921 changes, we will try to proceed placing discretionary zones, even though there is an InstallDataset present. As a part of this, we find the "currently in-charge Nexus image", which happens to find
InstallDataset
first.Then, the planner tries to ensure that the "currently in-charge Nexus image" has sufficient redundancy. It sees that one sled has
(Nexus, InstallDataset)
, and wants "two more Nexuses at this image" to reach the target. This is where the "2" in "0/2" comes from.Next, the planner calls
add_discretionary_zones
, but theplace_zone
logic prevents zones of the same type from being placed on the same sled. Basically, all sleds report "I have a nexus already (ignoring the image source), so you can't place a new nexus here". This means we place zero new Nexuses.This triggers the
report.out_of_eligible_sleds
call, which creates this message in the output.I think the "pending MGS updates" logic would prevent zones from being updated, but this logic is happening in the context of "trying to restore redundancy, when the planner thinks we're running at reduced capacity" -- this is entirely in the domain of "zone add", not "zone update".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be super explicit: This "three-different-versions-at-once" behavior is also on main:
omicron/dev-tools/reconfigurator-cli/tests/output/cmds-mupdate-update-flow-stdout
Lines 1697 to 1841 in c3ea904
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's my consolation, after chatting with @sunshowers :
This is only happening because of the following config option:
If this switch wasn't set, and we mupdated into this situation, we would normally refuse to do any add/update operations at all, while in this forcefully corrupt state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. @sunshowers, should we turn off this flag at this point in the test? (Is it weird that most of this test runs with that flag on?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this question; I know when I've had to update this test I haven't understood the changes I've made as well as I'd like. It's a pretty intricate test though so I'm not sure what to suggest for making it clearer.