
Conversation

@davepacheco
Collaborator

This is still a work in progress!

This is one of those PRs where it's hard to know when it's done. We could spend a lot of time improving this, but I think what's here now is still much better than nothing, so I'm inclined to land this early and then iterate in main. Still, I welcome feedback, both on this PR and in follow-ups! I do want to flesh out the obviously unfinished parts of the decision tree section (at the end) before landing this.

I'm also considering renaming this to the Reconfigurator Operator Guide because it's more than just debugging.

Contributor

@jgallagher left a comment


This is awesome!

...
```

NOTE: As mentioned above, host OS phase 2 updates are implemented in the same step as host phase 1 updates, even though the step is only labeled "host phase 1".
Contributor


Should we just change the label in the blueprint comment?

Collaborator Author


I think so.

...
```

Most other zones use add/expunge updates. These are done in multiple steps. The first step explicitly expunges the zone in advance of the update. Subsequent steps mark the expunged zone ready for cleanup and add the replacement. These subsequent steps are currently unlabeled in the `omdb reconfigurator history` output. So it looks like this for one zone update:
Contributor


Should we fix these unlabeled steps too?

Collaborator Author


Ideally, yes.

[#debug-stuck-MGS-update]
=== Debugging why an MGS-driven update appears stuck

// XXX-dap
Contributor


Do you want to flesh out these sections before merging, or merge and flesh out on main?

Collaborator Author


I plan to flesh these out before merging.


The top-level items in this list are strictly sequential. Control plane zone updates do not start until all MGS-driven updates are complete. Nexus handoff does not start until all other control plane zones have been updated.

MGS updates (e.g., SP updates) may be skipped if the new target release specifies the same versions for these components as are already deployed. This is common in development/test environments, though unexpected in customer environments. All releases, even development ones, have new versions of the host OS and control plane zones, so these are never skipped.
Contributor


I think "versions" is overloaded a bit here. The Host OS and control plane zone images could have the same binary image files but different artifact versions. Maybe specify that part explicitly?

Collaborator Author


Here, I meant the literal version number in the TUF artifact metadata for the SP image, RoT image, and RoT bootloader. I believe that today, the control plane zones include their version in their image, so different TUF repos necessarily contain different images and versions for them, and none of this can apply. I thought that was true of the host OS as well, but I'm not positive.
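To make the skip criterion above concrete, here's a minimal sketch of that kind of check: compare the version recorded in the target release's TUF artifact metadata against the version the deployed component reports. The types and function names are hypothetical illustrations, not the actual Reconfigurator planner code.

```rust
/// Hypothetical illustration of the "skip if already at the target version"
/// rule discussed above; not the actual Reconfigurator planner code.
#[derive(Debug, PartialEq, Eq)]
struct ArtifactVersion(String);

/// State reported for a deployed MGS-managed component (SP, RoT, or RoT bootloader).
struct DeployedComponent {
    active_version: ArtifactVersion,
}

/// Version recorded in the TUF artifact metadata of the new target release.
struct TargetArtifact {
    version: ArtifactVersion,
}

/// The MGS-driven update can be skipped when the target release carries the
/// same version that is already running on the component.
fn can_skip_mgs_update(deployed: &DeployedComponent, target: &TargetArtifact) -> bool {
    deployed.active_version == target.version
}

fn main() {
    // Illustrative version numbers only.
    let deployed = DeployedComponent {
        active_version: ArtifactVersion("1.0.38".to_string()),
    };
    let target = TargetArtifact {
        version: ArtifactVersion("1.0.38".to_string()),
    };
    // Same literal version in the new TUF repo as what's deployed: skip the update.
    assert!(can_skip_mgs_update(&deployed, &target));
}
```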


The time for an update is dominated by:

* Sled reboots. The process reboots each sled twice: once for the SP update and once for the host OS update. It generally takes 4-5 minutes after the SP update and 8 minutes after the host OS update before the system moves on to the next step.
Contributor


We should double-check this timing for Cosmo sleds.

Collaborator Author


I edited this to be more specific about what we've measured.
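For readers trying to estimate wall-clock time from these figures, a back-of-the-envelope sketch follows. It assumes, purely for illustration, that sleds are updated one at a time and uses only the reboot-dominated waits quoted above (4-5 minutes after an SP update, about 8 minutes after a host OS update); actual measured timings, especially on newer sled models, may differ.

```rust
/// Rough lower/upper bound on reboot-dominated update time, in minutes,
/// using the per-sled waits quoted in the guide. Assumes (illustratively)
/// that sleds are updated one at a time.
fn estimated_reboot_minutes(num_sleds: u32) -> (u32, u32) {
    let sp_wait = (4, 5); // minutes after the SP update (range quoted above)
    let host_os_wait = 8; // minutes after the host OS update

    (
        num_sleds * (sp_wait.0 + host_os_wait),
        num_sleds * (sp_wait.1 + host_os_wait),
    )
}

fn main() {
    // For example, 16 sleds spend roughly 3 to 3.5 hours in reboot waits alone.
    let (low, high) = estimated_reboot_minutes(16);
    println!("reboot waits: {low}-{high} minutes");
}
```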

@davepacheco marked this pull request as ready for review on November 7, 2025 at 20:03
@davepacheco
Collaborator Author

This is now as complete as I intended for the first draft, so it's ready for full review. I've also addressed the feedback so far. Thanks!

* During the brief periods when Crucible pantry zones are updated, in-flight Crucible operations like disk import may fail. The user will have to retry the operation.
* During the brief periods when Clickhouse or Oximeter is offline, some metric data may fail to be collected (so the metric data will be absent for that period) or queried.
* During the many brief periods when components or sleds are restarted, some instance start operations may fail if they or their disks get allocated to sleds or Crucible instances that are currently offline.
* During the many brief periods when components or sleds are restarted, some disk create operations may fail if they get allocated to Crucible instances that are currently offline.


Disk deletion will also fail (time out) if one of its downstairs is on a sled that is down.


Would it be worth mentioning here the possible guest failure modes (e.g., file system becomes read-only, prompting for fsck execution)?

The bit about NVMe I/O timeout kernel settings has been added to the user-facing troubleshooting guide (also linked from the System Update guide). For an OS that supports only up to a 255-second timeout, the setting doesn't really help cover the duration of a sled update, by the way.
