
Conversation

@davepacheco
Collaborator

This is still a work in progress!

This is one of those PRs where it's hard to know when it's done. We could spend a lot of time improving this, but I think what's here now is still much better than nothing, so I'm inclined to land this early and then iterate in main. Still, I welcome feedback, both on this PR and in follow-ups! I do want to flesh out the obviously unfinished parts of the decision tree section (at the end) before landing this.

I'm also considering renaming this to the Reconfigurator Operator Guide because it's more than just debugging.

Contributor

@jgallagher left a comment


This is awesome!

...
```

NOTE: As mentioned above, host OS phase 2 updates are implemented in the same step as host phase 1 updates, even though the step is only labeled "host phase 1".
Contributor


Should we just change the label in the blueprint comment?

Collaborator Author


I think so.

...
```

Most other zones use add/expunge updates. These are done in multiple steps. The first step explicitly expunges the zone in advance of the update. Subsequent steps mark the expunged zone ready for cleanup and add the replacement. These subsequent steps are currently unlabeled in the `omdb reconfigurator history` output. So it looks like this for one zone update:
Contributor


Should we fix these unlabeled steps too?

Collaborator Author


Ideally, yes.

[#debug-stuck-MGS-update]
=== Debugging why an MGS-driven update appears stuck

// XXX-dap
Contributor


Do you want to flesh out these sections before merging, or merge and flesh out on main?

Collaborator Author


I plan to flesh these out before merging.


The top-level items in this list are strictly sequential. Control plane zone updates do not start until all MGS-driven updates are complete. Nexus handoff does not start until all other control plane zones have been updated.

MGS updates (e.g., SP updates) may be skipped if the new target release specifies the same versions for these components as are already deployed. This is common in development/test environments, though unexpected in customer environments. All releases, even development ones, have new versions of the host OS and control plane zones, so these are never skipped.
Contributor


I think "versions" is overloaded a bit here. The Host OS and control plane zone images could have the same binary image files but different artifact versions. Maybe specify that part explicitly?

Collaborator Author


Here, I meant the literal version number in the TUF artifact metadata for the SP image, RoT image, and RoT bootloader. I believe that today, the control plane zones include their version in their image, so different TUF repos necessarily contain different images and versions for them, and none of this can apply. I thought that was true of the host OS as well, but I'm not positive.
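To make the skip criterion above concrete, here's a minimal sketch of that kind of check: compare the version recorded in the target release's TUF artifact metadata against the version the deployed component reports. The types and function names are hypothetical illustrations, not the actual Reconfigurator planner code.

```rust
/// Hypothetical illustration of the "skip if already at the target version"
/// rule discussed above; not the actual Reconfigurator planner code.
#[derive(Debug, PartialEq, Eq)]
struct ArtifactVersion(String);

/// State reported for a deployed MGS-managed component (SP, RoT, or RoT bootloader).
struct DeployedComponent {
    active_version: ArtifactVersion,
}

/// Version recorded in the TUF artifact metadata of the new target release.
struct TargetArtifact {
    version: ArtifactVersion,
}

/// The MGS-driven update can be skipped when the target release carries the
/// same version that is already running on the component.
fn can_skip_mgs_update(deployed: &DeployedComponent, target: &TargetArtifact) -> bool {
    deployed.active_version == target.version
}

fn main() {
    // Illustrative version numbers only.
    let deployed = DeployedComponent {
        active_version: ArtifactVersion("1.0.38".to_string()),
    };
    let target = TargetArtifact {
        version: ArtifactVersion("1.0.38".to_string()),
    };
    // Same literal version in the new TUF repo as what's deployed: skip the update.
    assert!(can_skip_mgs_update(&deployed, &target));
}
```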


The time for an update is dominated by:

* Sled reboots. The process reboots each sled twice: once for the SP update and once for the host OS update. It generally takes 4-5 minutes after the SP update and 8 minutes after the host OS update before the system moves on to the next step.
Contributor


We should double-check this timing for Cosmo sleds.

Collaborator Author


I edited this to be more specific about what we've measured.
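For readers trying to estimate wall-clock time from these figures, a back-of-the-envelope sketch follows. It assumes, purely for illustration, that sleds are updated one at a time and uses only the reboot-dominated waits quoted above (4-5 minutes after an SP update, about 8 minutes after a host OS update); actual measured timings, especially on newer sled models, may differ.

```rust
/// Rough lower/upper bound on reboot-dominated update time, in minutes,
/// using the per-sled waits quoted in the guide. Assumes (illustratively)
/// that sleds are updated one at a time.
fn estimated_reboot_minutes(num_sleds: u32) -> (u32, u32) {
    let sp_wait = (4, 5); // minutes after the SP update (range quoted above)
    let host_os_wait = 8; // minutes after the host OS update

    (
        num_sleds * (sp_wait.0 + host_os_wait),
        num_sleds * (sp_wait.1 + host_os_wait),
    )
}

fn main() {
    // For example, 16 sleds spend roughly 3 to 3.5 hours in reboot waits alone.
    let (low, high) = estimated_reboot_minutes(16);
    println!("reboot waits: {low}-{high} minutes");
}
```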

@davepacheco marked this pull request as ready for review on November 7, 2025 at 20:03
@davepacheco
Collaborator Author

This is now as complete as I intended for the first draft, so it's ready for full review. I've also addressed the feedback so far. Thanks!

* During the brief periods when Crucible pantry zones are updated, in-flight Crucible operations like disk import may fail. The user will have to retry the operation.
* During the brief periods when Clickhouse or Oximeter is offline, some metric data may fail to be collected (so the metric data will be absent for that period) or queried.
* During the many brief periods when components or sleds are restarted, some instance start operations may fail if they or their disks get allocated to sleds or Crucible instances that are currently offline.
* During the many brief periods when components or sleds are restarted, some disk create operations may fail if they get allocated to Crucible instances that are currently offline.


Disk deletion will also fail (time out) if one of its downstairs is on a sled that is down.


Would it be worth mentioning here the possible guest failure modes (e.g., file system becomes read-only, prompting for fsck execution)?

The bit about NVMe I/O timeout kernel settings has been added to the user-facing troubleshooting guide (also linked from the System Update guide). For an OS that supports only up to a 255-second timeout, the setting doesn't really help cover the duration of a sled update, by the way.
