[sled-agent] Don't block HardwareMonitor when starting the switch zone if MGS isn't reachable #9002
Conversation
…ed_or_deactivated()
)
.await
.expect("Expected an infinite retry loop getting our switch ID");
let switch_slot = mgs_client
We no longer retry this forever here, because our caller will retry if we return an error. (There are still spots later in this function that go into infinite retry loops, so it's possible for MGS to be healthy, we get a successful response here, then get stuck in one of those. But one fix at a time.)
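Roughly, the shape of that change is something like this (a sketch only; `MgsClient` and `local_switch_id` are stand-ins, not the actual gateway client API):

```rust
// Stand-ins for the real MGS client and its error type.
struct MgsClient;
#[derive(Debug)]
struct MgsError;

impl MgsClient {
    async fn local_switch_id(&self) -> Result<u16, MgsError> {
        // ...ask MGS which switch slot we're attached to...
        Ok(0)
    }
}

// Before: this step was wrapped in an infinite retry loop, so an unreachable
// MGS wedged whoever awaited it. After: a single attempt; an Err propagates
// to the caller, which already retries this whole operation.
async fn switch_slot(mgs_client: &MgsClient) -> Result<u16, MgsError> {
    mgs_client.local_switch_id().await
}
```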
Confirmed by reading the code that the retry at the higher level does occur.
// If we've given the switch an underlay address, we also need to inject
// SMF properties so that tfport uplinks can be created.
if let Some((ip, Some(rack_network_config))) = underlay_info {
    self.ensure_switch_zone_uplinks_configured(ip, rack_network_config)
This is where we used to get stuck retrying forever; instead, we now do this uplink configuration either
a) inside the async task we were already spawning, if we're starting the switch zone for the first time, or
b) inside a new async task we now spawn, if we're reconfiguring the switch zone because we just got our network config from RSS.
A rough sketch of the restructuring follows below.
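The sketch below is illustrative only; `SwitchZoneManager`, `ensure_uplinks_configured_loop`, and the other names are placeholders, not the real identifiers in services.rs.

```rust
use tokio::sync::oneshot;

#[derive(Clone)]
struct SwitchZoneManager;

impl SwitchZoneManager {
    // Retries until it succeeds or `exit_rx` fires; it only ever blocks the
    // task it runs in.
    async fn ensure_uplinks_configured_loop(
        &self,
        _exit_rx: &mut oneshot::Receiver<()>,
    ) {
        // ...retry loop elided...
    }

    // Case (a): first-time start. The uplink step rides along in the task we
    // were already spawning to bring up the switch zone.
    fn start_switch_zone(&self, mut exit_rx: oneshot::Receiver<()>) {
        let me = self.clone();
        tokio::spawn(async move {
            // ...start the zone, wait for its services, etc...
            me.ensure_uplinks_configured_loop(&mut exit_rx).await;
        });
    }

    // Case (b): the zone is already running and we just got our network
    // config from RSS; spawn a new task rather than awaiting inline.
    fn reconfigure_switch_zone(&self, mut exit_rx: oneshot::Receiver<()>) {
        let me = self.clone();
        tokio::spawn(async move {
            me.ensure_uplinks_configured_loop(&mut exit_rx).await;
        });
    }
}
```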
}
let me = self.clone();
let (exit_tx, exit_rx) = oneshot::channel();
*worker = Some(Task {
This is the new task we spawn in the "reconfiguring an existing switch zone" case.
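For reference, the handle-plus-exit-channel pattern in play here looks roughly like this (the field names and `stop` method are illustrative, not the actual `Task` definition):

```rust
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// The `worker` slot holds both the spawned task and a way to tell it to
// stop, so a later shutdown or reconfiguration can interrupt its retry loops.
struct Task {
    exit_tx: oneshot::Sender<()>,
    handle: JoinHandle<()>,
}

impl Task {
    async fn stop(self) {
        // Ask the task to exit, then wait for it to actually finish.
        let _ = self.exit_tx.send(());
        let _ = self.handle.await;
    }
}
```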
// Then, if we have underlay info, go into a loop trying to configure
// our uplinks. As above, retry until we succeed or are told to stop.
if let Some(underlay_info) = underlay_info {
    self.ensure_switch_zone_uplinks_configured_loop(
This extends the task we were already spawning for the "start the switch zone" case.
sled-agent/src/services.rs (Outdated)
// TODO-correctness How can we have an underlay IP without a rack
// network config??
I wonder: do we even get here without receiving the sled agent config? Maybe this can be tied to SledAgentState somehow?
I don't think we do, but it looked tricky to fix up the types. I'll take another look and either try fixing it or file an issue with some details.
It looks like the Option comes from the bootstore.get_network_config() call. This call has to return an option because, before RSS runs and propagates the config to all nodes, the bootstore configuration is unset. So maybe, inside get_network_config, we should wait for a Some and then remove the optionality later on.
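As a sketch of that suggestion (simplified; the real accessor presumably returns a Result as well, and the types and names here are placeholders):

```rust
use std::time::Duration;

// Placeholder for the early network config stored in the bootstore.
struct EarlyNetworkConfig;

struct Bootstore;

impl Bootstore {
    // Existing behavior: `None` until RSS has run and the config has
    // propagated to this node.
    async fn get_network_config(&self) -> Option<EarlyNetworkConfig> {
        None
    }

    // Suggested wrapper: poll until the config shows up, so downstream code
    // never has to handle the `None` case.
    async fn wait_for_network_config(&self) -> EarlyNetworkConfig {
        loop {
            if let Some(config) = self.get_network_config().await {
                return config;
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    }
}
```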
Theoretically, if the bootstore early network config is updated, we'll need to do some reconfiguration, but that happens elsewhere and I believe is driven by an RPW.
Ahh, so on a cold boot, there is (or might be?) a window of time where we know our IP (because we ledger it ourselves) but don't know the RackNetworkConfig yet (because we have to unlock the bootstore first)?
If that's right I should probably just remove this comment and keep things as-is.
We ledger the RackNetworkConfig as well. The issue is that the RackNetworkConfig is written to a single (or a few) bootstore nodes and replicated between them. I think it may be possible that some nodes haven't yet learned the RackNetworkConfig on cold boot because the crash happened before the gossip. But this is only detectable if the option is None. If there was an old version and a new version hasn't propagated, this is not detectable locally.
There is no "unlocking the bootstore". The bootstore is used solely to enable configuring the network so we can establish time sync with an external NTP server, so that when we do unlock the control plane, CRDB actually works.
Thanks; I'll reword this comment.
AFAICT, this is correct according to the PR description.
No further reviewing will give me more clarity, as this switch zone/early networking code is a rat's nest and could use some love.
    return;
};

loop {
IIUC we loop here waiting for MGS rather than blocking the early networking because this code runs in its own task. That seems like the crux of the fix.
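The rough shape of that loop (identifiers are illustrative): because it runs inside the switch zone's own task, an unreachable MGS only stalls this task, and the exit channel still lets us tear the zone down cleanly.

```rust
use std::time::Duration;
use tokio::sync::oneshot;

// Stand-in for the real MGS client call that reports our switch slot.
async fn query_mgs_for_switch_slot() -> Result<u16, ()> {
    Err(())
}

async fn wait_for_switch_slot(mut exit_rx: oneshot::Receiver<()>) -> Option<u16> {
    loop {
        match query_mgs_for_switch_slot().await {
            Ok(slot) => return Some(slot),
            Err(_) => {
                // MGS isn't reachable yet; sleep and retry, unless the switch
                // zone is being shut down in the meantime.
                tokio::select! {
                    _ = &mut exit_rx => return None,
                    _ = tokio::time::sleep(Duration::from_secs(5)) => {}
                }
            }
        }
    }
}
```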
See #8970 for context. This doesn't fix all the ways we could get wedged if the switch zone is unhealthy after starting it up, but it does fix the case where it's so broken not even MGS is functional.
The changes here are pretty precarious (and took a bunch of retries to get something that worked!). I ran them on dublin while doing some testing for #8480, and was successfully able to start and stop the switch zone even if the sidecar was powered off and MGS was unresponsive.
I'll leave some comments on the changes below to point out details, but in general I really think #8970 warrants a bigger rework - maybe something along the lines of sled-agent-config-reconciler, except limited in scope to managing and configuring the switch zone.