[sled-agent] Don't block HardwareMonitor when starting the switch zone if MGS isn't reachable #9002
Conversation
…ed_or_deactivated()
)
.await
.expect("Expected an infinite retry loop getting our switch ID");
let switch_slot = mgs_client
We no longer retry this forever here, because our caller will retry if we return an error. (There are still spots later in this function that go into infinite retry loops, so it's possible for MGS to be healthy, we get a successful response here, then get stuck in one of those. But one fix at a time.)
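Roughly, the shape of that change is something like this (a sketch only; `MgsClient` and `local_switch_id` are stand-ins, not the actual gateway client API):

```rust
// Stand-ins for the real MGS client and its error type.
struct MgsClient;
#[derive(Debug)]
struct MgsError;

impl MgsClient {
    async fn local_switch_id(&self) -> Result<u16, MgsError> {
        // ...ask MGS which switch slot we're attached to...
        Ok(0)
    }
}

// Before: this step was wrapped in an infinite retry loop, so an unreachable
// MGS wedged whoever awaited it. After: a single attempt; an Err propagates
// to the caller, which already retries this whole operation.
async fn switch_slot(mgs_client: &MgsClient) -> Result<u16, MgsError> {
    mgs_client.local_switch_id().await
}
```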
Confirmed by reading the code that the retry at the higher level does occur.
// If we've given the switch an underlay address, we also need to inject
// SMF properties so that tfport uplinks can be created.
if let Some((ip, Some(rack_network_config))) = underlay_info {
    self.ensure_switch_zone_uplinks_configured(ip, rack_network_config)
This is where we used to get stuck retrying forever; instead, we now do this uplink configuration either
a) inside the async task we were already spawning, if we're starting the switch zone for the first time, or
b) inside a new async task we now spawn, if we're reconfiguring the switch zone because we just got our network config from RSS.
A rough sketch of the restructuring follows below.
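The sketch below is illustrative only; `SwitchZoneManager`, `ensure_uplinks_configured_loop`, and the other names are placeholders, not the real identifiers in services.rs.

```rust
use tokio::sync::oneshot;

#[derive(Clone)]
struct SwitchZoneManager;

impl SwitchZoneManager {
    // Retries until it succeeds or `exit_rx` fires; it only ever blocks the
    // task it runs in.
    async fn ensure_uplinks_configured_loop(
        &self,
        _exit_rx: &mut oneshot::Receiver<()>,
    ) {
        // ...retry loop elided...
    }

    // Case (a): first-time start. The uplink step rides along in the task we
    // were already spawning to bring up the switch zone.
    fn start_switch_zone(&self, mut exit_rx: oneshot::Receiver<()>) {
        let me = self.clone();
        tokio::spawn(async move {
            // ...start the zone, wait for its services, etc...
            me.ensure_uplinks_configured_loop(&mut exit_rx).await;
        });
    }

    // Case (b): the zone is already running and we just got our network
    // config from RSS; spawn a new task rather than awaiting inline.
    fn reconfigure_switch_zone(&self, mut exit_rx: oneshot::Receiver<()>) {
        let me = self.clone();
        tokio::spawn(async move {
            me.ensure_uplinks_configured_loop(&mut exit_rx).await;
        });
    }
}
```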
}
let me = self.clone();
let (exit_tx, exit_rx) = oneshot::channel();
*worker = Some(Task {
This is the new task we spawn in the "reconfiguring an existing switch zone" case.
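For reference, the handle-plus-exit-channel pattern in play here looks roughly like this (the field names and `stop` method are illustrative, not the actual `Task` definition):

```rust
use tokio::sync::oneshot;
use tokio::task::JoinHandle;

// The `worker` slot holds both the spawned task and a way to tell it to
// stop, so a later shutdown or reconfiguration can interrupt its retry loops.
struct Task {
    exit_tx: oneshot::Sender<()>,
    handle: JoinHandle<()>,
}

impl Task {
    async fn stop(self) {
        // Ask the task to exit, then wait for it to actually finish.
        let _ = self.exit_tx.send(());
        let _ = self.handle.await;
    }
}
```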
// Then, if we have underlay info, go into a loop trying to configure
// our uplinks. As above, retry until we succeed or are told to stop.
if let Some(underlay_info) = underlay_info {
    self.ensure_switch_zone_uplinks_configured_loop(
This extends the task we were already spawning for the "start the switch zone" case.
sled-agent/src/services.rs (Outdated)
// TODO-correctness How can we have an underlay IP without a rack
// network config??
I wonder: do we even get here without receiving the sled agent config? Maybe this can be tied to SledAgentState somehow?
I don't think we do, but it looked tricky to fix up the types. I'll take another look and either try fixing it or file an issue with some details.
It looks like the Option comes from the bootstore.get_network_config() call. This call has to return an option because, before RSS runs and propagates the config to all nodes, the bootstore configuration is unset. So maybe, inside get_network_config, we should wait for a Some and then remove the optionality later on.
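As a sketch of that suggestion (simplified; the real accessor presumably returns a Result as well, and the types and names here are placeholders):

```rust
use std::time::Duration;

// Placeholder for the early network config stored in the bootstore.
struct EarlyNetworkConfig;

struct Bootstore;

impl Bootstore {
    // Existing behavior: `None` until RSS has run and the config has
    // propagated to this node.
    async fn get_network_config(&self) -> Option<EarlyNetworkConfig> {
        None
    }

    // Suggested wrapper: poll until the config shows up, so downstream code
    // never has to handle the `None` case.
    async fn wait_for_network_config(&self) -> EarlyNetworkConfig {
        loop {
            if let Some(config) = self.get_network_config().await {
                return config;
            }
            tokio::time::sleep(Duration::from_secs(1)).await;
        }
    }
}
```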
Theoretically, if the bootstore early network config is updated, we'll need to do some reconfiguration, but that happens elsewhere and I believe is driven by an RPW.
Ahh, so on a cold boot, there is (or might be?) a window of time where we know our IP (because we ledger it ourselves) but don't know the RackNetworkConfig yet (because we have to unlock the bootstore first)?
If that's right I should probably just remove this comment and keep things as-is.
We ledger the RackNetworkConfig as well. The issue is that the RackNetworkConfig is written to a single (or a few) bootstore nodes and replicated between them. I think it may be possible that some nodes haven't yet learned the RackNetworkConfig on cold boot because the crash happened before the gossip. But this is only detectable if the option is None. If there was an old version and a new version hasn't propagated, this is not detectable locally.
There is no "unlocking the bootstore". The bootstore is used solely to enable configuring the network so we can establish time sync with an external NTP server, so that when we do unlock the control plane, CRDB actually works.
Thanks; I'll reword this comment.
AFAICT, this is correct according to the PR description.
No further reviewing will give me more clarity, as this switch zone/early networking code is a rat's nest and could use some love.
    return;
};

loop {
IIUC we loop here waiting for MGS rather than blocking the early networking because this code runs in its own task. That seems like the crux of the fix.
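The rough shape of that loop (identifiers are illustrative): because it runs inside the switch zone's own task, an unreachable MGS only stalls this task, and the exit channel still lets us tear the zone down cleanly.

```rust
use std::time::Duration;
use tokio::sync::oneshot;

// Stand-in for the real MGS client call that reports our switch slot.
async fn query_mgs_for_switch_slot() -> Result<u16, ()> {
    Err(())
}

async fn wait_for_switch_slot(mut exit_rx: oneshot::Receiver<()>) -> Option<u16> {
    loop {
        match query_mgs_for_switch_slot().await {
            Ok(slot) => return Some(slot),
            Err(_) => {
                // MGS isn't reachable yet; sleep and retry, unless the switch
                // zone is being shut down in the meantime.
                tokio::select! {
                    _ = &mut exit_rx => return None,
                    _ = tokio::time::sleep(Duration::from_secs(5)) => {}
                }
            }
        }
    }
}
```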
See #8970 for context. This doesn't fix all the ways we could get wedged if the switch zone is unhealthy after starting it up, but it does fix the case where it's so broken not even MGS is functional.
The changes here are pretty precarious (and took a bunch of retries to get something that worked!). I ran them on dublin while doing some testing for #8480, and was successfully able to start and stop the switch zone even if the sidecar was powered off and MGS was unresponsive.
I'll leave some comments on the changes below to point out details, but in general I really think #8970 warrants a bigger rework - maybe something along the lines of sled-agent-config-reconciler, except limited in scope to managing and configuring the switch zone.