-
Notifications
You must be signed in to change notification settings - Fork 603
OCPEDGE-2213: podman-etcd: fix to prevent learner from starting before cluster is ready #2098
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/1/input |
jaypoulz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left you a few questions. :)
| local peer_url=$(ip_url $member_ip) | ||
|
|
||
| ocf_log info "add $member_name ($member_ip, $endpoint_url) to the member list as learner" | ||
| ocf_log info "add $member_name ($member_ip) to the member list as learner" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be checking if it's already a learner before we we add it as one? I see this error sometimes with the debug-start command, where one of the nodes is a learner, and thus add_member_as_learner fails when it is added again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I shouldn't even try to add a new member if it finds it in the member list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It calls add_member_as_learner in two cases
- in
podman_etcd, when it just forced a new cluster, hence there can't be any learner - in
monitor, when there is no peer in the member list.
I never tried with debug-start and force-new-cluster
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue in debug-start would be the first case, which means it just forced a new cluster. Could be a race condition, since I only see it in CI.
| # promotion is expected to fail if the peer is not yet up-to-date | ||
| ocf_log info "could not promote member $learner_member_id_hex, error code: $?" | ||
| return $OCF_SUCCESS | ||
| return $OCF_ERR_GENERIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we treat this as an error now? I know we need to retry this later, but as the comment says - it is OK for this to fail if we just not ready yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should update the code to react to the rc if we return an error here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we treat this as an error now?
I made this change to clean up standalone_node and learner_node attributes immediately after promotion. See https://github.com/clobrano/resource-agents/blob/99d36fa651ce8f3aadde818de166955f48a680d5/heartbeat/podman-etcd#L1065-L1079
The problem I wanted to address was that the attributes were not cleaned up immediately, but in the next monitor loop, meaning the member remained in a learner state for an additional 30 seconds (https://github.com/clusterlabs/resource-agents/blob/677e3add17957a59b3b96137ff3d39ca7b99b280/heartbeat/podman-etcd#L1065-L1079).
You should update the code to react to the rc if we return an error here.
While we need reconcile_member_state to check the return code, so it can skip attribute cleanup if there's a failure, manage_peer_membership (which is above in the call stack) should ignore it, because failing to promote a member shouldn't stop the agent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha - this distinguishes between cases where we need to clean up the attribute and not. I wonder if that was the case for my debug-start (since it would run multiple times in a row). If it had already forced a new cluster, the learner attribute may have been stale from the previous run.
heartbeat/podman-etcd
Outdated
| promote_learner_member "$learner_member_id" | ||
| return $? | ||
| if ! promote_learner_member "$learner_member_id"; then | ||
| return $? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that this returns and will never retry. (And that we'll not run into this in testing because our etcds are generally very small for new clusters).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use an OCF rc code here.
| # promotion is expected to fail if the peer is not yet up-to-date | ||
| ocf_log info "could not promote member $learner_member_id_hex, error code: $?" | ||
| return $OCF_SUCCESS | ||
| return $OCF_ERR_GENERIC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should update the code to react to the rc if we return an error here.
heartbeat/podman-etcd
Outdated
| promote_learner_member "$learner_member_id" | ||
| return $? | ||
| if ! promote_learner_member "$learner_member_id"; then | ||
| return $? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use an OCF rc code here.
99d36fa to
22faf04
Compare
|
Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/2/input |
…cluster is ready Clear stale learner_node attribute during stop and on restart when no active resources exist, ensuring learner always waits for peer availability.
22faf04 to
aa19b65
Compare
|
Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/3/input |
Clear stale
learner_nodeattribute during stop and on restart when no active resources exist, ensuring learner always waits for peer availability.