Skip to content

Conversation

@clobrano
Copy link
Collaborator

Clear stale learner_node attribute during stop and on restart when no active resources exist, ensuring learner always waits for peer availability.

@knet-jenkins
Copy link

knet-jenkins bot commented Nov 10, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/1/input

Copy link

@jaypoulz jaypoulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left you a few questions. :)

local peer_url=$(ip_url $member_ip)

ocf_log info "add $member_name ($member_ip, $endpoint_url) to the member list as learner"
ocf_log info "add $member_name ($member_ip) to the member list as learner"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be checking if it's already a learner before we we add it as one? I see this error sometimes with the debug-start command, where one of the nodes is a learner, and thus add_member_as_learner fails when it is added again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. I shouldn't even try to add a new member if it finds it in the member list

Copy link
Collaborator Author

@clobrano clobrano Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It calls add_member_as_learner in two cases

  • in podman_etcd, when it just forced a new cluster, hence there can't be any learner
  • in monitor, when there is no peer in the member list.

I never tried with debug-start and force-new-cluster

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue in debug-start would be the first case, which means it just forced a new cluster. Could be a race condition, since I only see it in CI.

# promotion is expected to fail if the peer is not yet up-to-date
ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
return $OCF_SUCCESS
return $OCF_ERR_GENERIC

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we treat this as an error now? I know we need to retry this later, but as the comment says - it is OK for this to fail if we just not ready yet.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should update the code to react to the rc if we return an error here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we treat this as an error now?

I made this change to clean up standalone_node and learner_node attributes immediately after promotion. See https://github.com/clobrano/resource-agents/blob/99d36fa651ce8f3aadde818de166955f48a680d5/heartbeat/podman-etcd#L1065-L1079

The problem I wanted to address was that the attributes were not cleaned up immediately, but in the next monitor loop, meaning the member remained in a learner state for an additional 30 seconds (https://github.com/clusterlabs/resource-agents/blob/677e3add17957a59b3b96137ff3d39ca7b99b280/heartbeat/podman-etcd#L1065-L1079).

You should update the code to react to the rc if we return an error here.

While we need reconcile_member_state to check the return code, so it can skip attribute cleanup if there's a failure, manage_peer_membership (which is above in the call stack) should ignore it, because failing to promote a member shouldn't stop the agent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha - this distinguishes between cases where we need to clean up the attribute and not. I wonder if that was the case for my debug-start (since it would run multiple times in a row). If it had already forced a new cluster, the learner attribute may have been stale from the previous run.

promote_learner_member "$learner_member_id"
return $?
if ! promote_learner_member "$learner_member_id"; then
return $?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this returns and will never retry. (And that we'll not run into this in testing because our etcds are generally very small for new clusters).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use an OCF rc code here.

# promotion is expected to fail if the peer is not yet up-to-date
ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
return $OCF_SUCCESS
return $OCF_ERR_GENERIC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should update the code to react to the rc if we return an error here.

promote_learner_member "$learner_member_id"
return $?
if ! promote_learner_member "$learner_member_id"; then
return $?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use an OCF rc code here.

@oalbrigt oalbrigt changed the title OCPEDGE-2213: fix(podman-etcd): prevent learner from starting before cluster is ready OCPEDGE-2213: podman-etcd: fix to prevent learner from starting before cluster is ready Nov 11, 2025
@clobrano clobrano force-pushed the fix/learner-stale-attribute branch from 99d36fa to 22faf04 Compare November 12, 2025 15:12
@knet-jenkins
Copy link

knet-jenkins bot commented Nov 12, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/2/input

…cluster is ready

Clear stale learner_node attribute during stop and on restart when no
active resources exist, ensuring learner always waits for peer
availability.
@clobrano clobrano force-pushed the fix/learner-stale-attribute branch from 22faf04 to aa19b65 Compare November 13, 2025 10:54
@knet-jenkins
Copy link

knet-jenkins bot commented Nov 13, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/3/input

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants