OCPEDGE-2213: podman-etcd: fix to prevent learner from starting before cluster is ready #2098

clobrano · 2025-11-10T15:07:34Z

Clear stale learner_node attribute during stop and on restart when no active resources exist, ensuring learner always waits for peer availability.


knet-jenkins · 2025-11-10T15:08:20Z

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/1/input

jaypoulz

Left you a few questions. :)

jaypoulz · 2025-11-10T20:59:21Z

heartbeat/podman-etcd

 	local peer_url=$(ip_url $member_ip)

-	ocf_log info "add $member_name ($member_ip, $endpoint_url) to the member list as learner"
+	ocf_log info "add $member_name ($member_ip) to the member list as learner"


Should we be checking if it's already a learner before we add it as one? I see this error sometimes with the debug-start command, where one of the nodes is a learner, and thus add_member_as_learner fails when it is added again.

Interesting. It shouldn't even try to add a new member if it finds it in the member list.

It calls add_member_as_learner in two cases:

1. in podman_etcd, when it has just forced a new cluster, hence there can't be any learner;

2. in monitor, when there is no peer in the member list.

I never tried with debug-start and force-new-cluster.

The issue in debug-start would be the first case, which means it just forced a new cluster. Could be a race condition, since I only see it in CI.
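The guard discussed in this thread (skip the add when the member is already listed) could be sketched roughly as below. This is only an illustration, not the agent's code: `member_is_listed` and `add_if_missing` are hypothetical helpers, the member list is a hard-coded sample, and the comma-separated row layout is an assumption that should be checked against the output of the etcd version in use.

```shell
#!/bin/sh
# Sample member list (assumed layout: ID, status, name, peer URL, client URL, is-learner).
sample_list='8e9e05c52164694d, started, node1, https://192.0.2.1:2380, https://192.0.2.1:2379, false
91bc3c398fb3c146, started, node2, https://192.0.2.2:2380, https://192.0.2.2:2379, true'

# Hypothetical helper: reads a member list on stdin and succeeds
# if any row has the given name in its third column.
member_is_listed() {
	awk -F', ' -v name="$1" '$3 == name { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: only report "add" when the member is not listed yet.
add_if_missing() {
	if printf '%s\n' "$sample_list" | member_is_listed "$1"; then
		echo "skip $1: already in the member list"
	else
		echo "add $1 as learner"
	fi
}

add_if_missing node2
add_if_missing node3
```

A check like this would not by itself close a race between listing and adding, so it is best read as a way to make repeated debug-start runs less noisy rather than as a complete fix.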

jaypoulz · 2025-11-10T21:01:51Z

heartbeat/podman-etcd

 		# promotion is expected to fail if the peer is not yet up-to-date
 		ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
-		return $OCF_SUCCESS
+		return $OCF_ERR_GENERIC


Why do we treat this as an error now? I know we need to retry this later, but as the comment says, it is OK for this to fail if we're just not ready yet.

You should update the code to react to the rc if we return an error here.

> Why do we treat this as an error now?

I made this change to clean up standalone_node and learner_node attributes immediately after promotion. See https://github.com/clobrano/resource-agents/blob/99d36fa651ce8f3aadde818de166955f48a680d5/heartbeat/podman-etcd#L1065-L1079

The problem I wanted to address was that the attributes were not cleaned up immediately, but in the next monitor loop, meaning the member remained in a learner state for an additional 30 seconds (https://github.com/clusterlabs/resource-agents/blob/677e3add17957a59b3b96137ff3d39ca7b99b280/heartbeat/podman-etcd#L1065-L1079).

> You should update the code to react to the rc if we return an error here.

While we need reconcile_member_state to check the return code, so it can skip attribute cleanup if there's a failure, manage_peer_membership (which is above in the call stack) should ignore it, because failing to promote a member shouldn't stop the agent.

Gotcha - this distinguishes between cases where we need to clean up the attribute and cases where we don't. I wonder if that was the case for my debug-start (since it would run multiple times in a row). If it had already forced a new cluster, the learner attribute may have been stale from the previous run.
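The return-code handling described in this thread can be sketched as follows. The function names come from the discussion, but every body here is a hypothetical stub: `PROMOTE_OK`, `ATTRS_CLEARED` and `clear_attributes` are invented for the illustration, and the OCF values are defined locally instead of being sourced from ocf-shellfuncs.

```shell
#!/bin/sh
# Stand-ins for values that ocf-shellfuncs normally provides.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stub: set PROMOTE_OK=1 to simulate a successful promotion.
promote_learner_member() {
	if [ "${PROMOTE_OK:-0}" -eq 1 ]; then
		return $OCF_SUCCESS
	fi
	return $OCF_ERR_GENERIC
}

# Hypothetical stub standing in for the real attribute cleanup.
clear_attributes() { ATTRS_CLEARED=1; }

# Checks the return code: attributes are cleaned up only after a successful promotion.
reconcile_member_state() {
	if ! promote_learner_member "$1"; then
		# Leave the attributes in place so a later monitor run can retry.
		return $OCF_ERR_GENERIC
	fi
	clear_attributes
	return $OCF_SUCCESS
}

# Higher in the call stack: a failed promotion must not fail the agent.
manage_peer_membership() {
	reconcile_member_state "$1" || true
	return $OCF_SUCCESS
}

ATTRS_CLEARED=0
PROMOTE_OK=0
manage_peer_membership "learner-id"; rc_fail=$?
cleared_after_failure=$ATTRS_CLEARED

PROMOTE_OK=1
manage_peer_membership "learner-id"; rc_ok=$?
cleared_after_success=$ATTRS_CLEARED

echo "failure: rc=$rc_fail cleared=$cleared_after_failure"
echo "success: rc=$rc_ok cleared=$cleared_after_success"
```

In both runs the outer function reports success, while the cleanup happens only in the second one, which is the distinction the reply is drawing.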

jaypoulz · 2025-11-10T21:02:52Z

heartbeat/podman-etcd

-		promote_learner_member "$learner_member_id"
-		return $?
+		if ! promote_learner_member "$learner_member_id"; then
+			return $?


I'm worried that this returns and will never retry. (And that we'll not run into this in testing because our etcds are generally very small for new clusters).

You should use an OCF rc code here.
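One reason an explicit code matters here: inside the `then` branch of `if ! cmd`, `$?` holds the status of the negated test, which is 0, so `return $?` ends up reporting success. A minimal sketch of the difference, where `promote_stub`, `negated_return` and `explicit_return` are hypothetical names and the OCF values are defined locally rather than sourced from ocf-shellfuncs:

```shell
#!/bin/sh
# Stand-ins for values that ocf-shellfuncs normally provides.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stub that always fails, simulating a learner that is not ready yet.
promote_stub() { return 3; }

# Pattern from the diff: "$?" is the status of "! promote_stub", which is 0.
negated_return() {
	if ! promote_stub; then
		return $?
	fi
	return $OCF_SUCCESS
}

# Alternative: return an explicit OCF code on failure.
explicit_return() {
	if ! promote_stub; then
		return $OCF_ERR_GENERIC
	fi
	return $OCF_SUCCESS
}

negated_rc=0
negated_return || negated_rc=$?
explicit_rc=0
explicit_return || explicit_rc=$?

echo "negated_return rc=$negated_rc, explicit_return rc=$explicit_rc"
```

With the stub failing, the first function returns 0 and the second returns 1, so only the second lets a caller tell that the promotion did not happen.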

oalbrigt · 2025-11-11T09:20:23Z

heartbeat/podman-etcd

 		# promotion is expected to fail if the peer is not yet up-to-date
 		ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
-		return $OCF_SUCCESS
+		return $OCF_ERR_GENERIC


You should update the code to react to the rc if we return an error here.

oalbrigt · 2025-11-11T09:20:48Z

heartbeat/podman-etcd

-		promote_learner_member "$learner_member_id"
-		return $?
+		if ! promote_learner_member "$learner_member_id"; then
+			return $?


You should use an OCF rc code here.

knet-jenkins · 2025-11-12T15:15:21Z

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/2/input


knet-jenkins · 2025-11-13T10:55:30Z

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2098/3/input

clobrano added 2 commits November 10, 2025 15:57

fix: podman-etcd should cleanup standalone/learner attributes when promotion succeeds

688c407

fix: remove misleading endpoint IP from log

6cd23a8

jaypoulz reviewed Nov 10, 2025


oalbrigt requested changes Nov 11, 2025


oalbrigt changed the title ~~OCPEDGE-2213: fix(podman-etcd): prevent learner from starting before cluster is ready~~ OCPEDGE-2213: podman-etcd: fix to prevent learner from starting before cluster is ready Nov 11, 2025

clobrano force-pushed the fix/learner-stale-attribute branch from 99d36fa to 22faf04 Compare November 12, 2025 15:12

clobrano requested review from jaypoulz and oalbrigt November 12, 2025 15:13

oalbrigt approved these changes Nov 13, 2025


OCPEDGE-2213: fix(podman-etcd): prevent learner from starting before cluster is ready

aa19b65

Clear stale learner_node attribute during stop and on restart when no active resources exist, ensuring learner always waits for peer availability.

clobrano force-pushed the fix/learner-stale-attribute branch from 22faf04 to aa19b65 Compare November 13, 2025 10:54
