@Neilhamza Neilhamza commented Sep 16, 2025

  • What I did
    Added a new MachineConfig template file under templates/master/00-master/two-node-with-fencing/files/ that installs the fencing_validator.sh script to /usr/local/bin/ on control-plane nodes for Two-Node Fencing clusters.

  • How to verify it

Deploy a Two-Node Fencing cluster.

Verify the MachineConfig for masters includes the new file.

On a master node, run:

oc debug node/ -- chroot /host ls -l /usr/local/bin/fencing_validator.sh
oc debug node/ -- chroot /host /usr/local/bin/fencing_validator.sh --help

The script should be present, executable (0755), and runnable.

  • Description for the changelog
    Ship /usr/local/bin/fencing_validator.sh via MCO for Two-Node Fencing clusters.
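
The second verification step above (checking that the master MachineConfig includes the new file) could be scripted by looking for the path in the MachineConfig's Ignition payload. This is only a sketch: the helper name `mc_has_file` is hypothetical, the MachineConfig name `00-master` is an assumption based on the template directory, and `jq` must be available where the command runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a MachineConfig as JSON on stdin and succeeds
# if its Ignition config lists a file with the given path.
mc_has_file() {
  local path="$1"
  jq -e --arg p "$path" \
    'any(.spec.config.storage.files[]?; .path == $p)' >/dev/null
}

# Usage against a live cluster (the MachineConfig name is an assumption):
#   oc get machineconfig 00-master -o json \
#     | mc_has_file /usr/local/bin/fencing_validator.sh && echo present
```

Reading JSON from stdin keeps the helper independent of how the MachineConfig is fetched, so the same check works on a saved manifest.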

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) Sep 16, 2025

openshift-ci-robot commented Sep 16, 2025

@Neilhamza: This pull request references OCPEDGE-2188 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.21.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Sep 16, 2025

openshift-ci bot commented Sep 16, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Neilhamza
Once this PR has been reviewed and has the lgtm label, please assign cheesesashimi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Neilhamza Neilhamza changed the title from "[WIP] OCPEDGE-2188: embed fencing validator into TNF MCO" to "OCPEDGE-2188: embed fencing validator into TNF MCO" Sep 16, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) Sep 16, 2025

openshift-ci bot commented Sep 16, 2025

@Neilhamza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-mco-disruptive | 9a33dea | link | false | /test e2e-gcp-mco-disruptive |
| ci/prow/e2e-gcp-op-ocl | 9a33dea | link | false | /test e2e-gcp-op-ocl |
| ci/prow/bootstrap-unit | 9a33dea | link | false | /test bootstrap-unit |
| ci/prow/e2e-azure-ovn-upgrade-out-of-change | 9a33dea | link | false | /test e2e-azure-ovn-upgrade-out-of-change |
| ci/prow/e2e-aws-mco-disruptive | 9a33dea | link | false | /test e2e-aws-mco-disruptive |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@eggfoobar eggfoobar left a comment

Looking good. I have some suggestions and questions. This is my initial pass; I'll give it another review once I deploy and test it on a cluster.

@@ -0,0 +1,502 @@
mode: 0755
path: "/usr/local/bin/fencing_validator.sh"

Since this is going into the bin folder, renaming it to just fencing_validator might give an easier user experience: you would run fencing_validator --help instead of fencing_validator.sh --help. That way we don't leak an implementation detail to the end user, and we can change things in the future while keeping the same experience.

@jaypoulz What do you think?

OC_REQ_TIMEOUT="${OC_REQ_TIMEOUT:-10s}"
CMD_EXEC_TIMEOUT_SECS="${CMD_EXEC_TIMEOUT_SECS:-60s}"

# -------- Exit codes --------

Out of curiosity, are these exit codes following a predefined meaning or just arbitrary?
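
For context on this question: the shell itself gives special meaning to only a few statuses (126 when a command cannot be executed, 127 when it is not found, and 128+N when it is terminated by signal N), so values in the 20s do not collide with those. A caller could translate the script's codes as in the sketch below. The `describe_exit` helper and the wording of each message are assumptions inferred from the variable names in the diff.

```shell
#!/usr/bin/env bash
# Codes copied from the diff under review.
EXIT_DAEMONS_BAD=22
EXIT_ETCD_NOT_READY=23
EXIT_ETCD_FATAL=24

# Hypothetical helper: map an exit status to a short description.
describe_exit() {
  case "$1" in
    0)                      echo "success" ;;
    "$EXIT_DAEMONS_BAD")    echo "daemon check failed" ;;
    "$EXIT_ETCD_NOT_READY") echo "etcd not ready" ;;
    "$EXIT_ETCD_FATAL")     echo "etcd fatal error" ;;
    126|127)                echo "shell could not run the command" ;;
    *)                      echo "unclassified failure ($1)" ;;
  esac
}
```

A wrapper could then call the validator, capture `$?`, and pass it to `describe_exit` to produce a readable summary.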

EXIT_DAEMONS_BAD=22
EXIT_ETCD_NOT_READY=23
EXIT_ETCD_FATAL=24
EXIT_REFUSE_FENCE_UNSTABLE=30

The EXIT_REFUSE_FENCE_UNSTABLE variable no longer seems to be used. Is there a condition that would trigger it? If not, we should remove it for now to avoid noise.

usage() {
cat <<'EOF'
Usage:
fencing-validator [--user <ssh-user>] [--ssh-key <path>]

We need to update this so the usage text matches how the command is actually invoked, either fencing_validator or fencing-validator.
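
One way to keep the usage text from drifting away from the installed name is to derive it from `$0`. A minimal sketch, assuming the heredoc can be left unquoted so the variable expands; `PROG` is a hypothetical variable name, not something in the PR.

```shell
#!/usr/bin/env bash
# Name the script was invoked as, with any directory prefix stripped.
PROG="${0##*/}"

usage() {
  # Unquoted EOF so that ${PROG} expands; the PR's heredoc uses 'EOF'.
  cat <<EOF
Usage:
  ${PROG} [--user <ssh-user>] [--ssh-key <path>]
EOF
}
```

With an unquoted heredoc, any literal `$` or backtick elsewhere in the usage text would need escaping.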

EOF
}

log(){ printf '\033[36m[INFO]\033[0m %s\n' "$*"; }

We should prefix these logging helpers with log_ so their purpose is clear:

log_info # or keep as log since it's the default behavior
log_warn
log_err
log_ok
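
A sketch of what the renamed helpers could look like. The `[INFO]` line mirrors the `log` function in the diff; the other tags, the colors, the `_log` formatter, and the choice to send warnings and errors to stderr are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Shared formatter: ANSI color code, tag, then the message.
_log() {
  local color="$1" tag="$2"
  shift 2
  printf '\033[%sm[%s]\033[0m %s\n' "$color" "$tag" "$*"
}

log_info() { _log 36 INFO "$@"; }      # cyan, same as the PR's log()
log_ok()   { _log 32 OK   "$@"; }      # green (assumed)
log_warn() { _log 33 WARN "$@" >&2; }  # yellow (assumed), to stderr
log_err()  { _log 31 ERR  "$@" >&2; }  # red (assumed), to stderr
```

Routing everything through one formatter keeps the tag layout consistent if the format changes later.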

fi
}

etcd_two_started() {

I'm a bit confused by the "two" part of this name. It seems this function gets called with both NODE_A and NODE_B as input, so what is "two" referring to?


etcd_two_started() {
local tgt="$1" out rc
out="$(host_run "$tgt" "podman exec etcd sh -lc 'ETCDCTL_API=3 etcdctl member list -w table'" 2>&1)"; rc=$?

It seems like we're doing a lot of logic based on this output. Can we fetch it as JSON and use jq to validate it?
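
A rough sketch of the JSON approach. `etcdctl member list` accepts `-w json`; the field names used below (`members`, `name`) and the idea that a member which has not started yet reports no name are assumptions to verify against the etcd version shipped in the cluster. The helper name `etcd_started_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `etcdctl member list -w json` output on stdin
# and prints how many members report a non-empty name.
etcd_started_count() {
  jq '[.members[]? | select((.name // "") != "")] | length'
}

# Usage inside the script might look like this (host_run is the PR's helper):
#   host_run "$tgt" "podman exec etcd sh -lc 'ETCDCTL_API=3 etcdctl member list -w json'" \
#     | etcd_started_count
```

Comparing the printed count with 2 would replace the table parsing with a single numeric test.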

fi

for ip in "$IP_A" "$IP_B"; do
awk -F'|' -v ip="$ip" '

If we have the output as JSON, we should be able to check that both IPs exist with jq here. wdyt?
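
The IP check could then be sketched as below. The helper names are hypothetical, the `peerURLs` field name is an assumption about the JSON shape, and the match is a plain substring test on `//<ip>:`, so bracketed IPv6 URLs would need different handling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads member-list JSON on stdin and succeeds if some
# member advertises a peer URL of the form <scheme>://<ip>:<port>.
etcd_has_peer_ip() {
  local ip="$1"
  jq -e --arg ip "$ip" \
    'any(.members[]?.peerURLs[]?; contains("//" + $ip + ":"))' >/dev/null
}

# Check both node IPs against the same JSON (IP_A and IP_B come from the PR).
etcd_has_both_ips() {
  local json ip
  json="$(cat)"
  for ip in "$IP_A" "$IP_B"; do
    printf '%s' "$json" | etcd_has_peer_ip "$ip" || return 1
  done
}
```

Including the trailing colon in the match keeps an address such as 192.0.2.1 from matching 192.0.2.10.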

return 1
}

wait_etcd(){

Maybe we rename this function to be in line with the other etcd helpers, for example etcd_wait, etcd_ready, or etcd_started.

fence "$PCMK_B"
wait_not_ready "$NODE_B"; wait_ready "$NODE_B"; wait_etcd; check_daemon_status || exit $EXIT_DAEMONS_BAD

ok "Disruptive validation PASSED"
\ No newline at end of file

Let's add a newline here. It's typically not an issue, but just in case.
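
A missing final newline can matter for shell tooling: for example, a plain `while read` loop skips a last line that is not newline-terminated. A quick check, sketched with a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the file is empty or ends with a newline.
# Command substitution strips a trailing newline, so a non-empty result
# means the last byte is something else.
ends_with_newline() {
  [ -z "$(tail -c 1 "$1")" ]
}
```

Running it over the template's embedded script before committing would catch the diff's "No newline at end of file" marker early.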
