OCPEDGE-2188: embed fencing validator into TNF MCO #5285
base: main
Signed-off-by: nhamza <[email protected]>
@Neilhamza: This pull request references OCPEDGE-2188, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets. The story was expected to target version "4.21.0", but no target version was set.
Looking good. I have some suggestions and questions. This is my initial pass; I'll give it another review once I deploy and test it on a cluster.
@@ -0,0 +1,502 @@
mode: 0755
path: "/usr/local/bin/fencing_validator.sh"
Since this is going into the bin folder, renaming it to just `fencing_validator` might give a better user experience: you would run `fencing_validator --help` instead of `fencing_validator.sh --help`. This way we don't leak an implementation detail to the end user, and we can change things in the future while keeping the same experience.
@jaypoulz What do you think?
OC_REQ_TIMEOUT="${OC_REQ_TIMEOUT:-10s}"
CMD_EXEC_TIMEOUT_SECS="${CMD_EXEC_TIMEOUT_SECS:-60s}"

# -------- Exit codes --------
Out of curiosity, do these exit codes follow a predefined convention, or are they arbitrary?
EXIT_DAEMONS_BAD=22
EXIT_ETCD_NOT_READY=23
EXIT_ETCD_FATAL=24
EXIT_REFUSE_FENCE_UNSTABLE=30
The `EXIT_REFUSE_FENCE_UNSTABLE` variable no longer seems to be used. Is there a condition that would trigger it? If not, we should remove it for now to avoid noise.
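A quick way to confirm whether a constant such as `EXIT_REFUSE_FENCE_UNSTABLE` is still referenced is to count its occurrences in the script. The helper below is a minimal sketch and not part of the PR; `find_unused_exit_codes` is a hypothetical name, and it assumes each constant is assigned at the start of a line, as in the hunk above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print every EXIT_* constant that
# is assigned in the given script but never referenced anywhere else in it.
find_unused_exit_codes() {
  local script="$1" name count
  # Collect each name assigned as EXIT_SOMETHING=... at the start of a line.
  grep -oE '^[[:space:]]*EXIT_[A-Z0-9_]+=' "$script" \
    | sed -E 's/^[[:space:]]*//; s/=$//' \
    | sort -u \
    | while read -r name; do
        # Count whole-word occurrences; a count of 1 is the assignment itself.
        count="$(grep -ow -- "$name" "$script" | wc -l)"
        if (( count <= 1 )); then
          printf '%s\n' "$name"
        fi
      done
}
```

Running `find_unused_exit_codes fencing_validator.sh` would print one unused constant per line, or nothing if every constant is referenced at least once beyond its definition.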
usage() {
cat <<'EOF'
Usage:
fencing-validator [--user <ssh-user>] [--ssh-key <path>]
We need to update this to match the actual command name, either `fencing_validator` or `fencing-validator`.
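One way to keep the usage text in sync with whatever name the script ends up installed under is to derive the name from `$0` instead of hard-coding it. This is only a sketch: the `PROG` variable is an assumption introduced here, and the usage body shows just the options visible in the hunk above.

```shell
#!/usr/bin/env bash
# Sketch: take the program name from the invocation path, so the usage text
# follows the installed file name. PROG is a hypothetical variable name.
PROG="${0##*/}"

usage() {
  # The EOF delimiter is unquoted on purpose so that ${PROG} is expanded.
  cat <<EOF
Usage:
  ${PROG} [--user <ssh-user>] [--ssh-key <path>]
EOF
}
```

The trade-off is that an unquoted here-document expands every `$` and backtick in the usage text, so any literal ones in the full help output would need escaping.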
EOF
}

log(){ printf '\033[36m[INFO]\033[0m %s\n' "$*"; }
For these log statements, we should prefix the function names with `log_` so their purpose is clear:
log_info # or keep as log since it's the default behavior
log_warn
log_err
log_ok
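The suggested naming could look roughly like the sketch below. Only the `[INFO]` format (cyan, ANSI code 36) comes from the hunk above; the other labels and colors, and the shared `_log` helper, are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of log_-prefixed helpers. Only the [INFO] line format is taken from
# the PR; the WARN/ERROR/OK labels and their colors are assumed.
_log() {
  local color="$1" label="$2"
  shift 2
  printf '\033[%sm[%s]\033[0m %s\n' "$color" "$label" "$*"
}

log_info() { _log 36 INFO "$@"; }   # cyan, as in the original log()
log_warn() { _log 33 WARN "$@"; }   # yellow (assumed)
log_err()  { _log 31 ERROR "$@"; }  # red (assumed)
log_ok()   { _log 32 OK "$@"; }     # green (assumed)

# Keep the short name as an alias for the default level.
log() { log_info "$@"; }
```

Keeping `log` as a thin alias means existing call sites continue to work while new code uses the explicit names.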
fi
}

etcd_two_started() {
I'm a bit confused by the `two` part of this name. The function seems to be called with both NODE_A and NODE_B as input, so what does `two` refer to?

etcd_two_started() {
local tgt="$1" out rc
out="$(host_run "$tgt" "podman exec etcd sh -lc 'ETCDCTL_API=3 etcdctl member list -w table'" 2>&1)"; rc=$?
It seems like we're doing a lot of logic based on this output. Can we fetch it as JSON and use jq to validate it?
fi

for ip in "$IP_A" "$IP_B"; do
awk -F'|' -v ip="$ip" '
If we have the output as JSON, we should be able to check that both IPs exist with jq here. WDYT?
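A jq-based version of that check might look like the sketch below. It assumes `jq` is available where the check runs, that `etcdctl member list -w json` returns a top-level `members` array whose entries carry `name` and `clientURLs`, that a member with an empty or missing `name` has not started yet, and that the addresses are IPv4. `etcd_members_started` is a hypothetical name, not a function from the PR.

```shell
#!/usr/bin/env bash
# Sketch (not the PR's implementation): read `etcdctl member list -w json`
# output on stdin and succeed only if every IP passed as an argument belongs
# to a member that has a non-empty name, i.e. one that has started.
etcd_members_started() {
  local json ip
  json="$(cat)"
  for ip in "$@"; do
    # jq -e exits non-zero when the filter yields false or null.
    jq -e --arg ip "$ip" '
      any(.members[]?;
          ((.name // "") != "")
          and any(.clientURLs[]?; contains("//" + $ip + ":")))
    ' <<<"$json" >/dev/null || return 1
  done
}
```

With the helpers from the hunk above, the call could then be along the lines of `host_run "$tgt" "podman exec etcd sh -lc 'ETCDCTL_API=3 etcdctl member list -w json'" | etcd_members_started "$IP_A" "$IP_B"`, which replaces both the table parsing and the per-IP `awk` loop.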
return 1
}

wait_etcd(){
Maybe we should name this function in line with the other etcd functions: `etcd_wait`, `etcd_ready`, or `etcd_started`.
fence "$PCMK_B"
wait_not_ready "$NODE_B"; wait_ready "$NODE_B"; wait_etcd; check_daemon_status || exit $EXIT_DAEMONS_BAD

ok "Disruptive validation PASSED"
\ No newline at end of file
Let's add a newline here. It's typically not an issue, but just in case.
What I did
Added a new MachineConfig template file under templates/master/00-master/two-node-with-fencing/files/ that installs the fencing_validator.sh script to /usr/local/bin/ on control-plane nodes for Two-Node Fencing clusters.
How to verify it
Deploy a Two-Node Fencing cluster.
Verify the MachineConfig for masters includes the new file.
On a master node, run:
oc debug node/ -- chroot /host ls -l /usr/local/bin/fencing_validator.sh
oc debug node/ -- chroot /host /usr/local/bin/fencing_validator.sh --help
The script should be present, executable (0755), and runnable.
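The presence and mode checks above could also be scripted. The following is a minimal sketch meant to run on the node itself; `check_installed` is a hypothetical helper, and it assumes GNU `stat` (the `-c '%a'` form) is available there.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm a file exists and has the expected octal mode.
# Assumes GNU coreutils stat, which prints the mode with -c '%a'.
check_installed() {
  local path="$1" want_mode="${2:-755}" mode
  if [ ! -f "$path" ]; then
    printf 'missing: %s\n' "$path" >&2
    return 1
  fi
  mode="$(stat -c '%a' "$path")" || return 1
  if [ "$mode" != "$want_mode" ]; then
    printf 'unexpected mode %s (wanted %s): %s\n' "$mode" "$want_mode" "$path" >&2
    return 1
  fi
  printf 'ok: %s (mode %s)\n' "$path" "$mode"
}
```

For example, `check_installed /usr/local/bin/fencing_validator.sh` returns non-zero if the file is absent or not mode 755.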
Ship /usr/local/bin/fencing_validator.sh via MCO for Two-Node Fencing clusters.