Updated TNF EP to address some drift from original requirements.
- Updated the warning about the baremetal platform including a BMC block
- Updated test section to note that we'll skip requirements criteria if no requirements are provided
- Added a new block that explains the PacemakerCluster API, the status collector, and the health check controller
|[Feature Gates](#feature-gate-changes)| Add a new `DualReplicaTopology` feature which can be enabled via the `CustomNoUpgrade` feature set |
-|[OpenShift API](#openshift-api-changes)| Add `DualReplica` as a new value for `ControlPlaneTopology`|
-|[ETCD Operator](#etcd-operator-changes)| Add a mode to stop managing the etcd container, a new scaling strategy, and new TNF controller for initializing pacemaker|
+|[OpenShift API](#openshift-api-changes)| Add `DualReplica` as a new value for `ControlPlaneTopology` and a `PacemakerCluster` CRD for CEO health checking|
+|[ETCD Operator](#etcd-operator-changes)| Add an external etcd mode, a new scaling strategy, a new TNF controller for initializing pacemaker, and a pacemaker health checker|
|[Install Config](#install-config-changes)| Update install config API to accept fencing credentials in the control plane for `platform: None` and `platform: Baremetal`|
|[Installer](#installer-changes)| Populate the nodes with initial pacemaker configuration when deploying with 2 control-plane nodes and no arbiter |
|[MCO](#mco-changes)| Add an MCO extension for installing pacemaker and corosync in RHCOS; MachineConfigPool maxUnavailable set to 1 |
@@ -317,6 +317,9 @@ In the future, it may be possible to lower the privilege level of the TNF contro
to run without root privileges. We are working with the RHEL-HA team to identify the specific set of commands that we use so that we can narrow the scope of required privileges as progress towards this goal. This remains a long-term
objective for both teams.

+##### The PacemakerCluster Health Check
+
+See [Status Propagation with PacemakerCluster Health Check](#status-propagation-with-pacemakercluster-health-check)

#### Install Config Changes

In order to initialize pacemaker with valid fencing credentials, they will be consumed by the installer via the installation config and created on the cluster as a cluster secret.
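To make that flow concrete, the following is a hedged sketch of how the fencing credentials for one host might land on the cluster as a secret. The secret name, namespace, and key layout are assumptions for illustration only, not a finalized schema.
```
# Hypothetical secret created by the installer from the install-config fencing credentials.
# Name, namespace, and keys are assumptions, not a finalized schema.
apiVersion: v1
kind: Secret
metadata:
  name: fencing-credentials-master-0   # assumed naming convention
  namespace: openshift-etcd            # assumed namespace
type: Opaque
stringData:
  address: https://<redfish-api-url>
  username: <username>
  password: <password>
```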
@@ -382,53 +385,7 @@ sshKey: ''
```

Unfortunately, Bare Metal Operator already has an API that accepts BMC credentials as part of configuring BareMetalHost CRDs. Adding BMC credentials to the BareMetalHost CRD allows the Baremetal
-Operator to manage the power status of that host via ironic. This is **strictly incompatible** with TNF because both the Bare Metal Operator and the pacemaker fencing agent will have control over the
-machine state.
-
-This example shows an **invalid** install configuration that the installer will reject for TNF.
-```
-apiVersion: v1
-baseDomain: example.com
-compute:
-- name: worker
-  replicas: 0
-controlPlane:
-  name: master
-  replicas: 2
-  fencing:
-    credentials:
-    - hostname: <control-0-hostname>
-      address: https://<redfish-api-url>
-      username: <username>
-      password: <password>
-    - hostname: <control-1-hostname>
-      address: https://<redfish-api-url>
-      username: <username>
-      password: <password>
-metadata:
-  name: <cluster-name>
-platform:
-  baremetal:
-    apiVIPs:
-    - <api_ip>
-    ingressVIPs:
-    - <wildcard_ip>
-    hosts:
-    - name: openshift-cp-0
-      role: master
-      bmc:
-        address: ipmi://<out_of_band_ip>
-        username: <username>
-        password: <password>
-    - name: openshift-cp-1
-      role: master
-      bmc:
-        address: ipmi://<out_of_band_ip>
-        username: <username>
-        password: <password>
-pullSecret: ''
-sshKey: ''
-```
+Operator to manage the power status of that host via ironic. To work around this, we detach the control-plane nodes from ironic once they are provisioned by adding the detached annotation (`baremetalhost.metal3.io/detached: ""`).
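For reference, this is a hedged sketch of what that looks like on a BareMetalHost once the node is detached; the host name, namespace, and BMC details are placeholders.
```
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: openshift-cp-0                 # placeholder host name
  namespace: openshift-machine-api
  annotations:
    # With this annotation set, the Bare Metal Operator stops managing the host's
    # power state through ironic, leaving power control to the pacemaker fencing agent.
    baremetalhost.metal3.io/detached: ""
spec:
  online: true
  bmc:
    address: ipmi://<out_of_band_ip>
    credentialsName: openshift-cp-0-bmc-secret   # placeholder secret name
```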

##### Why don't we reuse the existing APIs in the `Baremetal` platform?
Reusing the existing APIs tightly couples separate outcomes that are important to distinguish for the end user.
@@ -708,6 +665,35 @@ This collection of diagrams collects a series of scenarios where both nodes fail
+#### Status Propagation with PacemakerCluster Health Check
+
+An important goal of Two Node OpenShift with Fencing is ensuring that, whenever we have the information we need to do so, we always warn the user before a disaster event occurs that would require manual
+intervention. An example of this would be if the cluster administrator rotated their BMC password without updating the fencing secret in the cluster. This would be caught by the pacemaker monitoring
+checks, but something in the cluster would need to be aware of that to propagate that information to the user directly.
+
+To achieve this, we plan on using a pair of new controllers in CEO. The first is a status collector, which syncs every 30 seconds to gather the current state of pacemaker via `sudo pcs status xml`.
+This is parsed to populate a new status object called a "PacemakerCluster", which is a singleton resource that is created by CEO when the transition to an external etcd is completed.
+
+The `PacemakerCluster` resource provides the cluster with key information that CEO can use to determine the overall health of, and threats to, etcd. It consists of 5 basic building blocks (sketched below):
+- A summary of active nodes and resources
+- A list of nodes currently registered in pacemaker
+- A list of recent events recorded by the pacemaker resources
+- A list of recent fencing events performed by pacemaker
+- A dump of the full pacemaker XML. This is kept so that, in the case that the XML API is changed in a way that breaks the other fields, we can quickly deliver a fix for the breakage that parses the
+XML directly.
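As a point of reference, here is a hedged sketch of what such a PacemakerCluster singleton might look like. The API group/version, field names, and values are illustrative assumptions, not a finalized schema.
```
# Hypothetical PacemakerCluster resource; all names below are assumptions for illustration.
apiVersion: etcd.openshift.io/v1alpha1   # assumed API group/version
kind: PacemakerCluster
metadata:
  name: cluster                          # singleton, one per cluster
status:
  summary:
    nodesOnline: 2
    resourcesStarted: 6
  nodes:
  - name: master-0
    online: true
  - name: master-1
    online: true
  resourceEvents:
  - resource: etcd
    action: start
    node: master-1
    timestamp: "2025-01-01T12:00:00Z"
  fencingEvents:
  - target: master-1
    action: reboot
    completed: true
    timestamp: "2025-01-01T11:58:30Z"
  # Raw `pcs status xml` output, kept as a fallback if the parsed fields ever break.
  rawXML: |
    <pacemaker-result>...</pacemaker-result>
```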
+
+Once the PacemakerCluster object is populated, it is handled on the CEO side by a new pacemaker health check controller. This controller evaluates the status in the report and creates events in CEO for the following things (an example is sketched after the list):
+- Transitions between healthy and unhealthy pacemaker states
+- Errors for resources that are in an unhealthy state
+- Warnings for resource actions that have been taken on the cluster (e.g. starting/stopping etcd, kubelet, or redfish)
+- Warnings for fencing events that have happened on the cluster
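For illustration, and assuming these are surfaced as standard Kubernetes events, one fencing warning might look roughly like the following; the event name, reason, and message strings are assumptions.
```
# Hypothetical event recorded by the pacemaker health check controller for a fencing action.
apiVersion: v1
kind: Event
metadata:
  name: pacemakercluster.fencing.17a9c2   # assumed name
  namespace: openshift-etcd
type: Warning
reason: PacemakerFencingEvent             # assumed reason string
message: "pacemaker fenced node master-1 (reboot) at 2025-01-01T11:58:30Z"
involvedObject:
  apiVersion: etcd.openshift.io/v1alpha1  # matches the assumed PacemakerCluster sketch above
  kind: PacemakerCluster
  name: cluster
```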
+
+More importantly, it also sets the CEO's status to degraded if one of the following conditions is true (see the sketch after the list):
+- Not all resources and nodes are in their expected / healthy state
+- The PacemakerCluster status object is stale (hasn't been updated in the last 5 minutes)
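Here is a hedged sketch of how the stale-status case might surface on the etcd ClusterOperator; the reason and message strings are illustrative assumptions.
```
# Hypothetical Degraded condition set by CEO when the PacemakerCluster status is stale.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: etcd
status:
  conditions:
  - type: Degraded
    status: "True"
    reason: PacemakerClusterStatusStale   # assumed reason
    message: "PacemakerCluster status has not been updated in the last 5 minutes"
    lastTransitionTime: "2025-01-01T12:05:00Z"
```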
+
+Overall these health checks are almost entirely informational. The only time they are used outside of this event creation or operator status is to ensure that the nodes recorded in pacemaker match the
+nodes being added to the cluster during a node replacement event. This ensures that CEO can enforce that we replace the correct (failed) node in pacemaker as well as the cluster.
+
#### Running Two Node OpenShift with Fencing with a Failed Node

An interesting aspect of TNF is that should a node fail and remain in a failed state, the cluster recovery operation will allow the survivor to restart etcd as a cluster-of-one and resume normal
@@ -840,12 +826,13 @@ Disadvantages:
Pacemaker will be running as a system daemon and reporting errors about its various agents to the system journal. The question is, what is the best way to expose these to a cluster admin? A simple
example would be an issue where pacemaker discovers that its fencing agent can no longer talk to the BMC. What is the best way to raise this error to the cluster admin, such that they can see that
-their cluster may be at risk of failure if no action is taken to resolve the problem? In our current design, we'd likely need to explore what kinds of errors we can bubble up through existing
-cluster health APIs to see if something suitable can be reused.
+their cluster may be at risk of failure if no action is taken to resolve the problem?

For situations where we recognize a risk to etcd health if no action is taken, we plan on monitoring the pacemaker status via the TNF controller and setting CEO to degraded with a message to
explain the action(s) needed. This has the added benefit of ensuring that the installer fails during deployment if we cannot properly set up etcd under pacemaker.

+See the PacemakerCluster Health Check section above for more details.
+
## Test Plan

**Note:** *Section not required until targeted at a release.*
@@ -869,7 +856,7 @@ The initial release of TNF should aim to build a regression baseline.
| Test | Kubelet failure [^2]| A new TNF test to detect if the cluster recovers if kubelet fails. |
| Test | Failure in etcd [^2]| A new TNF test to detect if the cluster recovers if etcd fails. |
| Test | Valid PDBs | A new TNF test to verify that PDBs are set to the correct configuration |
-| Test | Conformant recovery | A new TNF test to verify recovery times for failure events are within the creteria defined in the requirements|
+| Test | Conformant recovery | A new TNF test to verify recovery times meet or beat requirements if requirements are set. |
| Test | Fencing health check | A new TNF test to verify that the [Fencing Health Check](#fencing-health-check) process is successful |
| Test | Replacing a control-plane node | A new TNF test to verify that you can replace a control-plane node in a 2-node cluster |
| Test | Certificate rotation with an unhealthy node | A new TNF test to verify certificate rotation on a cluster with an unhealthy node that rejoins after the rotation |