|
| 1 | +--- |
| 2 | +title: must-gather-operator-pvc-destination |
| 3 | +authors: |
| 4 | + - "@shivprakashmuley" |
| 5 | +reviewers: |
| 6 | + - "@TrilokGeer" |
| 7 | + - "@Prashanth684" |
| 8 | +approvers: |
| 9 | + - "@TrilokGeer" |
| 10 | + - "@Prashanth684" |
| 11 | +api-approvers: |
| 12 | + - "@TrilokGeer" |
| 13 | + - "@Prashanth684" |
| 14 | +creation-date: 2025-09-04 |
| 15 | +last-updated: 2025-09-04 |
| 16 | +tracking-link: |
| 17 | + - https://issues.redhat.com/browse/MG-68 |
| 18 | + |
| 19 | +--- |
| 20 | + |
| 21 | +# Must-Gather Operator: PVC destination for gathered data |
| 22 | + |
| 23 | +## Release Signoff Checklist |
| 24 | + |
| 25 | +- [ ] Enhancement is `implementable` |
| 26 | +- [ ] Design details are appropriately documented from clear requirements |
| 27 | +- [ ] Test plan is defined |
| 28 | +- [ ] Graduation criteria for dev preview, tech preview, GA |
| 29 | +- [ ] User-facing documentation is created in `openshift-docs` |
| 30 | + |
| 31 | +## Summary |
| 32 | + |
| 33 | +Introduce an optional field in the `MustGather` custom resource that allows specifying a PersistentVolumeClaim (PVC) as a destination for gathered artifacts. When a PVC is specified, the must-gather operator mounts it into the gather pod, persisting all content written to `/must-gather` to the PVC. If no PVC is specified, the operator continues to use ephemeral storage (`emptyDir`). |
| 34 | + |
| 35 | +## Motivation |
| 36 | + |
| 37 | +### Current Behavior and Problem Statement |
| 38 | + |
| 39 | +The must-gather operator currently relies on ephemeral storage (`emptyDir` volumes) to store the data it collects. This means the gathered artifacts are tied directly to the lifecycle of the gather pod. This approach presents several significant challenges: |
| 40 | + |
| 41 | +- **Data Loss:** Since the storage is ephemeral, any data collected is permanently lost if the gather pod is evicted, crashes, or is deleted for any reason. This makes the collection process fragile and unreliable, especially in unstable clusters where it is most needed. |
| 42 | +- **Storage Capacity Limitations:** Ephemeral `emptyDir` volumes are constrained by the storage capacity of the underlying node. For large-scale data collection, such as gathering extensive logs or coredumps, `must-gather` can easily exhaust this limited space, causing the collection to fail. |
| 43 | + |
| 44 | + |
| 45 | +### Proposed Solution |
| 46 | + |
| 47 | +This enhancement addresses these issues by enabling the use of a PersistentVolumeClaim (PVC) for storing all `must-gather` artifacts. By writing directly to a persistent volume, we fundamentally change how data is managed: |
| 48 | + |
| 49 | +- **Ensured Data Durability:** The lifecycle of the collected data is decoupled from the gather pod. Artifacts are safely stored on the PVC, surviving pod failures and enabling reliable data collection. |
| 50 | +- **Simplified Data Access:** Once the `must-gather` job completes, the data is immediately available on the PVC for analysis, processing, or retrieval. |
| 51 | +- **Scalable Data Collection:** Users can leverage persistent storage solutions that are not limited by a single node's capacity. This allows for the collection of much larger datasets without fear of failure due to storage limits. |
| 52 | +- **Streamlined Artifact Management:** PVCs can be managed with standard Kubernetes storage policies, such as StorageClasses, quotas, and retention policies. This simplifies long-term retention, backup, and access control for `must-gather` artifacts. |
| 53 | + |
| 54 | +### User Stories |
| 55 | + |
| 56 | +- As a cluster administrator, I want must-gather output stored on a pre-provisioned PVC so that I can collect large datasets without failing due to ephemeral volume limits. |
| 57 | +- As a support engineer, I want artifacts retained on a PVC with a defined retention policy so that I can audit and compare multiple runs. |
| 58 | +- As a developer, I want to specify a subPath for runs so that I can organize multiple collections on a single PVC. |
| 59 | + |
| 60 | + |
| 61 | +### Goals |
| 62 | + |
| 63 | +- Introduce an optional `storage` field in the `MustGather` CRD to support PVC-backed storage for must-gather runs. |
| 64 | +- If PVC is configured, ensure the gather container writes directly into the PVC by mounting it at `/must-gather`. |
| 65 | +- If no storage is specified, continue to use ephemeral storage as the default. |
| 66 | + |
| 67 | +### Non-Goals |
| 68 | + |
| 69 | +- Automatic creation or lifecycle management of PVCs (out of scope for this enhancement; users bring or manage the PVC). |
| 70 | +- Remote copies/exports (e.g., to object storage); this enhancement only covers writing to a PVC. |
| 71 | + |
| 72 | +## Proposal |
| 73 | + |
| 74 | +- Introduce an optional `spec.storage` section in the `MustGather` CRD. When specified, it must contain a `type` field (only `PersistentVolume` is supported) and a `persistentVolume` configuration. |
| 75 | +- If `spec.storage` is configured, the controller mounts the referenced PVC at `/must-gather` and optionally uses `subPath` to organize runs. |
| 76 | +- If `spec.storage` is not provided, the operator defaults to using an `emptyDir` volume for ephemeral storage, preserving the existing behavior. |
| 77 | + |
| 78 | +### Workflow Description |
| 79 | + |
| 80 | +When a user wants to use persistent storage, the workflow is as follows: |
| 81 | +1. User creates a PVC in the same namespace as the `MustGather` resource (pre-provisioned by storage admin or dynamic provisioner). |
| 82 | +2. User applies a `MustGather` resource with `spec.storage.type: PersistentVolume`, referencing the PVC. |
| 83 | +3. The operator schedules the gather pod and mounts the PVC at `/must-gather`. |
| 84 | +4. The gather container runs as-is and writes its output to the mounted path. |
| 85 | +5. On completion, artifacts are available on the PVC for subsequent retrieval or processing. |
| 86 | + |
| 87 | +### Workflow for Ephemeral Storage (Default) |
| 88 | + |
| 89 | +1. User applies a `MustGather` resource without a `spec.storage` section. |
| 90 | +2. The operator schedules the gather pod using an `emptyDir` volume mounted at `/must-gather`. |
| 91 | +3. The gather container runs and writes its output to the `emptyDir` volume. |
| 92 | +4. Artifacts are available for the lifetime of the pod. |
| 93 | + |
| 94 | +Example `MustGather` CR: |
| 95 | + |
| 96 | +```yaml |
| 97 | +apiVersion: must-gather.openshift.io/v1alpha1 |
| 98 | +kind: MustGather |
| 99 | +metadata: |
| 100 | +  name: network-debug |
| 101 | +  namespace: must-gather |
| 102 | +spec: |
| 103 | +  storage: |
| 104 | +    type: PersistentVolume |
| 105 | +    persistentVolume: |
| 106 | +      claim: |
| 107 | +        name: mg-artifacts |
| 108 | +      # Optional: organize multiple runs in a single PVC |
| 109 | +      subPath: runs/2025-09-01T12-00Z |
| 110 | +  # Artifacts are written to /must-gather in the gather container |
| 111 | +``` |
| 112 | + |
| 113 | +### API Extensions |
| 114 | + |
| 115 | +This enhancement modifies the `MustGather` CRD schema to include a new `spec.storage` object that controls where artifacts are written. |
| 116 | + |
| 117 | +Proposed schema: |
| 118 | + |
| 119 | +```yaml |
| 120 | +spec: |
| 121 | +  type: object |
| 122 | +  properties: |
| 123 | +    storage: |
| 124 | +      type: object |
| 125 | +      required: |
| 126 | +        - type |
| 127 | +        - persistentVolume |
| 128 | +      properties: |
| 129 | +        type: |
| 130 | +          type: string |
| 131 | +          enum: |
| 132 | +            - PersistentVolume |
| 133 | +          description: "Select PersistentVolume for artifact storage" |
| 134 | +        persistentVolume: |
| 135 | +          type: object |
| 136 | +          properties: |
| 137 | +            claim: |
| 138 | +              type: object |
| 139 | +              properties: |
| 140 | +                name: |
| 141 | +                  type: string |
| 142 | +                  maxLength: 253 |
| 143 | +                  description: "PVC name in the same namespace" |
| 144 | +              required: |
| 145 | +                - name |
| 146 | +            # Optional fields |
| 147 | +            subPath: |
| 148 | +              type: string |
| 149 | +              description: "Optional subPath within the PVC to place artifacts" |
| 150 | +``` |
| 151 | + |
| 152 | +Behavioral notes: |
| 153 | + |
| 154 | +- The operator mounts the configured PVC at `/must-gather` in the gather container. |
| 155 | +- The PVC must reside in the same namespace as the `MustGather` resource. |
| 156 | + |
| 157 | +### Topology Considerations |
| 158 | + |
| 159 | +#### Hypershift / Hosted Control Planes |
| 160 | + |
| 161 | +This enhancement has no unique considerations for Hypershift. The must-gather operator runs in the guest cluster, and the PVC is expected to be available there. |
| 162 | + |
| 163 | +#### Standalone Clusters |
| 164 | + |
| 165 | +This change is relevant for standalone clusters. |
| 166 | + |
| 167 | +#### Single-node Deployments or MicroShift |
| 168 | + |
| 169 | +This proposal does not significantly affect the resource consumption of a single-node OpenShift deployment. It relies on the underlying storage infrastructure to provide the PVC. This is not applicable to MicroShift as `must-gather` is not a component of MicroShift. |
| 170 | + |
| 171 | +### Implementation Details/Notes/Constraints |
| 172 | + |
| 173 | +- Mount Strategy: Mount the PVC volume at `/must-gather`. |
| 174 | +- Multi-Container: Mount the same volume at the same path in every container of the gather pod. |
| 175 | +- Access Modes: Document that an RWO PVC may constrain the gather pod to the node where the volume is already attached; with RWX, any node can mount the volume. |
| 176 | +- Node Placement: The gather pod inherits default scheduling; PVC storage class/node affinity may implicitly constrain scheduling. |
| 177 | +- Cleanup: This enhancement does not delete or modify the PVC. Users manage lifecycle. |
| 178 | + |
| 179 | +#### Controller and Job template changes |
| 180 | + |
| 181 | +The must-gather operator currently renders a Kubernetes Job from a Go template (see job template for reference: [controllers/mustgather/template.go](https://github.com/openshift/must-gather-operator/blob/master/controllers/mustgather/template.go)). This enhancement requires the controller to alter the Job's volumes and volumeMounts based on `spec.storage`: |
| 182 | + |
| 183 | +- If `spec.storage` is provided and its type is `PersistentVolume`: |
| 184 | + - Replace the volume that backs the output path with a `persistentVolumeClaim` source using `persistentVolume.claim.name`. |
| 185 | + - Ensure the gather container's `volumeMounts` mounts that volume at `/must-gather`. |
| 186 | + - If `persistentVolume.subPath` is provided, set `subPath` on the `volumeMount`. |
| 187 | +- If `spec.storage` is not provided: |
| 188 | + - The operator will continue to use an `emptyDir` volume, preserving the current behavior. |
| 189 | + |
| 190 | +Illustrative YAML fragment of the Job spec when PVC is configured: |
| 191 | + |
| 192 | +```yaml |
| 193 | +spec: |
| 194 | +  template: |
| 195 | +    spec: |
| 196 | +      volumes: |
| 197 | +        - name: must-gather-out |
| 198 | +          persistentVolumeClaim: |
| 199 | +            claimName: <.spec.storage.persistentVolume.claim.name> |
| 200 | +      containers: |
| 201 | +        - name: gather |
| 202 | +          volumeMounts: |
| 203 | +            - name: must-gather-out |
| 204 | +              mountPath: /must-gather |
| 205 | +              # only set when provided |
| 206 | +              subPath: <.spec.storage.persistentVolume.subPath> |
| 207 | +``` |
| 208 | + |
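The branching described above can be sketched in Go as follows. This is an illustration under stated assumptions, not the operator's code: the `Volume` and `VolumeMount` structs are simplified local stand-ins for the `k8s.io/api/core/v1` types so the sketch compiles on its own, and `outputVolume` is a hypothetical helper name.

```go
package main

import "fmt"

// Simplified stand-ins for the proposed API types.
type ClaimRef struct{ Name string }

type PersistentVolumeConfig struct {
	Claim   ClaimRef
	SubPath string
}

type Storage struct {
	Type             string
	PersistentVolume PersistentVolumeConfig
}

// Simplified stand-ins for core/v1 Volume and VolumeMount (assumption:
// a real controller would populate the upstream types instead).
type Volume struct {
	Name      string
	ClaimName string // empty means the volume is backed by emptyDir
}

type VolumeMount struct {
	Name, MountPath, SubPath string
}

const (
	outputVolumeName = "must-gather-out"
	outputMountPath  = "/must-gather"
)

// outputVolume (hypothetical) returns the volume and mount that back
// /must-gather. A nil storage keeps the existing emptyDir behavior.
func outputVolume(s *Storage) (Volume, VolumeMount) {
	vol := Volume{Name: outputVolumeName}
	mount := VolumeMount{Name: outputVolumeName, MountPath: outputMountPath}
	if s != nil && s.Type == "PersistentVolume" {
		vol.ClaimName = s.PersistentVolume.Claim.Name
		mount.SubPath = s.PersistentVolume.SubPath // stays empty when unset
	}
	return vol, mount
}

func main() {
	vol, mount := outputVolume(&Storage{
		Type: "PersistentVolume",
		PersistentVolume: PersistentVolumeConfig{
			Claim:   ClaimRef{Name: "mg-artifacts"},
			SubPath: "runs/2025-09-01T12-00Z",
		},
	})
	fmt.Printf("%+v\n%+v\n", vol, mount)
}
```

Keeping the volume name and mount path identical in both branches means the gather container itself needs no changes.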
| 209 | +### Risks and Mitigations |
| 210 | + |
| 211 | +- Incorrect AccessMode: Scheduling or mount may fail; expose clear status conditions and events. |
| 212 | +- PVC Pending/Unbound: The controller waits and surfaces a `PVCNotBound` condition; document that the PVC must exist and be bound. |
| 213 | +- Insufficient Capacity (ENOSPC): Collection may fail when the PVC fills; surface a `Failed` condition with reason; recommend sizing guidance and quotas. |
| 214 | +- SubPath misuse: Using a `subPath` already populated may overwrite data; document best practices and recommend unique run directories. |
| 215 | +- Namespace mismatch: PVC must be in the same namespace; validate and surface a `ValidationFailed` condition if not. |
| 216 | +- Cleanup/retention: Artifacts persist on PVC; document user responsibility for retention and provide guidance for lifecycle policies. |
| 217 | + |
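The PVC-related mitigations above amount to a gate the controller evaluates before starting the gather Job. A minimal sketch of that gate follows; `pvcState` and `storageGate` are hypothetical names, the phase strings are the standard PVC phases, and the reason strings are the ones this proposal mentions.

```go
package main

import "fmt"

// pvcState is a hypothetical, trimmed-down view of a PVC lookup result.
type pvcState struct {
	Found bool
	Phase string // standard PVC phases: "Pending", "Bound", "Lost"
}

// storageGate reports whether the gather Job may start and, if not,
// which condition reason to surface on the MustGather resource.
func storageGate(p pvcState) (ready bool, reason string) {
	switch {
	case !p.Found:
		return false, "PVCNotFound"
	case p.Phase != "Bound":
		return false, "PVCNotBound"
	default:
		return true, ""
	}
}

func main() {
	states := []pvcState{
		{Found: false},
		{Found: true, Phase: "Pending"},
		{Found: true, Phase: "Bound"},
	}
	for _, p := range states {
		ready, reason := storageGate(p)
		fmt.Printf("%+v -> ready=%t reason=%q\n", p, ready, reason)
	}
}
```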
| 218 | +### Drawbacks |
| 219 | + |
| 220 | +- Users must manage PVC lifecycle and capacity planning. |
| 221 | +- Potential for misconfiguration (e.g., wrong access mode) causing gather delays. |
| 222 | + |
| 223 | +### Output Format |
| 224 | + |
| 225 | +Unchanged. Must-gather images continue writing under `/must-gather`; directory structure is preserved, now backed by a PVC when configured. |
| 226 | + |
| 227 | +## Test Plan |
| 228 | + |
| 229 | +- Unit tests for CRD defaulting/validation of `spec.storage.persistentVolume`. |
| 230 | +- E2E tests: |
| 231 | + - Happy path: Pre-created PVC (RWO), must-gather completes, artifacts present on the PVC. |
| 232 | + - With `subPath`: Artifacts appear under the provided subpath. |
| 233 | + - PVC Pending: Operator does not start gather until bound. |
| 234 | + |
| 235 | +## Graduation Criteria |
| 236 | + |
| 237 | +### Dev Preview -> Tech Preview |
| 238 | + |
| 239 | +- Ability to utilize the enhancement end to end |
| 240 | +- End user documentation, relative API stability |
| 241 | +- Sufficient test coverage |
| 242 | +- Gather feedback from users rather than just developers |
| 243 | + |
| 244 | +### Tech Preview -> GA |
| 245 | + |
| 246 | +- More testing (upgrade, downgrade, scale) |
| 247 | +- Sufficient time for feedback |
| 248 | +- Available by default |
| 249 | +- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/) |
| 250 | + |
| 251 | +### Removing a deprecated feature |
| 252 | + |
| 253 | +Not applicable. |
| 254 | + |
| 255 | +## Upgrade / Downgrade Strategy |
| 256 | + |
| 257 | +- This change is backward compatible. |
| 258 | +- Existing `MustGather` resources that do not have the `storage` field will continue to work as before, using ephemeral `emptyDir` storage. |
| 259 | +- New `MustGather` resources can optionally include the `storage` field to use a PVC. |
| 260 | +- On downgrade, older operators ignore the `storage` field. The CRD may still carry the field, but the old operator does not read it, so gathers behave as if it were unset and use `emptyDir`. |
| 261 | + |
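The downgrade behavior relies on how Go's `encoding/json` treats unknown keys by default: they are dropped without an error. The sketch below illustrates this with a stand-in `oldSpec` type; the `Other` field and the `decodeOld` helper are purely illustrative and do not correspond to real operator code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldSpec stands in for the spec type compiled into an operator build that
// predates spec.storage. The Other field is a placeholder for illustration.
type oldSpec struct {
	Other string `json:"other,omitempty"`
}

// decodeOld decodes a stored spec the way encoding/json does by default:
// keys with no matching struct field are silently ignored.
func decodeOld(raw []byte) (oldSpec, error) {
	var s oldSpec
	err := json.Unmarshal(raw, &s)
	return s, err
}

func main() {
	raw := []byte(`{"other":"x","storage":{"type":"PersistentVolume","persistentVolume":{"claim":{"name":"mg-artifacts"}}}}`)
	s, err := decodeOld(raw)
	fmt.Printf("%+v err=%v\n", s, err)
}
```

A decoder configured with `DisallowUnknownFields` would reject the same input, so this behavior holds only for default decoding.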
| 262 | +## Version Skew Strategy |
| 263 | + |
| 264 | +This enhancement does not introduce any version skew concerns. The change is self-contained within the must-gather operator and its CRD. |
| 265 | + |
| 266 | +## Operational Aspects of API Extensions |
| 267 | + |
| 268 | +The `MustGather` CRD is the only API extension, and the operator manages its lifecycle. Because the operator does not provision PVCs, a missing or unbound PVC, or insufficient permissions to mount it, is surfaced as a status condition on the `MustGather` resource. |
| 269 | + |
| 270 | +## Support Procedures |
| 271 | + |
| 272 | +If a `must-gather` run fails, support personnel should first inspect the `MustGather` resource's status and events to check for PVC-related errors (e.g., `PVCNotFound`, `PVCNotBound`). If the PVC is correctly bound, standard `must-gather` debugging procedures apply by inspecting the gather pod's logs. |
| 273 | + |
| 274 | +## Implementation History |
| 275 | + |
| 276 | +- 2025-09-04: Initial proposal. |
| 277 | + |
| 278 | +## Alternatives (Not Implemented) |
| 279 | + |
| 280 | + |
| 281 | +## Infrastructure Needed |
| 282 | + |
| 283 | +- None beyond a Kubernetes storage class capable of provisioning PVCs appropriate for cluster size and expected artifact volume. |
| 284 | + |
| 285 | +### MustGather Spec (illustrative) |
| 286 | + |
| 287 | +Spec fields overview: |
| 288 | + |
| 289 | +```go |
| 290 | +// +kubebuilder:validation:Enum=PersistentVolume |
| 291 | +type StorageType string |
| 292 | + |
| 293 | +const ( |
| 294 | + StorageTypePersistentVolume StorageType = "PersistentVolume" |
| 295 | +) |
| 296 | + |
| 297 | +type MustGatherSpec struct { |
| 298 | + // +optional |
| 299 | + Storage *Storage `json:"storage,omitempty"` |
| 300 | +} |
| 301 | + |
| 302 | +type Storage struct { |
| 303 | + // +required |
| 304 | + Type StorageType `json:"type"` |
| 305 | + // +required |
| 306 | + PersistentVolume PersistentVolumeConfig `json:"persistentVolume"` |
| 307 | +} |
| 308 | + |
| 309 | +type PersistentVolumeConfig struct { |
| 310 | + // +required |
| 311 | + Claim PersistentVolumeClaimReference `json:"claim"` |
| 312 | + // +optional |
| 313 | + SubPath string `json:"subPath,omitempty"` |
| 314 | +} |
| 315 | + |
| 316 | +type PersistentVolumeClaimReference struct { |
| 317 | + // +kubebuilder:validation:MaxLength=253 |
| 318 | + // +kubebuilder:validation:XValidation:rule="!format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character." |
| 319 | + // +required |
| 320 | + Name string `json:"name"` |
| 321 | +} |
| 322 | +``` |
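The `MaxLength` and `XValidation` markers on `Name` are enforced by the API server. For client-side pre-checks or unit tests, an approximately equivalent check can be sketched in plain Go. The regular expression below is an assumption that encodes the usual RFC 1123 subdomain rule (lowercase alphanumerics, `-` and `.`, alphanumeric at both ends of each label); `validClaimName` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxSubdomainLen matches the MaxLength marker on the Name field.
const maxSubdomainLen = 253

// subdomainRE approximates the RFC 1123 subdomain rule: dot-separated
// labels of lowercase alphanumerics and '-', each starting and ending
// with an alphanumeric character.
var subdomainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validClaimName (hypothetical) reports whether name would be accepted
// as a PVC name under the length and format constraints above.
func validClaimName(name string) bool {
	return len(name) > 0 && len(name) <= maxSubdomainLen && subdomainRE.MatchString(name)
}

func main() {
	for _, n := range []string{"mg-artifacts", "a.b-c", "MG_Artifacts", "-bad"} {
		fmt.Printf("%q valid=%t\n", n, validClaimName(n))
	}
}
```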