Commit df1fada
Merge pull request #1839 from shivprakashmuley/pvc-enhancement
MG-68: Proposal for PVC configuration in MustGather spec
2 parents 7fad59d + c9ae922 commit df1fada

Lines changed: 322 additions & 0 deletions
@@ -0,0 +1,322 @@
---
title: must-gather-operator-pvc-destination
authors:
  - "@shivprakashmuley"
reviewers:
  - "@TrilokGeer"
  - "@Prashanth684"
approvers:
  - "@TrilokGeer"
  - "@Prashanth684"
api-approvers:
  - "@TrilokGeer"
  - "@Prashanth684"
creation-date: 2025-09-04
last-updated: 2025-09-04
tracking-link:
  - https://issues.redhat.com/browse/MG-68
---
20+
21+
# Must-Gather Operator: PVC destination for gathered data
22+
23+
## Release Signoff Checklist
24+
25+
- [ ] Enhancement is `implementable`
26+
- [ ] Design details are appropriately documented from clear requirements
27+
- [ ] Test plan is defined
28+
- [ ] Graduation criteria for dev preview, tech preview, GA
29+
- [ ] User-facing documentation is created in `openshift-docs`
30+
31+
## Summary
32+
33+
Introduce an optional field in the `MustGather` custom resource that specifies a PersistentVolumeClaim (PVC) as the destination for gathered artifacts. When a PVC is specified, the must-gather operator mounts it into the gather pod so that all content written to `/must-gather` persists on the PVC. When no PVC is specified, the operator defaults to ephemeral storage (`emptyDir`).
34+
35+
## Motivation
36+
37+
### Current Behavior and Problem Statement
38+
39+
The must-gather operator currently relies on ephemeral storage (`emptyDir` volumes) to store the data it collects. This means the gathered artifacts are tied directly to the lifecycle of the gather pod. This approach presents several significant challenges:
40+
41+
- **Data Loss:** Since the storage is ephemeral, any data collected is permanently lost if the gather pod is evicted, crashes, or is deleted for any reason. This makes the collection process fragile and unreliable, especially in unstable clusters where it is most needed.
42+
- **Storage Capacity Limitations:** Ephemeral `emptyDir` volumes are constrained by the storage capacity of the underlying node. For large-scale data collection, such as gathering extensive logs or coredumps, `must-gather` can easily exhaust this limited space, causing the collection to fail.
43+
44+
45+
### Proposed Solution
46+
47+
This enhancement addresses these issues by enabling the use of a PersistentVolumeClaim (PVC) for storing all `must-gather` artifacts. By writing directly to a persistent volume, we fundamentally change how data is managed:
48+
49+
- **Ensured Data Durability:** The lifecycle of the collected data is decoupled from the gather pod. Artifacts are safely stored on the PVC, surviving pod failures and enabling reliable data collection.
50+
- **Simplified Data Access:** Once the `must-gather` job completes, the data is immediately available on the PVC for analysis, processing, or retrieval.
51+
- **Scalable Data Collection:** Users can leverage persistent storage solutions that are not limited by a single node's capacity. This allows for the collection of much larger datasets without fear of failure due to storage limits.
52+
- **Streamlined Artifact Management:** PVCs can be managed with standard Kubernetes storage policies, such as StorageClasses, quotas, and retention policies. This simplifies long-term retention, backup, and access control for `must-gather` artifacts.
53+
54+
### User Stories
55+
56+
- As a cluster administrator, I want must-gather output stored on a pre-provisioned PVC so that I can collect large datasets without failing due to ephemeral volume limits.
57+
- As a support engineer, I want artifacts retained on a PVC with a defined retention policy so that I can audit and compare multiple runs.
58+
- As a developer, I want to specify a subPath for runs so that I can organize multiple collections on a single PVC.
59+
60+
61+
### Goals
62+
63+
- Introduce an optional `storage` field in the `MustGather` CRD to support PVC-backed storage for must-gather runs.
64+
- If a PVC is configured, ensure the gather container writes directly into the PVC by mounting it at `/must-gather`.
65+
- If no storage is specified, continue to use ephemeral storage as the default.
66+
67+
### Non-Goals
68+
69+
- Automatic creation or lifecycle management of PVCs (out of scope for this enhancement; users bring or manage the PVC).
70+
- Remote copies/exports (e.g., to object storage); this enhancement only covers writing to a PVC.
71+
72+
## Proposal
73+
74+
- Introduce an optional `spec.storage` section in the `MustGather` CRD. When specified, it must contain a `type` field (only `PersistentVolume` is supported) and a `persistentVolume` configuration.
75+
- If `spec.storage` is configured, the controller mounts the referenced PVC at `/must-gather` and optionally uses `subPath` to organize runs.
76+
- If `spec.storage` is not provided, the operator defaults to using an `emptyDir` volume for ephemeral storage, preserving the existing behavior.
77+
78+
### Workflow Description
79+
80+
When a user wants to use persistent storage, the workflow is as follows:
81+
1. User creates a PVC in the same namespace as the `MustGather` resource (pre-provisioned by storage admin or dynamic provisioner).
82+
2. User applies a `MustGather` resource with `spec.storage.type: PersistentVolume`, referencing the PVC.
83+
3. The operator schedules the gather pod and mounts the PVC at `/must-gather`.
84+
4. The gather container runs as-is and writes its output to the mounted path.
85+
5. On completion, artifacts are available on the PVC for subsequent retrieval or processing.
86+
87+
### Workflow for Ephemeral Storage (Default)
88+
89+
1. User applies a `MustGather` resource without a `spec.storage` section.
90+
2. The operator schedules the gather pod using an `emptyDir` volume mounted at `/must-gather`.
91+
3. The gather container runs and writes its output to the `emptyDir` volume.
92+
4. Artifacts are available for the lifetime of the pod.
93+
94+
Example `MustGather` CR that uses PVC-backed storage:
95+
96+
```yaml
apiVersion: must-gather.openshift.io/v1alpha1
kind: MustGather
metadata:
  name: network-debug
  namespace: must-gather
spec:
  storage:
    type: PersistentVolume
    persistentVolume:
      claim:
        name: mg-artifacts
      # Optional: organize multiple runs in a single PVC
      subPath: runs/2025-09-01T12-00Z
# Artifacts are written to /must-gather in the gather container
```
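
The `subPath` value in the example follows a timestamped naming scheme. As a minimal sketch, assuming callers want one directory per run, such a value could be generated as shown here. The function name `runSubPath`, the `runs/` prefix, and the minute granularity are illustrative assumptions, not part of the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// runSubPath returns a per-run directory name such as "runs/2025-09-01T12-00Z".
// The "runs/" prefix and the minute granularity are assumptions for this sketch.
func runSubPath(start time.Time) string {
	// Colons are replaced by dashes so the value reads as a plain directory name.
	return "runs/" + start.UTC().Format("2006-01-02T15-04Z")
}

// sampleRun is a fixed timestamp used only to keep the example reproducible.
var sampleRun = time.Date(2025, time.September, 1, 12, 0, 0, 0, time.UTC)

func main() {
	fmt.Println(runSubPath(sampleRun))
}
```

Two runs started within the same minute would collide under this scheme; a finer timestamp or a random suffix avoids that.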
112+
113+
### API Extensions
114+
115+
This enhancement modifies the `MustGather` CRD schema to include a new `spec.storage` object that controls where artifacts are written.
116+
117+
Proposed schema:
118+
119+
```yaml
spec:
  type: object
  properties:
    storage:
      type: object
      required:
        - type
        - persistentVolume
      properties:
        type:
          type: string
          enum:
            - PersistentVolume
          description: "Select PersistentVolume for artifact storage"
        persistentVolume:
          type: object
          properties:
            claim:
              type: object
              properties:
                name:
                  type: string
                  maxLength: 253
                  description: "PVC name in the same namespace"
              required:
                - name
            # Optional fields
            subPath:
              type: string
              description: "Optional subPath within the PVC to place artifacts"
```
151+
152+
Behavioral notes:
153+
154+
- The operator mounts the configured PVC at `/must-gather` in the gather container.
155+
- The PVC must reside in the same namespace as the `MustGather` resource.
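
The schema above leaves `subPath` as a free-form string. A minimal sketch of a pre-flight check the controller could apply is shown below, assuming the same constraints Kubernetes places on a volume mount `subPath` (relative path, no `..` segments). The function name `validateSubPath` is hypothetical and the proposal does not specify this check.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// validateSubPath is a hypothetical helper. It rejects values that would
// escape the volume root: absolute paths and paths with ".." segments.
func validateSubPath(subPath string) error {
	if subPath == "" {
		return nil // subPath is optional
	}
	if path.IsAbs(subPath) {
		return errors.New("subPath must be a relative path")
	}
	for _, segment := range strings.Split(subPath, "/") {
		if segment == ".." {
			return errors.New("subPath must not contain '..'")
		}
	}
	return nil
}

func main() {
	for _, candidate := range []string{"runs/2025-09-01T12-00Z", "/abs", "runs/../other"} {
		fmt.Printf("%q -> %v\n", candidate, validateSubPath(candidate))
	}
}
```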
156+
157+
### Topology Considerations
158+
159+
#### Hypershift / Hosted Control Planes
160+
161+
This enhancement has no unique considerations for Hypershift. The must-gather operator runs in the guest cluster, and the PVC is expected to be available there.
162+
163+
#### Standalone Clusters
164+
165+
This change is relevant for standalone clusters.
166+
167+
#### Single-node Deployments or MicroShift
168+
169+
This proposal does not significantly affect the resource consumption of a single-node OpenShift deployment. It relies on the underlying storage infrastructure to provide the PVC. This is not applicable to MicroShift as `must-gather` is not a component of MicroShift.
170+
171+
### Implementation Details/Notes/Constraints
172+
173+
- Mount Strategy: Mount the PVC Volume at `/must-gather`.
174+
- Multi-Container: Mount the same volume consistently across all containers in the gather pod.
175+
- Access Modes: Documentation should call out that a ReadWriteOnce (RWO) PVC may constrain the gather pod to the node where the volume is attached, whereas a ReadWriteMany (RWX) PVC can be mounted from any node.
176+
- Node Placement: The gather pod inherits default scheduling; PVC storage class/node affinity may implicitly constrain scheduling.
177+
- Cleanup: This enhancement does not delete or modify the PVC. Users manage lifecycle.
178+
179+
#### Controller and Job template changes
180+
181+
The must-gather operator currently renders a Kubernetes Job from a Go template (see job template for reference: [controllers/mustgather/template.go](https://github.com/openshift/must-gather-operator/blob/master/controllers/mustgather/template.go)). This enhancement requires the controller to alter the Job's volumes and volumeMounts based on `spec.storage`:
182+
183+
- If `spec.storage` is provided and its type is `PersistentVolume`:
  - Replace the volume that backs the output path with a `persistentVolumeClaim` source using `persistentVolume.claim.name`.
  - Ensure the gather container's `volumeMounts` mounts that volume at `/must-gather`.
  - If `persistentVolume.subPath` is provided, set `subPath` on the `volumeMount`.
- If `spec.storage` is not provided:
  - The operator will continue to use an `emptyDir` volume, preserving the current behavior.
189+
190+
Illustrative YAML fragment of the Job spec when a PVC is configured:
191+
192+
```yaml
spec:
  template:
    spec:
      volumes:
        - name: must-gather-out
          persistentVolumeClaim:
            claimName: <.spec.storage.persistentVolume.claim.name>
      containers:
        - name: gather
          volumeMounts:
            - name: must-gather-out
              mountPath: /must-gather
              # only set when provided
              subPath: <.spec.storage.persistentVolume.subPath>
```
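
The volume selection described in this section could be sketched in Go roughly as follows. This is a sketch under stated assumptions, not the operator's code: the `Volume` and `VolumeMount` structs are simplified stand-ins for the Kubernetes core/v1 types, the API structs are trimmed copies of the ones proposed in this document, and `outputVolume` is a hypothetical helper name.

```go
package main

import "fmt"

// Trimmed copies of the API types proposed in this document.
type PersistentVolumeClaimReference struct{ Name string }

type PersistentVolumeConfig struct {
	Claim   PersistentVolumeClaimReference
	SubPath string
}

type Storage struct {
	Type             string
	PersistentVolume PersistentVolumeConfig
}

// Simplified stand-ins for the core/v1 Volume and VolumeMount types.
type Volume struct {
	Name      string
	ClaimName string // empty means the volume is backed by emptyDir
}

type VolumeMount struct {
	Name      string
	MountPath string
	SubPath   string
}

const (
	outputVolumeName = "must-gather-out"
	outputMountPath  = "/must-gather"
)

// outputVolume is a hypothetical helper that picks the volume source for the
// output path: a PVC when spec.storage is set, emptyDir otherwise.
func outputVolume(storage *Storage) (Volume, VolumeMount) {
	vol := Volume{Name: outputVolumeName}
	mount := VolumeMount{Name: outputVolumeName, MountPath: outputMountPath}
	if storage != nil && storage.Type == "PersistentVolume" {
		vol.ClaimName = storage.PersistentVolume.Claim.Name
		// subPath is set on the mount only when the user provided one.
		mount.SubPath = storage.PersistentVolume.SubPath
	}
	return vol, mount
}

func main() {
	vol, mount := outputVolume(nil)
	fmt.Printf("default: %+v %+v\n", vol, mount)

	pvc := &Storage{
		Type: "PersistentVolume",
		PersistentVolume: PersistentVolumeConfig{
			Claim:   PersistentVolumeClaimReference{Name: "mg-artifacts"},
			SubPath: "runs/2025-09-01T12-00Z",
		},
	}
	vol, mount = outputVolume(pvc)
	fmt.Printf("pvc:     %+v %+v\n", vol, mount)
}
```

Keeping the volume name and mount path identical in both branches means the gather container itself needs no changes, which matches the "runs as-is" step in the workflow.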
208+
209+
### Risks and Mitigations
210+
211+
- Incorrect AccessMode: Scheduling or mount may fail; expose clear status conditions and events.
212+
- PVC Pending/Unbound: The controller waits and surfaces a `PVCNotBound` condition; document that the PVC must exist and be bound.
213+
- Insufficient Capacity (ENOSPC): Collection may fail when the PVC fills; surface a `Failed` condition with reason; recommend sizing guidance and quotas.
214+
- SubPath misuse: Using a `subPath` already populated may overwrite data; document best practices and recommend unique run directories.
215+
- Namespace mismatch: PVC must be in the same namespace; validate and surface a `ValidationFailed` condition if not.
216+
- Cleanup/retention: Artifacts persist on PVC; document user responsibility for retention and provide guidance for lifecycle policies.
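
As a rough sketch of how the PVC-related risks above might map onto status reasons, the function below translates an observed PVC state into a reason string. `PVCNotBound` comes from this list and `PVCNotFound` from the Support Procedures section; the `pvcState` struct and `pvcReason` function are hypothetical, and the phase strings mirror the Kubernetes PVC phases (`Pending`, `Bound`, `Lost`).

```go
package main

import "fmt"

// pvcState is a hypothetical summary of what the controller observed.
type pvcState struct {
	Found bool   // whether the PVC exists in the MustGather namespace
	Phase string // Kubernetes PVC phase: "Pending", "Bound", or "Lost"
}

// pvcReason returns a condition reason and whether the gather Job may start.
func pvcReason(s pvcState) (reason string, ready bool) {
	switch {
	case !s.Found:
		return "PVCNotFound", false
	case s.Phase != "Bound":
		return "PVCNotBound", false
	default:
		return "", true
	}
}

func main() {
	states := []pvcState{
		{Found: false},
		{Found: true, Phase: "Pending"},
		{Found: true, Phase: "Bound"},
	}
	for _, s := range states {
		reason, ready := pvcReason(s)
		fmt.Printf("%+v -> reason=%q ready=%v\n", s, reason, ready)
	}
}
```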
217+
218+
### Drawbacks
219+
220+
- Users must manage PVC lifecycle and capacity planning.
221+
- Potential for misconfiguration (e.g., wrong access mode) causing gather delays.
222+
223+
### Output Format
224+
225+
Unchanged. Must-gather images continue writing under `/must-gather`; directory structure is preserved, now backed by a PVC when configured.
226+
227+
## Test Plan
228+
229+
- Unit tests for CRD defaulting/validation of `spec.storage.persistentVolume`.
230+
- E2E tests:
  - Happy path: Pre-created PVC (RWO), must-gather completes, artifacts present on the PVC.
  - With `subPath`: Artifacts appear under the provided subpath.
  - PVC Pending: Operator does not start gather until bound.
234+
235+
## Graduation Criteria
236+
237+
### Dev Preview -> Tech Preview
238+
239+
- Ability to utilize the enhancement end to end
240+
- End user documentation, relative API stability
241+
- Sufficient test coverage
242+
- Gather feedback from users rather than just developers
243+
244+
### Tech Preview -> GA
245+
246+
- More testing (upgrade, downgrade, scale)
247+
- Sufficient time for feedback
248+
- Available by default
249+
- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)
250+
251+
### Removing a deprecated feature
252+
253+
Not applicable.
254+
255+
## Upgrade / Downgrade Strategy
256+
257+
- This change is backward compatible.
258+
- Existing `MustGather` resources that do not have the `storage` field will continue to work as before, using ephemeral `emptyDir` storage.
259+
- New `MustGather` resources can optionally include the `storage` field to use a PVC.
260+
- On downgrade, older operators ignore the `storage` field: the CRD still carries the field, but the older operator does not read it, so runs behave as if it were unset and fall back to ephemeral `emptyDir` storage.
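
The downgrade behavior relies on older code tolerating a field it does not know. The sketch below illustrates that general mechanism with Go's standard `encoding/json`, which drops unknown fields by default. It assumes the older operator decodes resources in a similarly non-strict way, which this proposal does not state explicitly. The `oldSpec` type and its `existingField` member are hypothetical placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldSpec stands in for a spec type compiled into an older operator that
// predates spec.storage. ExistingField is a hypothetical placeholder.
type oldSpec struct {
	ExistingField string `json:"existingField,omitempty"`
}

// decodeOldSpec decodes raw JSON into the older type. Unknown fields such as
// "storage" are silently dropped because encoding/json is non-strict by default.
func decodeOldSpec(raw []byte) (oldSpec, error) {
	var spec oldSpec
	err := json.Unmarshal(raw, &spec)
	return spec, err
}

func main() {
	raw := []byte(`{
		"existingField": "kept",
		"storage": {"type": "PersistentVolume", "persistentVolume": {"claim": {"name": "mg-artifacts"}}}
	}`)
	spec, err := decodeOldSpec(raw)
	fmt.Printf("spec=%+v err=%v\n", spec, err)
}
```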
261+
262+
## Version Skew Strategy
263+
264+
This enhancement does not introduce any version skew concerns. The change is self-contained within the must-gather operator and its CRD.
265+
266+
## Operational Aspects of API Extensions
267+
268+
The `MustGather` CRD is the only API extension, and the operator manages its lifecycle. A PVC that is missing, unbound, or not mountable because of incorrect permissions is surfaced as a status condition on the `MustGather` resource.
269+
270+
## Support Procedures
271+
272+
If a `must-gather` run fails, support personnel should first inspect the `MustGather` resource's status and events to check for PVC-related errors (e.g., `PVCNotFound`, `PVCNotBound`). If the PVC is correctly bound, standard `must-gather` debugging procedures apply by inspecting the gather pod's logs.
273+
274+
## Implementation History
275+
276+
- 2025-09-04: Initial proposal.
277+
278+
## Alternatives (Not Implemented)
279+
280+
281+
## Infrastructure Needed
282+
283+
- None beyond a Kubernetes storage class capable of provisioning PVCs appropriate for cluster size and expected artifact volume.
284+
285+
### MustGather Spec (illustrative)
286+
287+
Spec fields overview:
288+
289+
```go
// +kubebuilder:validation:Enum=PersistentVolume
type StorageType string

const (
	StorageTypePersistentVolume StorageType = "PersistentVolume"
)

type MustGatherSpec struct {
	// +optional
	Storage *Storage `json:"storage,omitempty"`
}

type Storage struct {
	// +required
	Type StorageType `json:"type"`
	// +required
	PersistentVolume PersistentVolumeConfig `json:"persistentVolume"`
}

type PersistentVolumeConfig struct {
	// +required
	Claim PersistentVolumeClaimReference `json:"claim"`
	// +optional
	SubPath string `json:"subPath,omitempty"`
}

type PersistentVolumeClaimReference struct {
	// +kubebuilder:validation:MaxLength=253
	// +kubebuilder:validation:XValidation:rule="!format.dns1123Subdomain().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
	// +required
	Name string `json:"name"`
}
```
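
The CEL rule on `Name` enforces the RFC 1123 subdomain format at admission time. For unit tests or client-side checks, an approximately equivalent check could be sketched in plain Go as below. The regular expression encodes the rule stated in the validation message, and `validClaimName` is a hypothetical helper that is not part of the proposed API.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxClaimNameLength mirrors the MaxLength=253 marker on the Name field.
const maxClaimNameLength = 253

// dns1123Subdomain matches dot-separated labels of lowercase alphanumerics
// and '-', where each label starts and ends with an alphanumeric character.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validClaimName is a hypothetical helper approximating the CEL rule above.
func validClaimName(name string) bool {
	return len(name) > 0 &&
		len(name) <= maxClaimNameLength &&
		dns1123Subdomain.MatchString(name)
}

func main() {
	for _, name := range []string{"mg-artifacts", "team.mg-artifacts", "MG_Artifacts", "-bad"} {
		fmt.Printf("%q valid=%v\n", name, validClaimName(name))
	}
}
```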
