-
Notifications
You must be signed in to change notification settings - Fork 193
Description
This is a proposal for introducing a new way of performing lifecycle management akin to Flux kustomize-controller operational mode.
The RetryOnFailure
strategy is suitable for statefulsets and other workloads that cannot tolerate rollbacks and have a high rollout duration susceptible to health check timeouts and transient capacity errors.
The RetryOnFailure
strategy will ensure that:
- An installation failure will not be retried immediately, instead, the controller will retry with an upgrade at a fixed interval defined in the HelmRelease.
- An upgrade failure will leave the Helm release in a failed state without performing any remediations.
- An upgrade failure will never result in a rollback or uninstall.
- Upgrade failures are always retried at a fixed interval defined in the HelmRelease.
API Changes
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
install:
strategy:
name: RetryOnFailure # defaults to RemediateOnFailure
retryInterval: 5m # only used for RetryOnFailure (defaults to 5m)
remediation: # ignored for RetryOnFailure
retries: 2
upgrade:
strategy:
name: RetryOnFailure # defaults to RemediateOnFailure
retryInterval: 5m # only used for RetryOnFailure (defaults to 5m)
remediation: # ignored for RetryOnFailure
retries: 2
strategy: rollback
When not specified, or when the strategy is set to RemediateOnFailure
, the lifecycle management works like before.
For installations, the RetryOnFailure
strategy will perform an uninstall on failure, then will rerun the installation after the specified retry interval.
For upgrades, the RetryOnFailure
strategy will behave like flux reconcile hr --force
when the remediation retries are set to 0.