feat: autorecover from stuck situations

gerrnot · gerrnot · commit 9cacda92fe1f · 2024-07-23T19:55:07.000+02:00
Signed-off-by: Gernot Feichter <gernot.feichter@bearingpoint.com>
diff --git a/hips/hip-9999.md b/hips/hip-9999.md
@@ -0,0 +1,102 @@
+---
+hip: 9999
+title: "Autorecover from stuck situations"
+authors: [ "Gernot Feichter <gernotfeichter@gmail.com>" ]
+created: "2024-07-12"
+type: "feature"
+status: "draft"
+helm-version: 3
+---
+
+## Abstract
+
+This proposal simplifies handling for both manual users and CI/CD pipelines
+by letting helm auto-recover from stuck deployments, which is currently not possible unless users implement
+boilerplate code around their helm invocations.
+
+## Motivation
+
+If a helm deployment fails, I want to be able to retry it,
+ideally by running the same command again to keep things simple.
+
+There are two known situations in which a retry will NOT work:
+1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`.
+2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`.
+
+Known workarounds for recovering from such situations, which should become OBSOLETE when this HIP is implemented:
+1. `kubectl delete secret '<the name of the secret where helm stores release information>'`. (Not possible if you don't want to lose all history)
+2. `helm delete` your release. (Not possible if you don't want to lose all history)
+3. `helm rollback` your release. (Not possible if it is the first installation)
+
+## Rationale
+
+The proposed solution uses a locking mechanism that is stored in k8s, so that every client knows whether a helm
+release is locked, whether it holds the lock itself, and for how long the lock is valid.
+
+It uses existing helm parameters like the --timeout parameter (which defaults to 5m) to determine for how long a helm release
+may be stuck in a pending state.
+
+## Specification
+
+The --timeout parameter gets a deeper meaning.
+Previously the --timeout parameter only had an effect on the helm process running on the respective client.
+After implementation, the --timeout parameter will be stored in the helm release object (secret) in k8s and
+have an indirect impact on possible parallel processes.
+
+`helm ls -a` shows two new columns, regular `helm ls` does NOT show those:
+- LOCKED TILL
+  <datetime> calculated by the helm client: k8s server time + timeout parameter value
+- SESSION ID
+  Unique, random session id generated by the client
+
+Furthermore, if the helm client process gets killed (SIGTERM), it tries to clear the LOCKED TILL and SESSION ID
+values and to set the release to a failed state before terminating, in order to free the lock.
+
+## Backwards compatibility
+
+It is assumed that the helm release object as stored in k8s will not break
+older clients if new fields are added while existing fields are untouched.
+
+Backwards compatibility will be tested during implementation!
+
+## Security implications
+
+The proposed solution should not have an impact on security.
+
+## How to teach this
+
+Since the way that helm is invoked is not altered, there will not be much to teach here.
+The usage of the timeout parameter is encouraged, but since the default timeout is already 5m, not even that
+needs to be encouraged.
+
+It should just reduce the amount of frustration when dealing with pending and failed helm releases.
+
+A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock).
+
+## Reference implementation
+
+helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations
+
+acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations
+
+## Rejected ideas
+
+None
+
+## Open issues
+
+[] HIP status `accepted`
+
+[x] Reference implementation
+
+[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts)
+
+[] Test for kill scenario (forever stuck in pending)
+
+[] Backwards compatibility check (looking good already)
+
+## References
+
+https://github.com/helm/helm/issues/7476
+https://github.com/rancher/rancher/issues/44530
+https://github.com/helm/helm/issues/11863