|
| 1 | +--- |
| 2 | +hip: 9999 |
| 3 | +title: "Autorecover from stuck situations" |
| 4 | +authors: [ "Gernot Feichter <[email protected]>" ] |
| 5 | +created: "2024-07-12" |
| 6 | +type: "feature" |
| 7 | +status: "draft" |
| 8 | +helm-version: 3 |
| 9 | +--- |
| 10 | + |
| 11 | +## Abstract |
| 12 | + |
| 13 | +This proposal lets both manual users and CI/CD pipelines |
| 14 | +auto-recover from stuck deployments, which is currently not possible unless users implement |
| 15 | +boilerplate code around their helm invocations. |
| 16 | + |
| 17 | +## Motivation |
| 18 | + |
| 19 | +If a helm deployment fails, I want to be able to retry it, |
| 20 | +ideally by running the same command again to keep things simple. |
| 21 | + |
| 22 | +There are two known situations in which a retry will NOT work: |
| 23 | +1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`. |
| 24 | +2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`. |
| 25 | + |
| 26 | +Known Workarounds that should become OBSOLETE when this HIP is implemented to recover from such situations: |
| 27 | +1. `kubectl delete secret '<the name of the secret where helm stores release information>'`. (Not possible if you don't want to lose all history) |
| 28 | +2. `helm delete` your release. (Not possible if you don't want to lose all history) |
| 29 | +3. `helm rollback` your release. (Not possible if it is the first installation) |
| 30 | + |
| 31 | +## Rationale |
| 32 | + |
| 33 | +The proposed solution uses a locking mechanism that is stored in k8s, such that all clients know whether a helm |
| 34 | +release is locked by themselves or not and for how long the lock is valid. |
| 35 | + |
| 36 | +It uses existing helm parameters like the --timeout parameter (which defaults to 5m) to determine for how long a helm release |
| 37 | +may be stuck in a pending state. |
| 38 | + |
| 39 | +## Specification |
| 40 | + |
| 41 | +The --timeout parameter gets a deeper meaning. |
| 42 | +Previously the --timeout parameter only had an effect on the helm process running on the respective client. |
| 43 | +After implementation, the --timeout parameter will be stored in the helm release object (secret) in k8s and |
| 44 | +have an indirect impact on possible parallel processes. |
| 45 | + |
| 46 | +`helm ls -a` shows two new columns, regular `helm ls` does NOT show those: |
| 47 | +- LOCKED TILL |
| 48 | + <datetime> calculated by the helm client: k8s server time + timeout parameter value |
| 49 | +- SESSION ID |
| 50 | + Unique, random session id generated by the client |
| 51 | + |
| 52 | +Furthermore, if the helm client process gets killed (SIGTERM), it tries to clear the LOCKED TILL value |
| 53 | +and the SESSION ID and to set the release into a failed state before terminating, in order to free the lock. |
| 54 | + |
| 55 | +## Backwards compatibility |
| 56 | + |
| 57 | +It is assumed that the helm release object as stored in k8s will not break |
| 58 | +older clients if new fields are added while existing fields are untouched. |
| 59 | + |
| 60 | +Backwards compatibility will be tested during implementation! |
| 61 | + |
| 62 | +## Security implications |
| 63 | + |
| 64 | +The proposed solution should not have an impact on security. |
| 65 | + |
| 66 | +## How to teach this |
| 67 | + |
| 68 | +Since the way that helm is invoked is not altered, there will not be much to teach here. |
| 69 | +The usage of the timeout parameter is encouraged, but since the default timeout is already 5m, not even that |
| 70 | +needs to be encouraged. |
| 71 | + |
| 72 | +It should just reduce the amount of frustration when dealing with pending and failed helm releases. |
| 73 | + |
| 74 | +A retry of a failed command should just work (assuming the retry happens when no other client has a valid lock). |
| 75 | + |
| 76 | +## Reference implementation |
| 77 | + |
| 78 | +helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations |
| 79 | + |
| 80 | +acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations |
| 81 | + |
| 82 | +## Rejected ideas |
| 83 | + |
| 84 | +None |
| 85 | + |
| 86 | +## Open issues |
| 87 | + |
| 88 | +[] HIP status `accepted` |
| 89 | + |
| 90 | +[x] Reference implementation |
| 91 | + |
| 92 | +[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts) |
| 93 | + |
| 94 | +[] Test for kill scenario (forever stuck in pending) |
| 95 | + |
| 96 | +[] Backwards compatibility check (looking good already) |
| 97 | + |
| 98 | +## References |
| 99 | + |
| 100 | +https://github.com/helm/helm/issues/7476 |
| 101 | +https://github.com/rancher/rancher/issues/44530 |
| 102 | +https://github.com/helm/helm/issues/11863 |