Commit 9cacda9

feat: autorecover from stuck situations
Signed-off-by: Gernot Feichter <[email protected]>
1 parent 36ac1d9 commit 9cacda9

1 file changed

+102
-0
lines changed

hips/hip-9999.md

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@
---
hip: 9999
title: "Autorecover from stuck situations"
authors: [ "Gernot Feichter <[email protected]>" ]
created: "2024-07-12"
type: "feature"
status: "draft"
helm-version: 3
---

## Abstract

This proposal aims to let both manual users and CI/CD pipelines auto-recover from stuck deployments.
Today this is not possible unless users wrap their helm invocations in boilerplate code.

## Motivation

If a helm deployment fails, I want to be able to retry it,
ideally by running the same command again to keep things simple.

There are two known situations in which a retry will NOT work:
1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`.
2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`.

Known workarounds for recovering from such situations, which should become OBSOLETE once this HIP is implemented:
1. `kubectl delete secret '<the name of the secret where helm stores release information>'`. (Not possible if you don't want to lose all history)
2. `helm delete` your release. (Not possible if you don't want to lose all history)
3. `helm rollback` your release. (Not possible if it is the first installation)

## Rationale

The proposed solution uses a locking mechanism that is stored in k8s, so that every client knows whether a helm
release is locked by itself or by another client, and for how long the lock is valid.

It reuses existing helm parameters, namely `--timeout` (which defaults to 5m), to determine how long a helm release
may remain stuck in a pending state.

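The decision each client would make can be sketched as follows. This is a minimal illustration, not the reference implementation: the `ReleaseLock` type, its field names, and the `may_proceed` function are hypothetical, and the rule that an expired foreign lock may be taken over is an assumption derived from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ReleaseLock:
    """Hypothetical lock fields stored alongside the release in k8s."""
    session_id: Optional[str] = None
    locked_till: Optional[datetime] = None


def may_proceed(lock: ReleaseLock, my_session: str, now: datetime) -> bool:
    """Return True if this client may operate on the release."""
    if lock.session_id is None or lock.locked_till is None:
        return True  # nobody holds the lock
    if lock.session_id == my_session:
        return True  # the lock is our own
    return now >= lock.locked_till  # foreign lock: only once it has expired


# Example: a foreign lock taken at 12:00 UTC with a 5m timeout.
taken = datetime(2024, 7, 12, 12, 0, tzinfo=timezone.utc)
foreign = ReleaseLock(session_id="other-client", locked_till=taken + timedelta(minutes=5))
print(may_proceed(foreign, "me", taken))                         # False: lock still valid
print(may_proceed(foreign, "me", taken + timedelta(minutes=6)))  # True: lock expired
```

Under this sketch, a release abandoned in a pending state becomes recoverable by any client once the timeout has elapsed, while a lock that is still valid keeps blocking concurrent operations.
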
## Specification

The `--timeout` parameter gets a deeper meaning.
Previously, the `--timeout` parameter only affected the helm process running on the respective client.
After implementation, the `--timeout` value will be stored in the helm release object (secret) in k8s and
will thereby indirectly affect other helm processes running in parallel.

`helm ls -a` shows two new columns; regular `helm ls` does NOT show them:
- LOCKED TILL
  `<datetime>` calculated by the helm client: k8s server time + timeout parameter value
- SESSION ID
  Unique, random session ID generated by the client

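How the two values might be derived is sketched below. The function name `new_lock` and the dictionary keys are hypothetical, and using a UUID as the random session ID is an assumption; the proposal only requires the ID to be unique and random.

```python
import uuid
from datetime import datetime, timedelta, timezone


def new_lock(server_time: datetime, timeout: timedelta = timedelta(minutes=5)) -> dict:
    """Build the two hypothetical lock fields for a release."""
    return {
        "session_id": str(uuid.uuid4()),       # unique, random, generated by the client
        "locked_till": server_time + timeout,  # k8s server time + --timeout value
    }


# The local clock is only a stand-in here; the proposal uses the k8s server time.
lock = new_lock(datetime.now(timezone.utc), timedelta(minutes=10))
print(lock["session_id"], lock["locked_till"].isoformat())
```

Basing the expiry on the server time rather than on each client's clock avoids disagreements between clients whose clocks have drifted apart.
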
Furthermore, if the helm client process is killed (SIGTERM), it tries to clear the LOCKED TILL and
SESSION ID values and to set the release to a failed state before terminating, in order to free the lock.

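The cleanup on SIGTERM could look roughly like this. It is a sketch under stated assumptions: the release is modelled as a plain dictionary with hypothetical keys, and `persist` is a hypothetical callback standing in for writing the release back to k8s.

```python
import signal
import sys


def free_lock(release: dict) -> dict:
    """Clear the hypothetical lock fields and mark the release as failed."""
    release["locked_till"] = None
    release["session_id"] = None
    release["status"] = "failed"
    return release


def make_sigterm_handler(release: dict, persist):
    """Build a handler that frees the lock, persists the release, then exits."""
    def handler(signum, frame):
        persist(free_lock(release))
        sys.exit(128 + signum)  # conventional exit code for termination by signal
    return handler


def register(release: dict, persist) -> None:
    """Install the handler for SIGTERM (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(release, persist))
```

A handler like this is best effort only: SIGKILL cannot be caught, so the expiring lock remains the fallback for processes that die without a chance to clean up.
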
## Backwards compatibility

It is assumed that adding new fields to the helm release object as stored in k8s will not break
older clients, as long as existing fields are left untouched.

Backwards compatibility will be tested during implementation!

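The assumption can be illustrated with a toy decoder that, like a tolerant older client, reads only the fields it knows and ignores the rest. The field names below are illustrative and do not reflect the actual layout of the helm release object. For context, Go's `encoding/json` ignores unknown object keys by default unless `DisallowUnknownFields` is set; whether older helm clients rely on that default is what the planned test would confirm.

```python
import json

# Fields an older client knows about (illustrative subset, not helm's real schema).
KNOWN_FIELDS = ("name", "version", "status")


def decode_as_old_client(raw: str) -> dict:
    """Decode a release while dropping any keys the older client does not know."""
    data = json.loads(raw)
    return {key: data[key] for key in KNOWN_FIELDS if key in data}


# A release written by a newer client, carrying the two hypothetical lock fields.
raw_release = json.dumps({
    "name": "demo",
    "version": 2,
    "status": "failed",
    "locked_till": "2024-07-12T12:05:00Z",
    "session_id": "abc",
})
print(decode_as_old_client(raw_release))
```
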
## Security implications

The proposed solution should not have an impact on security.

## How to teach this

Since the way helm is invoked does not change, there is not much to teach here.
Using the timeout parameter is encouraged, but since the default timeout is already 5m, not even that
needs to be taught explicitly.

The change should simply reduce the frustration of dealing with pending and failed helm releases.

A retry of a failed command should just work (assuming the retry happens when no other client holds a valid lock).

## Reference implementation

helm: https://github.com/gerrnot/helm/tree/feat/autorecover-from-stuck-situations

acceptance-testing: https://github.com/gerrnot/acceptance-testing/tree/feat/autorecover-from-stuck-situations

## Rejected ideas

None

## Open issues

[] HIP status `accepted`

[x] Reference implementation

[x] Test for concurrent upgrade (valid lock should still block concurrent upgrade attempts)

[] Test for kill scenario (forever stuck in pending)

[] Backwards compatibility check (looking good already)

## References

https://github.com/helm/helm/issues/7476
https://github.com/rancher/rancher/issues/44530
https://github.com/helm/helm/issues/11863
