Commit f6040ca

feat: autorecover from stuck situations
Signed-off-by: Gernot Feichter <[email protected]>
1 parent 36ac1d9 commit f6040ca

1 file changed, +96 -0 lines changed

hips/hip-9999.md

Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
---
hip: 9999
title: "Autorecover from stuck situations"
authors: [ "Gernot Feichter <[email protected]>" ]
created: "2024-07-12"
type: "feature"
status: "draft"
helm-version: 3
---

## Abstract

This proposal lets helm recover automatically from stuck deployments, which simplifies handling
for both interactive users and CI/CD pipelines. Today this is not possible unless users implement
boilerplate code around their helm invocations.

## Motivation

If a helm deployment fails, I want to be able to retry it,
ideally by running the same command again to keep things simple.

There are two known situations in which a retry will NOT work:
1. A helm upgrade/install process is killed while the release is in state `PENDING-UPGRADE` or `PENDING-INSTALL`.
2. The initial helm release installation (as performed via `helm upgrade --install`) is in state `FAILED`.

Known workarounds for recovering from such situations, which should become OBSOLETE once this HIP is implemented:
1. `kubectl delete secret '<the name of the secret where helm stores release information>'`. (Not an option if you don't want to lose all history)
2. `helm delete` your release. (Not an option if you don't want to lose all history)
3. `helm rollback` your release. (Not possible if it is the first installation)

## Rationale

The proposed solution uses a lock that is stored in k8s, so that every client can tell whether a helm
release is locked, whether it holds that lock itself, and for how long the lock is valid.

It reuses existing helm parameters such as the `--timeout` parameter (which defaults to 5m) to determine how long a helm release
may remain stuck in a pending state.

## Specification

The `--timeout` parameter takes on a broader meaning.
Previously, the `--timeout` parameter only affected the helm process running on the respective client.
After implementation, the `--timeout` value will also be stored in the helm release object (secret) in k8s and
will therefore indirectly affect any helm processes running in parallel against the same release.

`helm ls -a` shows two new columns; regular `helm ls` does NOT show them:
- LOCKED TILL
  `<datetime>` calculated by the helm client: k8s server time + timeout parameter value
- SESSION ID
  Unique, random session id generated by the client

Furthermore, if the helm client process is terminated (SIGTERM), it tries to clear the LOCKED TILL and
SESSION ID values and to set the release to a failed state before exiting, in order to free the lock.

## Backwards compatibility

It is assumed that the helm release object as stored in k8s will not break
older clients if new fields are added while existing fields are left untouched.

Backwards compatibility will be tested during implementation.

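The assumption rests on how JSON decoding commonly behaves. The Go sketch below shows only the default behavior of the standard `encoding/json` package, which ignores fields it does not know. The struct, the field names, and the payload are illustrative and not Helm's actual release schema, and whether older helm clients decode releases this way still has to be verified.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// oldRelease models how an older client might see a release: it knows nothing
// about the lock fields. The field names are illustrative only.
type oldRelease struct {
	Name    string `json:"name"`
	Version int    `json:"version"`
}

// decodeOld parses a release payload the way an older client would.
func decodeOld(payload []byte) (oldRelease, error) {
	var r oldRelease
	err := json.Unmarshal(payload, &r)
	return r, err
}

func main() {
	// Payload written by a newer client, including hypothetical lock fields.
	payload := []byte(`{"name":"demo","version":3,"lockedTill":"2024-07-12T12:05:00Z","sessionID":"abc"}`)

	r, err := decodeOld(payload)
	fmt.Println(r.Name, r.Version, err) // prints: demo 3 <nil>
}
```
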
## Security implications

The proposed solution should not have an impact on security.

## How to teach this

Since the way that helm is invoked is not altered, there will not be much to teach here.
Setting the timeout parameter explicitly is useful, but since the default timeout is already 5m, not even that
needs to be taught.

The change should simply reduce frustration when dealing with pending and failed helm releases.

A retry of a failed command should just work (assuming the retry happens when no other client holds a valid lock).

## Reference implementation

TODO

## Rejected ideas

None

## Open issues

[] HIP status `accepted`

[] Reference implementation

[] Backwards compatibility check

## References

https://github.com/helm/helm/issues/7476
https://github.com/rancher/rancher/issues/44530
https://github.com/helm/helm/issues/11863