KaiyiLiu1234
Collaborator

Before this patch, the Prometheus User Workload Token Secret did not have a reliable reconciliation pattern to handle expirations. This patch adds a controller that removes any secrets that have expired, so they can be recreated through reconciliation. The UWM Token Secret's lifetime is reduced from 1 year to 1 week, and the controller checks for expired tokens once per day.
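
The expiry rule described above can be sketched in plain Go. This is a minimal illustration, not the code from this PR: `nextAction`, `tokenTTL`, `checkInterval` and `exampleCreated` are hypothetical names, and requeuing after the shorter of "time remaining" and "check interval" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Values from the PR description: a 1-week token lifetime and a daily check.
const (
	tokenTTL      = 7 * 24 * time.Hour
	checkInterval = 24 * time.Hour
)

// exampleCreated is an arbitrary creation time used only for the demo below.
var exampleCreated = time.Date(2025, time.July, 24, 0, 0, 0, 0, time.UTC)

// nextAction (hypothetical helper) reports whether a token created at
// `created` counts as expired at `now`. If it has not expired, it also
// returns how long to wait before checking again: the check interval, or
// the time remaining when that is shorter (an assumption, not the PR's rule).
func nextAction(created, now time.Time) (expired bool, requeueAfter time.Duration) {
	expiresAt := created.Add(tokenTTL)
	if !now.Before(expiresAt) {
		return true, 0 // expired: delete the secret so it gets recreated
	}
	if remaining := expiresAt.Sub(now); remaining < checkInterval {
		return false, remaining
	}
	return false, checkInterval
}

func main() {
	ages := []time.Duration{
		24 * time.Hour,
		6*24*time.Hour + 12*time.Hour,
		8 * 24 * time.Hour,
	}
	for _, age := range ages {
		expired, after := nextAction(exampleCreated, exampleCreated.Add(age))
		fmt.Printf("age=%v expired=%t requeueAfter=%v\n", age, expired, after)
	}
}
```

In a real controller the "requeue after" value would typically be handed back to the reconcile loop rather than printed.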

@github-actions github-actions bot added the feat label (A new feature or enhancement) Jul 24, 2025
@codecov

codecov bot commented Jul 24, 2025

Codecov Report

❌ Patch coverage is 64.00000% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.26%. Comparing base (f210aba) to head (bb42bad).
⚠️ Report is 2 commits behind head on v1alpha1.

Files with missing lines                     Patch %   Lines
pkg/reconciler/security.go                   41.79%    36 Missing and 3 partials ⚠️
pkg/components/power-monitor/deployment.go   94.11%    2 Missing and 1 partial ⚠️
pkg/reconciler/power-monitor.go              40.00%    2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           v1alpha1     #588      +/-   ##
============================================
- Coverage     79.35%   77.26%   -2.09%     
============================================
  Files            11       11              
  Lines          1172     1267      +95     
============================================
+ Hits            930      979      +49     
- Misses          217      258      +41     
- Partials         25       30       +5     


@KaiyiLiu1234
Collaborator Author

@sthaha I have tested both approaches: two controllers, and a ticker in a single controller (power monitor internal). Both work as expected under varying conditions (on creation, on CR changes, on removal, etc.). I have also confirmed that both are seamless (they will not cause any metrics to disappear). I believe using a second controller is better because, if we need to add more secrets or secret tokens with expirations, this controller can be adapted to handle them.

@SamYuan1990
Collaborator

@KaiyiLiu1234, a small question: what is the secret used for? It sounds like a cert, so after this PR, how do we integrate with cert-manager?

sthaha previously requested changes Jul 24, 2025
Collaborator

@sthaha sthaha left a comment

I think we can avoid a lot of repetition of logic by using right interfaces.

@KaiyiLiu1234 KaiyiLiu1234 force-pushed the token-expiration branch 2 times, most recently from 6ce25cd to d4b8f60 Compare July 28, 2025 23:01
@KaiyiLiu1234
Collaborator Author

KaiyiLiu1234 commented Jul 28, 2025

@KaiyiLiu1234, a small question: what is the secret used for? It sounds like a cert, so after this PR, how do we integrate with cert-manager?

@SamYuan1990 The secret is called prometheus-user-workload-token. It is used for authorization, so that kube-rbac-proxy can identify and authenticate the caller as prometheus-user-workload (via a token review). This is a critical part of the security for kube-rbac-proxy, because without the token kube-rbac-proxy cannot identify the client attempting to contact it. This is not a certificate; it is a secret that contains a JWT for the service account.
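
Since the secret holds a JWT, its expiry can be inspected by decoding the token's payload. The sketch below uses only the Go standard library and does not verify the signature, so it is suitable for inspection only. How the operator itself determines expiry is not shown in this thread; `expiryFromJWT` and `fakeToken` are hypothetical helpers, and the token built here is an unsigned stand-in, not a real service account token.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// expiryFromJWT reads the "exp" claim (seconds since the Unix epoch) from a
// JWT payload. It does NOT verify the signature.
func expiryFromJWT(token string) (time.Time, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return time.Time{}, errors.New("token is not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return time.Time{}, fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Exp int64 `json:"exp"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return time.Time{}, fmt.Errorf("parse claims: %w", err)
	}
	if claims.Exp == 0 {
		return time.Time{}, errors.New("token has no exp claim")
	}
	return time.Unix(claims.Exp, 0).UTC(), nil
}

// fakeToken builds an unsigned, illustrative token with the given expiry.
// The subject is a placeholder, not a value taken from the PR.
func fakeToken(exp int64) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	payload := enc([]byte(fmt.Sprintf(`{"sub":"example-subject","exp":%d}`, exp)))
	return header + "." + payload + ".signature"
}

func main() {
	expiresAt, err := expiryFromJWT(fakeToken(1759144355))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("token expires at", expiresAt.Format(time.RFC3339))
}
```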

@KaiyiLiu1234
Collaborator Author

@sthaha flags have been added and code improvements have been made. Manual check using the ttl and refresh flags is required (by default, ttl is 7 days and refresh is 1 day).
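
For readers unfamiliar with duration flags, here is a rough sketch of how two such options with the defaults mentioned above (7 days and 1 day) could be declared using Go's standard `flag` package. The flag names `token-ttl` and `token-refresh`, the `tokenOptions` type and the validation rule are assumptions made for illustration; the actual flag names in the operator are not given in this thread.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"time"
)

// tokenOptions groups the two settings discussed in the comment above.
type tokenOptions struct {
	TTL     time.Duration // how long a token is valid
	Refresh time.Duration // how often expiry is checked
}

// parseTokenOptions parses hypothetical -token-ttl and -token-refresh flags.
func parseTokenOptions(args []string) (tokenOptions, error) {
	var opts tokenOptions
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	fs.DurationVar(&opts.TTL, "token-ttl", 7*24*time.Hour,
		"token lifetime (default value; hypothetical flag name)")
	fs.DurationVar(&opts.Refresh, "token-refresh", 24*time.Hour,
		"interval between expiry checks (default value; hypothetical flag name)")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if opts.TTL <= 0 || opts.Refresh <= 0 {
		return opts, fmt.Errorf("ttl and refresh must be positive, got %v and %v",
			opts.TTL, opts.Refresh)
	}
	return opts, nil
}

func main() {
	opts, err := parseTokenOptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(2)
	}
	fmt.Printf("ttl=%v refresh=%v\n", opts.TTL, opts.Refresh)
}
```

Short values such as `-token-ttl=10m -token-refresh=1m` make manual validation practical without waiting a full week.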

@KaiyiLiu1234 KaiyiLiu1234 force-pushed the token-expiration branch 2 times, most recently from 6e87714 to 3aeb732 Compare July 31, 2025 01:17

//go:embed assets/dashboards/power-monitor-namespace-info.json
namespaceInfoDashboardJson string

Collaborator

Add a comment specifying that it's a default value?

Collaborator

Add UT for the new changes?

Collaborator

Same UT for this as well?

@vprashar2929
Collaborator

@KaiyiLiu1234 Are there new code changes? I see the tests are failing after recent push: https://github.com/sustainable-computing-io/kepler-operator/actions/runs/17871112754/job/51119968780?pr=588

- op: add
  path: /spec/template/spec/containers/0/args/0
  value: --deployment-namespace=kepler
- op: add

Collaborator

I don't think we need this for k8s, as UWM is specific to OCP.

Collaborator Author

OK, I can remove that. Also, can we just merge this if it works as expected after your tests? I can add the reconciler tests afterwards in a separate PR.

@vprashar2929
Collaborator

@KaiyiLiu1234 Unrelated to your changes, but something we should fix is that the deleter sometimes takes a long time:

2025-09-29T11:03:35Z    ERROR   Reconciler error        {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"prometheus-user-workload-token","namespace":"power-monitor"}, "namespace": "power-monitor", "name": "prometheus-user-workload-token", "reconcileID": "2fdc6c7d-1d04-421a-839a-725dadf66fae", "error": "prometheus-user-workload-token (/v1, Kind=Secret): deleter: timed out waiting for deletion : context deadline exceeded"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:224
2025-09-29T11:03:35Z    LEVEL(-5)       Reconciling     {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"prometheus-user-workload-token","namespace":"power-monitor"}, "namespace": "power-monitor", "name": "prometheus-user-workload-token", "reconcileID": "121eba6b-ea59-475e-b2ca-b3c69eb7b3cf"}
2025-09-29T11:03:35Z    INFO    Start of reconcile      {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"prometheus-user-workload-token","namespace":"power-monitor"}, "namespace": "power-monitor", "name": "prometheus-user-workload-token", "reconcileID": "121eba6b-ea59-475e-b2ca-b3c69eb7b3cf"}
2025-09-29T11:03:35Z    INFO    secret not expired yet, requeuing       {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"prometheus-user-workload-token","namespace":"power-monitor"}, "namespace": "power-monitor", "name": "prometheus-user-workload-token", "reconcileID": "121eba6b-ea59-475e-b2ca-b3c69eb7b3cf", "expiration-time": "2025-09-29T11:12:35Z", "time-until-expiration": "8m59.660532559s"}
2025-09-29T11:03:35Z    INFO    End of reconcile        {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "Secret": {"name":"prometheus-user-workload-token","namespace":"power-monitor"}, "namespace": "power-monitor", "name": "prometheus-user-workload-token", "reconcileID": "121eba6b-ea59-475e-b2ca-b3c69eb7b3cf"}
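
The "timed out waiting for deletion : context deadline exceeded" error in the log above is the usual shape of a polling wait bounded by a context deadline. The following is a generic sketch of that pattern using only the standard library, not the operator's actual deleter: `waitForDeletion` and `simulateDeletion` are hypothetical names, and the "object" is simulated by a counter instead of an API call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForDeletion polls exists() until it returns false or ctx is done.
// When the deadline passes first, the context error is wrapped so callers
// can detect it with errors.Is(err, context.DeadlineExceeded).
func waitForDeletion(ctx context.Context, interval time.Duration, exists func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if !exists() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for deletion: %w", ctx.Err())
		case <-ticker.C:
			// poll again
		}
	}
}

// simulateDeletion fakes an object that disappears after checksUntilGone
// polls, and waits for it with the given timeout.
func simulateDeletion(checksUntilGone int, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	checks := 0
	return waitForDeletion(ctx, 5*time.Millisecond, func() bool {
		checks++
		return checks < checksUntilGone
	})
}

func main() {
	fmt.Println("fast deletion:", simulateDeletion(3, time.Second))
	fmt.Println("slow deletion:", simulateDeletion(1<<30, 30*time.Millisecond))
}
```

In the log, the next reconcile starts right after the error and finds the secret present and not yet expired, which suggests the retry recovers on its own. A longer deadline or a requeue instead of an error are possible directions, but which one fits the operator is a question for the maintainers.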

Before this patch, the Prometheus User Workload Token Secret did not
have a reliable reconciliation pattern to handle expirations. This patch
added a controller which will remove any secrets that have expired, so they
can be reconciled. The UWM Token Secret has its expiration date reduced from 1 year
to 1 week, and the controller will check for expired tokens once per day.

Signed-off-by: Kaiyi Liu <[email protected]>
@vprashar2929 vprashar2929 dismissed sthaha’s stale review October 6, 2025 05:40

changes look good

@vprashar2929 vprashar2929 merged commit 06f822b into sustainable-computing-io:v1alpha1 Oct 6, 2025
21 of 23 checks passed
Labels

feat A new feature or enhancement manual-validation

4 participants