✨ Performance Alerting #2081

Open
wants to merge 1 commit into
base: main
15 changes: 15 additions & 0 deletions .github/workflows/e2e.yaml
Contributor

This file has trailing whitespace in a number of places that you should clean up.

@@ -35,6 +35,21 @@ jobs:
       - name: Run e2e tests
         run: ARTIFACT_PATH=/tmp/artifacts make test-e2e

+      - name: alerts-check
+        # Read the alerts collected during the run and print a GitHub Actions annotation for
+        # each one, containing the alert name and description: an error for firing alerts
+        # and a warning for pending ones.
+        #
+        # NOTE: Leaving this as annotating-only instead of failing the run until we have some more
+        # finely-tuned alerts.
+        run: |
+          if [[ -s /tmp/artifacts/alerts.out ]]; then
+            jq -r 'if .state=="firing" then
+              "::error title=Prometheus Alert Firing::\(.labels.alertname): \(.annotations.description)"
+            elif .state=="pending" then
+              "::warning title=Prometheus Alert Pending::\(.labels.alertname): \(.annotations.description)"
+            else empty end' /tmp/artifacts/alerts.out
+          fi
+
       - uses: actions/upload-artifact@v4
         if: failure()
         with:
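To check locally what the `alerts-check` step would print, the jq filter can be run against sample data. This is a minimal sketch, not part of the PR: the two sample alerts and the `format_alerts` helper are hypothetical, and the input is assumed to be one alert object per entry, as `alerts.out` is described in the Makefile change.

```shell
#!/usr/bin/env bash
# Hypothetical sample: one firing and one pending alert, one JSON object per
# line, in the shape the alerts-check step expects to find in alerts.out.
sample_alerts='{"state":"firing","labels":{"alertname":"oom-events"},"annotations":{"description":"container manager experienced OOM event(s); count=1"}}
{"state":"pending","labels":{"alertname":"catalogd-cpu-usage"},"annotations":{"description":"catalogd using high cpu resources for 5 minutes: 25.00%"}}'

# Turn each alert into a GitHub Actions workflow command. The trailing
# "else empty" keeps the filter valid on jq versions that require an else branch.
format_alerts() {
  jq -r 'if .state == "firing" then
      "::error title=Prometheus Alert Firing::\(.labels.alertname): \(.annotations.description)"
    elif .state == "pending" then
      "::warning title=Prometheus Alert Pending::\(.labels.alertname): \(.annotations.description)"
    else empty end'
}

annotations="$(printf '%s\n' "$sample_alerts" | format_alerts)"
printf '%s\n' "$annotations"
```

Each printed line is a workflow command, so running the same filter inside a job would surface the firing alert as an error annotation and the pending one as a warning, without changing the job's exit status.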
26 changes: 15 additions & 11 deletions Makefile
@@ -277,19 +277,23 @@ test-experimental-e2e: run image-registry prometheus experimental-e2e e2e e2e-me
 .PHONY: prometheus
 prometheus: PROMETHEUS_NAMESPACE := olmv1-system
 prometheus: PROMETHEUS_VERSION := v0.83.0
+prometheus: TMPDIR := $(shell mktemp -d)
 prometheus: #EXHELP Deploy Prometheus into specified namespace
-	./hack/test/setup-monitoring.sh $(PROMETHEUS_NAMESPACE) $(PROMETHEUS_VERSION) $(KUSTOMIZE)
-
-# The metrics.out file contains raw json data of the metrics collected during a test run.
-# In an upcoming PR, this query will be replaced with one that checks for alerts from
-# prometheus. Prometheus will gather metrics we currently query for over the test run,
-# and provide alerts from the metrics based on the rules that we set.
+	trap 'echo "Cleaning up $(TMPDIR)"; rm -rf "$(TMPDIR)"' EXIT; \
+	curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/kustomization.yaml" > "$(TMPDIR)/kustomization.yaml"; \
+	curl -s "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/refs/tags/$(PROMETHEUS_VERSION)/bundle.yaml" > "$(TMPDIR)/bundle.yaml"; \
+	(cd $(TMPDIR) && $(KUSTOMIZE) edit set namespace $(PROMETHEUS_NAMESPACE)) && kubectl create -k "$(TMPDIR)"
+	kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator
+	$(KUSTOMIZE) build config/prometheus | CATALOGD_SERVICE_CERT=$(shell kubectl get certificate -n olmv1-system catalogd-service-cert -o jsonpath={.spec.secretName}) envsubst '$$CATALOGD_SERVICE_CERT' | kubectl apply -f -
Contributor

The name of this secret ought to be fixed, so you shouldn't have to extract it?

+	kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) -l app.kubernetes.io/name=prometheus-operator --timeout=60s
+	kubectl wait --for=create pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=60s
+	kubectl wait --for=condition=Ready pods -n $(PROMETHEUS_NAMESPACE) prometheus-prometheus-0 --timeout=120s
Contributor

@camilamacedo86, Jul 11, 2025

Wouldn't it be better to centralise the Prometheus installation and related configurations in the hack directory? It might help keep things more organised and easier to understand.

Contributor

Actually, I would prefer it to be part of the existing e2e manifests, since this is something we are planning to do for our e2e's.


+# The output alerts.out file contains any alerts, pending or firing, collected during a test run in json format.
Contributor

This line has trailing whitespace

 .PHONY: e2e-metrics
-e2e-metrics: #EXHELP Request metrics from prometheus; place in ARTIFACT_PATH if set
-	curl -X POST \
-	-H "Content-Type: application/x-www-form-urlencoded" \
-	--data 'query={pod=~"operator-controller-controller-manager-.*|catalogd-controller-manager-.*"}' \
-	http://localhost:30900/api/v1/query > $(if $(ARTIFACT_PATH),$(ARTIFACT_PATH),.)/metrics.out
+e2e-metrics: ALERTS_FILE_PATH := $(if $(ARTIFACT_PATH),$(ARTIFACT_PATH),.)/alerts.out
+e2e-metrics: #EXHELP Request pending and firing alerts from prometheus; place in ARTIFACT_PATH if set
+	curl -X GET http://localhost:30900/api/v1/alerts | jq 'if (.data.alerts | length) > 0 then .data.alerts[] else empty end' > $(ALERTS_FILE_PATH)

 .PHONY: extension-developer-e2e
 extension-developer-e2e: KIND_CLUSTER_NAME := operator-controller-ext-dev-e2e
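The `e2e-metrics` recipe above flattens the Prometheus `/api/v1/alerts` response into a stream of alert objects, and writes nothing when there are no alerts, so the later `-s` file check in the workflow stays false. A minimal sketch of that behaviour follows. The response bodies and the `extract_alerts` helper are hypothetical, and `-c` is added here only to get one alert per line.

```shell
#!/usr/bin/env bash
# Hypothetical response bodies in the shape of Prometheus' GET /api/v1/alerts.
api_response='{"status":"success","data":{"alerts":[
  {"state":"firing","labels":{"alertname":"reconciler-panic"},"annotations":{"description":"panic"}},
  {"state":"pending","labels":{"alertname":"catalogd-memory-usage"},"annotations":{"description":"high memory"}}
]}}'
empty_response='{"status":"success","data":{"alerts":[]}}'

# Emit one compact JSON object per alert; emit nothing when there are none,
# so that a later `[[ -s alerts.out ]]` check sees an empty file.
extract_alerts() {
  jq -c 'if (.data.alerts | length) > 0 then .data.alerts[] else empty end'
}

alerts_out="$(printf '%s' "$api_response" | extract_alerts)"
no_alerts_out="$(printf '%s' "$empty_response" | extract_alerts)"
printf '%s\n' "$alerts_out"
```

The `.data.alerts[]` spelling is used instead of `.data.alerts.[]` because the dotted form is only accepted by newer jq releases.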
8 changes: 8 additions & 0 deletions config/prometheus/auth_token.yaml
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: Secret
+type: kubernetes.io/service-account-token
+metadata:
+  name: prometheus-metrics-token
+  namespace: system
+  annotations:
+    kubernetes.io/service-account.name: prometheus
34 changes: 34 additions & 0 deletions config/prometheus/catalogd_service_monitor.yaml
@@ -0,0 +1,34 @@
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: catalogd-controller-manager-metrics-monitor
+  namespace: system
+spec:
+  endpoints:
+    - path: /metrics
+      port: metrics
+      interval: 10s
+      scheme: https
+      authorization:
+        credentials:
+          name: prometheus-metrics-token
+          key: token
+      tlsConfig:
+        # NAMESPACE_PLACEHOLDER replaced by replacements in kustomization.yaml
+        serverName: catalogd-service.NAMESPACE_PLACEHOLDER.svc
+        insecureSkipVerify: false
+        ca:
+          secret:
+            # CATALOGD_SERVICE_CERT must be replaced by envsubst
+            name: ${CATALOGD_SERVICE_CERT}
+            key: ca.crt
+        cert:
+          secret:
+            name: ${CATALOGD_SERVICE_CERT}
+            key: tls.crt
+        keySecret:
+          name: ${CATALOGD_SERVICE_CERT}
+          key: tls.key
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: catalogd
40 changes: 40 additions & 0 deletions config/prometheus/kubelet_service_monitor.yaml
@@ -0,0 +1,40 @@
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: kubelet
+  namespace: system
+  labels:
+    k8s-app: kubelet
+spec:
+  jobLabel: k8s-app
+  endpoints:
+    - port: https-metrics
+      scheme: https
+      path: /metrics
+      interval: 10s
+      honorLabels: true
+      tlsConfig:
+        insecureSkipVerify: true
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      metricRelabelings:
+        - action: keep
+          sourceLabels: [pod,container]
+          regex: (operator-controller|catalogd).*;manager
+    - port: https-metrics
+      scheme: https
+      path: /metrics/cadvisor
+      interval: 10s
+      honorLabels: true
+      tlsConfig:
+        insecureSkipVerify: true
+      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
+      metricRelabelings:
+        - action: keep
+          sourceLabels: [pod,container]
+          regex: (operator-controller|catalogd).*;manager
+  selector:
+    matchLabels:
+      k8s-app: kubelet
+  namespaceSelector:
+    matchNames:
+      - kube-system
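The two `metricRelabelings` entries above keep only series for the `manager` container of operator-controller and catalogd pods. Prometheus joins the `sourceLabels` values with `;` (the default separator) and matches the fully anchored regex against the result. Below is a rough sketch of that matching, using `grep -E` as a stand-in for Prometheus' RE2 engine. The `keeps_series` helper and the pod names are hypothetical.

```shell
#!/usr/bin/env bash
# Regex copied from the ServiceMonitor; Prometheus anchors it at both ends,
# which is made explicit here with ^( ... )$.
keep_regex='^((operator-controller|catalogd).*;manager)$'

# Returns 0 (keep) when "<pod>;<container>" matches, 1 (drop) otherwise.
keeps_series() {
  local pod="$1" container="$2"
  printf '%s;%s\n' "$pod" "$container" | grep -Eq "$keep_regex"
}

# Hypothetical pod/container pairs, one "pod container" pair per entry.
for pair in \
  "operator-controller-controller-manager-abc12 manager" \
  "catalogd-controller-manager-xyz34 manager" \
  "catalogd-controller-manager-xyz34 kube-rbac-proxy" \
  "coredns-5d78c9869d-abcde coredns"; do
  set -- $pair
  if keeps_series "$1" "$2"; then
    echo "keep  $1/$2"
  else
    echo "drop  $1/$2"
  fi
done
```

Because the match is anchored, a container named `manager-sidecar` would be dropped even though its name starts with `manager`.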
35 changes: 35 additions & 0 deletions config/prometheus/kustomization.yaml
@@ -0,0 +1,35 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+namespace: olmv1-system
+resources:
+  - prometheus.yaml
+  - catalogd_service_monitor.yaml
+  - kubelet_service_monitor.yaml
+  - operator_controller_service_monitor.yaml
+  - prometheus_rule.yaml
+  - auth_token.yaml
+  - network_policy.yaml
+  - service.yaml
+  - rbac
+replacements:
+  - source:
+      kind: ServiceMonitor
+      name: catalogd-controller-manager-metrics-monitor
+      fieldPath: metadata.namespace
+    targets:
+      - select:
+          kind: ServiceMonitor
+          name: catalogd-controller-manager-metrics-monitor
+        fieldPaths:
+          - spec.endpoints.0.tlsConfig.serverName
+        options:
+          delimiter: '.'
+          index: 1
+      - select:
+          kind: ServiceMonitor
+          name: operator-controller-controller-manager-metrics-monitor
+        fieldPaths:
+          - spec.endpoints.0.tlsConfig.serverName
+        options:
+          delimiter: '.'
+          index: 1
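The `delimiter` and `index` options above tell kustomize to split the target `serverName` on `.` and overwrite only the segment at index 1, which is how `NAMESPACE_PLACEHOLDER` gets swapped for the ServiceMonitor's namespace. Here is a small shell sketch of that substitution. It assumes the source namespace resolves to `olmv1-system` and supports only single-character delimiters; `replace_segment` is a hypothetical helper, not a kustomize command.

```shell
#!/usr/bin/env bash
# Mimic a kustomize replacement with options {delimiter: '.', index: 1}:
# split the target on the delimiter and overwrite only the segment at the index.
replace_segment() {
  local target="$1" delimiter="$2" index="$3" value="$4"
  local -a parts
  # Split the target string into an array on the (single-character) delimiter.
  IFS="$delimiter" read -r -a parts <<< "$target"
  parts[$index]="$value"
  # "${parts[*]}" joins the array using the first character of IFS.
  local IFS="$delimiter"
  printf '%s\n' "${parts[*]}"
}

server_name="$(replace_segment "catalogd-service.NAMESPACE_PLACEHOLDER.svc" "." 1 "olmv1-system")"
echo "$server_name"
```

The same call with `operator-controller-service.NAMESPACE_PLACEHOLDER.svc` covers the second target in the `replacements` list.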
16 changes: 16 additions & 0 deletions config/prometheus/network_policy.yaml
@@ -0,0 +1,16 @@
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: prometheus
+  namespace: system
+spec:
+  podSelector:
+    matchLabels:
+      app.kubernetes.io/name: prometheus
+  policyTypes:
+    - Egress
+    - Ingress
+  egress:
+    - {} # Allows all egress traffic for metrics requests
+  ingress:
+    - {} # Allows us to query prometheus
33 changes: 33 additions & 0 deletions config/prometheus/operator_controller_service_monitor.yaml
@@ -0,0 +1,33 @@
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: operator-controller-controller-manager-metrics-monitor
+  namespace: system
+spec:
+  endpoints:
+    - path: /metrics
+      interval: 10s
+      port: https
+      scheme: https
+      authorization:
+        credentials:
+          name: prometheus-metrics-token
+          key: token
+      tlsConfig:
+        # NAMESPACE_PLACEHOLDER replaced by replacements in kustomization.yaml
+        serverName: operator-controller-service.NAMESPACE_PLACEHOLDER.svc
+        insecureSkipVerify: false
+        ca:
+          secret:
+            name: olmv1-cert
+            key: ca.crt
+        cert:
+          secret:
+            name: olmv1-cert
+            key: tls.crt
+        keySecret:
+          name: olmv1-cert
+          key: tls.key
+  selector:
+    matchLabels:
+      control-plane: operator-controller-controller-manager
18 changes: 18 additions & 0 deletions config/prometheus/prometheus.yaml
@@ -0,0 +1,18 @@
+apiVersion: monitoring.coreos.com/v1
+kind: Prometheus
+metadata:
+  name: prometheus
+  namespace: system
+spec:
+  logLevel: debug
+  serviceAccountName: prometheus
+  scrapeTimeout: 30s
+  scrapeInterval: 1m
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 65534
+    seccompProfile:
+      type: RuntimeDefault
+  ruleSelector: {}
+  serviceDiscoveryRole: EndpointSlice
+  serviceMonitorSelector: {}
59 changes: 59 additions & 0 deletions config/prometheus/prometheus_rule.yaml
@@ -0,0 +1,59 @@
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: controller-alerts
+  namespace: system
+spec:
+  groups:
+    - name: controller-panic
+      rules:
+        - alert: reconciler-panic
+          expr: controller_runtime_reconcile_panics_total{} > 0
+          annotations:
+            description: "controller of pod {{ $labels.pod }} experienced panic(s); count={{ $value }}"
+        - alert: webhook-panic
+          expr: controller_runtime_webhook_panics_total{} > 0
+          annotations:
+            description: "controller webhook of pod {{ $labels.pod }} experienced panic(s); count={{ $value }}"
+    - name: resource-usage
+      rules:
+        - alert: oom-events
+          expr: container_oom_events_total > 0
+          annotations:
+            description: "container {{ $labels.container }} of pod {{ $labels.pod }} experienced OOM event(s); count={{ $value }}"
+        - alert: operator-controller-memory-growth
+          expr: deriv(sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"})[5m:]) > 50_000
Contributor

@camilamacedo86, Jul 11, 2025

@dtfranz so we are manually defining the thresholds here?
Could we document how this works in https://github.com/operator-framework/operator-controller/blob/main/docs/contribute/developer.md? WDYT?
Not a blocker for this PR, for sure.

+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "operator-controller pod memory usage growing at a high rate for 5 minutes: {{ $value | humanize }}B/sec"
+        - alert: catalogd-memory-growth
+          expr: deriv(sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"})[5m:]) > 50_000
+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "catalogd pod memory usage growing at a high rate for 5 minutes: {{ $value | humanize }}B/sec"
+        - alert: operator-controller-memory-usage
+          expr: sum(container_memory_working_set_bytes{pod=~"operator-controller.*",container="manager"}) > 100_000_000
+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "operator-controller pod using high memory resources for the last 5 minutes: {{ $value | humanize }}B"
+        - alert: catalogd-memory-usage
+          expr: sum(container_memory_working_set_bytes{pod=~"catalogd.*",container="manager"}) > 75_000_000
+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "catalogd pod using high memory resources for the last 5 minutes: {{ $value | humanize }}B"
+        - alert: operator-controller-cpu-usage
+          expr: rate(container_cpu_usage_seconds_total{pod=~"operator-controller.*",container="manager"}[5m]) * 100 > 20
+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "operator-controller using high cpu resources for 5 minutes: {{ $value | printf \"%.2f\" }}%"
+        - alert: catalogd-cpu-usage
+          expr: rate(container_cpu_usage_seconds_total{pod=~"catalogd.*",container="manager"}[5m]) * 100 > 20
+          for: 5m
+          keep_firing_for: 1d
+          annotations:
+            description: "catalogd using high cpu resources for 5 minutes: {{ $value | printf \"%.2f\" }}%"
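To give a feel for the hard-coded thresholds in these rules, the CPU expression can be approximated by hand: `rate(...[5m]) * 100` is roughly the CPU-seconds consumed over the window, divided by the window length, expressed as a percentage of one core. This is a minimal sketch under that simplification (Prometheus' `rate()` also handles counter resets and extrapolation). The sample values and the `cpu_percent` and `exceeds` helpers are hypothetical.

```shell
#!/usr/bin/env bash
# Approximate rate(counter[window]) * 100 from the first and last sample
# in the window: (last - first) / window_seconds * 100.
cpu_percent() {
  local first="$1" last="$2" window_seconds="$3"
  awk -v a="$first" -v b="$last" -v w="$window_seconds" \
    'BEGIN { printf "%.2f", (b - a) / w * 100 }'
}

# Exit status 0 when the value is strictly above the threshold, 1 otherwise.
exceeds() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Hypothetical samples: 75 CPU-seconds consumed across a 300-second window.
usage="$(cpu_percent 1200 1275 300)"
if exceeds "$usage" 20; then
  echo "cpu-usage condition met: ${usage}% > 20%"
else
  echo "cpu-usage condition not met: ${usage}%"
fi
```

By the same arithmetic, the memory-growth threshold of `50_000` bytes per second, sustained for the 5-minute `for` window, corresponds to about 15 MB of growth (50,000 × 300 s).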
4 changes: 4 additions & 0 deletions config/prometheus/rbac/kustomization.yaml
@@ -0,0 +1,4 @@
+resources:
+  - prometheus_service_account.yaml
+  - prometheus_cluster_role.yaml
+  - prometheus_cluster_rolebinding.yaml
29 changes: 29 additions & 0 deletions config/prometheus/rbac/prometheus_cluster_role.yaml
@@ -0,0 +1,29 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: prometheus
+rules:
+  - apiGroups: [""]
+    resources:
+      - nodes
+      - nodes/metrics
+      - services
+      - endpoints
+      - pods
+    verbs: ["get", "list", "watch"]
+  - apiGroups: [""]
+    resources:
+      - configmaps
+    verbs: ["get"]
+  - apiGroups:
+      - discovery.k8s.io
+    resources:
+      - endpointslices
+    verbs: ["get", "list", "watch"]
+  - apiGroups:
+      - networking.k8s.io
+    resources:
+      - ingresses
+    verbs: ["get", "list", "watch"]
+  - nonResourceURLs: ["/metrics"]
+    verbs: ["get"]
12 changes: 12 additions & 0 deletions config/prometheus/rbac/prometheus_cluster_rolebinding.yaml
@@ -0,0 +1,12 @@
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: prometheus
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: prometheus
+subjects:
+  - kind: ServiceAccount
+    name: prometheus
+    namespace: system
5 changes: 5 additions & 0 deletions config/prometheus/rbac/prometheus_service_account.yaml
@@ -0,0 +1,5 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: prometheus
+  namespace: system
15 changes: 15 additions & 0 deletions config/prometheus/service.yaml
@@ -0,0 +1,15 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: prometheus-service
+  namespace: system
+spec:
+  type: NodePort
+  ports:
+    - name: web
+      nodePort: 30900
+      port: 9090
+      protocol: TCP
+      targetPort: web
+  selector:
+    prometheus: prometheus