
Commit 6b1796b

do nova
1 parent 4c368dd commit 6b1796b

File tree

7 files changed: +311 / -296 lines


helm/README.md

Lines changed: 6 additions & 0 deletions
@@ -32,3 +32,9 @@ Helm charts are organized into three main directories:
 ├── cortex-prometheus-operator
 └── ...
 ```
+
+## Versioning
+
+We use [semantic versioning](https://semver.org/) for our Helm charts.
+Each chart has its own `Chart.yaml` file that specifies the version of the chart and its dependencies.
+

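For illustration, a chart under this scheme might declare its version and dependency constraints roughly as follows. This is a minimal sketch: the chart name, version numbers, and repository URL are hypothetical and not taken from this commit.

```yaml
# Hypothetical Chart.yaml sketch -- names, versions, and repository are illustrative only.
apiVersion: v2
name: cortex-nova
description: Example chart versioned according to semver
version: 1.2.3            # chart version (MAJOR.MINOR.PATCH)
appVersion: "1.2.3"       # version of the packaged application
dependencies:
  - name: cortex-prometheus-operator
    version: ">=2.0.0 <3.0.0"       # semver range constraint on the dependency
    repository: "oci://registry.example.com/charts"
```
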
helm/bundles/cortex-manila/prometheus-rules/manila.alerts

Lines changed: 3 additions & 3 deletions
@@ -1,7 +1,7 @@
 groups:
 - name: cortex-manila-alerts
   rules:
-  - alert: CortexManilaSchedulerDown
+  - alert: CortexManilaInitialPlacementDown
     expr: |
       up{component="cortex-manila-scheduler"} != 1 or
       absent(up{component="cortex-manila-scheduler"})
@@ -13,9 +13,9 @@
       severity: warning
       support_group: workload-management
     annotations:
-      summary: "Cortex external scheduler for Manila is down"
+      summary: "Cortex initial placement for Manila is down"
       description: >
-        The Cortex scheduler is down. Initial placement requests from Manila will
+        The Cortex initial placement service is down. Initial placement requests from Manila will
         not be served. This is no immediate problem, since Manila will continue
         placing new shares. However, the placement will be less desirable.

helm/bundles/cortex-manila/prometheus-rules/mqtt.alerts

Lines changed: 8 additions & 0 deletions
@@ -6,6 +6,10 @@
     for: 1m
     labels:
       context: mqtt
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
     annotations:
       summary: "Cortex is trying to connect to MQTT too often"
       description: >
@@ -18,6 +22,10 @@
     for: 5m
     labels:
       context: db
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
     annotations:
       summary: "Cortex is trying to connect to the database too often"
       description: >
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+groups:
+- name: cortex-nova-alerts
+  rules:
+  - alert: CortexNovaDeschedulerPipelineErroring
+    expr: delta(cortex_descheduler_pipeline_vm_descheduling_duration_seconds_count{component=~"cortex-nova-.*",error="true"}[2m]) > 0
+    for: 5m
+    labels:
+      context: descheduler
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Descheduler pipeline is erroring."
+      description: >
+        The Cortex descheduler pipeline is encountering errors during its execution.
+        This may indicate issues with the descheduling logic or the underlying infrastructure.
+        It is recommended to investigate the descheduler logs and the state of the VMs being processed.
Lines changed: 266 additions & 0 deletions
@@ -0,0 +1,266 @@
+groups:
+- name: cortex-nova-alerts
+  rules:
+  - alert: CortexNovaInitialPlacementDown
+    expr: |
+      up{component="cortex-nova-scheduler"} != 1 or
+      absent(up{component="cortex-nova-scheduler"})
+    for: 1m
+    labels:
+      context: liveness
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Cortex initial placement for Nova is down"
+      description: >
+        The Cortex initial placement service is down. Initial placement requests from Nova will
+        not be served. This is no immediate problem, since Nova will continue
+        placing new VMs. However, the placement will be less desirable.
+
+  - alert: CortexNovaSyncerDown
+    expr: |
+      up{component="cortex-nova-syncer"} != 1 or
+      absent(up{component="cortex-nova-syncer"})
+    for: 1m
+    labels:
+      context: liveness
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Cortex syncer is down"
+      description: >
+        The Cortex syncer is down. Cortex requires somewhat recent data from
+        its datasources (OpenStack, Prometheus, etc.) to make accurate
+        scheduling decisions. If this issue persists for a longer time, the
+        database will slowly drift away from the actual state of the
+        datacenter, which may lead to less desirable placement decisions.
+        This is no immediate problem, since Nova will continue placing new VMs.
+
+  - alert: CortexNovaExtractorDown
+    expr: |
+      up{component="cortex-nova-extractor"} != 1 or
+      absent(up{component="cortex-nova-extractor"})
+    for: 1m
+    labels:
+      context: liveness
+      dashboard: cortex/cortex
+      service: cortex
+      severity: warning
+      support_group: workload-management
+    annotations:
+      summary: "Cortex extractor is down"
+      description: >
+        The Cortex extractor is down. This means that newly available data
+        about the datacenter will not be used to extract scheduling knowledge.
+        This is no immediate problem, since Nova will continue placing new VMs.
+        However, the placement will be less desirable.
+
+  - alert: CortexNovaHttpRequest400sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{component="cortex-nova-scheduler",status=~"4.+"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "HTTP request 400 errors too high"
+      description: >
+        Cortex is responding to Nova initial placement requests with HTTP 4xx
+        errors. This is expected when the scheduling request cannot be served
+        by Cortex. However, it could also indicate that the Nova request
+        format has changed and Cortex is unable to parse it.
+
+  - alert: CortexNovaHttpRequest500sTooHigh
+    expr: rate(cortex_scheduler_api_request_duration_seconds_count{component="cortex-nova-scheduler",status=~"5.+"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: api
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "HTTP request 500 errors too high"
+      description: >
+        Cortex is responding to Nova initial placement requests with HTTP 5xx
+        errors. This is not expected and indicates that Cortex is having some
+        internal problem. Nova will continue to place new VMs, but the
+        placement will be less desirable. Thus, no immediate action is needed.
+
+  - alert: CortexNovaHighMemoryUsage
+    expr: process_resident_memory_bytes{component=~"cortex-nova-.*"} > 1000 * 1024 * 1024
+    for: 5m
+    labels:
+      context: memory
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex {{`{{$labels.component}}`}} uses too much memory"
+      description: >
+        Cortex should not be using more than 1000 MiB of memory. Usually it
+        should use much less, so there may be a memory leak or other changes
+        that are causing the memory usage to increase significantly.
+
+  - alert: CortexNovaHighCPUUsage
+    expr: rate(process_cpu_seconds_total{component=~"cortex-nova-.*"}[1m]) > 0.5
+    for: 5m
+    labels:
+      context: cpu
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex {{`{{$labels.component}}`}} uses too much CPU"
+      description: >
+        Cortex should not be using more than 50% of a single CPU core. Usually
+        it should use much less, so there may be a CPU leak or other changes
+        that are causing the CPU usage to increase significantly.
+
+  - alert: CortexNovaSyncNotSuccessful
+    expr: cortex_sync_request_processed_total{component=~"cortex-nova-.*"} - cortex_sync_request_duration_seconds_count{component=~"cortex-nova-.*"} > 0
+    for: 5m
+    labels:
+      context: syncstatus
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Sync not successful"
+      description: >
+        Cortex experienced an issue syncing data from a datasource. This may
+        happen when the datasource (OpenStack, Prometheus, etc.) is down or
+        the sync module is misconfigured. No immediate action is needed, since
+        the sync module will retry the sync operation and the currently synced
+        data will be kept. However, when this problem persists for a longer
+        time the service will have a less recent view of the datacenter.
+
+  - alert: CortexNovaSyncObjectsDroppedToZero
+    expr: cortex_sync_objects{component=~"cortex-nova-.*"} == 0
+    for: 5m
+    labels:
+      context: syncobjects
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex is not syncing any new data from {{`{{$labels.datasource}}`}}"
+      description: >
+        Cortex is not syncing any objects from a datasource. This may happen
+        when the datasource (OpenStack, Prometheus, etc.) is down or the sync
+        module is misconfigured. No immediate action is needed, since the sync
+        module will retry the sync operation and the currently synced data will
+        be kept. However, when this problem persists for a longer time the
+        service will have a less recent view of the datacenter.
+
+  - alert: CortexNovaSyncObjectsTooHigh
+    expr: cortex_sync_objects{component=~"cortex-nova-.*"} > 1000000
+    for: 5m
+    labels:
+      context: syncobjects
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex is syncing unexpectedly many objects from {{`{{$labels.datasource}}`}}"
+      description: >
+        Cortex is syncing more than 1 million objects from a datasource. This
+        may happen when the datasource (OpenStack, Prometheus, etc.) returns
+        unexpectedly many objects, or when the database cannot drop old objects.
+        No immediate action is needed, but should this condition persist for a
+        longer time, the database may fill up and crash.
+
+  - alert: CortexNovaTooManyMQTTConnectionAttempts
+    expr: rate(cortex_mqtt_connection_attempts_total{component=~"cortex-nova-.*"}[5m]) > 0.1
+    for: 1m
+    labels:
+      context: mqtt
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex is trying to connect to MQTT too often"
+      description: >
+        Cortex is trying to connect to the MQTT broker too often. This may
+        happen when the broker is down or the connection parameters are
+        misconfigured.
+
+  - alert: CortexNovaTooManyDBConnectionAttempts
+    expr: rate(cortex_db_connection_attempts_total{component=~"cortex-nova-.*"}[5m]) > 0.1
+    for: 5m
+    labels:
+      context: db
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Cortex is trying to connect to the database too often"
+      description: >
+        Cortex is trying to connect to the database too often. This may happen
+        when the database is down or the connection parameters are misconfigured.
+
+  - alert: CortexNovaHostCPUUtilizationAbove100Percent
+    expr: cortex_host_utilization_per_host_pct{component=~"cortex-nova-.*",resource="cpu"} > 100
+    for: 5m
+    labels:
+      context: hostutilization
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "CPU utilization on host {{`{{$labels.compute_host_name}}`}} is above 100%"
+      description: >
+        OpenStack Placement reports CPU utilization above 100% for host {{`{{$labels.compute_host_name}}`}} in AZ {{`{{$labels.availability_zone}}`}} for over 5 minutes.
+        This can happen if there are VMs in the SHUTOFF state: these VMs still consume resources in Placement, but not in the underlying infrastructure (e.g., VMware). As a result, it is possible to manually migrate additional VMs onto a host with shut off VMs. The combined resource allocation (from running and shut off VMs) can then exceed the host's capacity, causing Placement to report utilization above 100%. This is expected behavior, as powering on the shut off VMs would overcommit the host.
+        Another cause may be shutting down a node without migrating its VMs. The total capacity drops, but Placement still accounts for the shut off VMs’ resource usage.
+        This situation should be investigated and resolved to ensure accurate resource accounting and avoid operational issues.
+
+
+  - alert: CortexNovaHostMemoryUtilizationAbove100Percent
+    expr: cortex_host_utilization_per_host_pct{component=~"cortex-nova-.*",resource="memory"} > 100
+    for: 5m
+    labels:
+      context: hostutilization
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Memory utilization on host {{`{{$labels.compute_host_name}}`}} is above 100%"
+      description: >
+        OpenStack Placement reports Memory utilization above 100% for host {{`{{$labels.compute_host_name}}`}} in AZ {{`{{$labels.availability_zone}}`}} for over 5 minutes.
+        This can happen if there are VMs in the SHUTOFF state: these VMs still consume resources in Placement, but not in the underlying infrastructure (e.g., VMware). As a result, it is possible to manually migrate additional VMs onto a host with shut off VMs. The combined resource allocation (from running and shut off VMs) can then exceed the host's capacity, causing Placement to report utilization above 100%. This is expected behavior, as powering on the shut off VMs would overcommit the host.
+        Another cause may be shutting down a node without migrating its VMs. The total capacity drops, but Placement still accounts for the shut off VMs’ resource usage.
+        This situation should be investigated and resolved to ensure accurate resource accounting and avoid operational issues.
+
+  - alert: CortexNovaHostDiskUtilizationAbove100Percent
+    expr: cortex_host_utilization_per_host_pct{component=~"cortex-nova-.*",resource="disk"} > 100
+    for: 5m
+    labels:
+      context: hostutilization
+      dashboard: cortex/cortex
+      service: cortex
+      severity: info
+      support_group: workload-management
+    annotations:
+      summary: "Disk utilization on host {{`{{$labels.compute_host_name}}`}} is above 100%."
+      description: >
+        OpenStack Placement reports Disk utilization above 100% for host {{`{{$labels.compute_host_name}}`}} in AZ {{`{{$labels.availability_zone}}`}} for over 5 minutes.
+        This can happen if there are VMs in the SHUTOFF state: these VMs still consume resources in Placement, but not in the underlying infrastructure (e.g., VMware). As a result, it is possible to manually migrate additional VMs onto a host with shut off VMs. The combined resource allocation (from running and shut off VMs) can then exceed the host's capacity, causing Placement to report utilization above 100%. This is expected behavior, as powering on the shut off VMs would overcommit the host.
+        Another cause may be shutting down a node without migrating its VMs. The total capacity drops, but Placement still accounts for the shut off VMs’ resource usage.
+        This situation should be investigated and resolved to ensure accurate resource accounting and avoid operational issues.
+

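The new rule files can be validated before rollout with Prometheus' promtool (`promtool check rules` for syntax, `promtool test rules` for behavior). Below is a minimal, hypothetical unit-test sketch for the CortexNovaInitialPlacementDown expression; the rule file path and the sample values are illustrative and not part of this commit.

```yaml
# Hypothetical promtool unit test -- run with: promtool test rules <this file>.
# The rule_files path and the input series values are illustrative only.
rule_files:
  - nova.alerts
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The scheduler target is scraped but reports down for the whole window.
      - series: 'up{component="cortex-nova-scheduler"}'
        values: '0 0 0 0 0'
    promql_expr_test:
      # The alert expression should match the down target at evaluation time.
      - expr: up{component="cortex-nova-scheduler"} != 1 or absent(up{component="cortex-nova-scheduler"})
        eval_time: 2m
        exp_samples:
          - labels: 'up{component="cortex-nova-scheduler"}'
            value: 0
```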