---
date: 2025-12-01
title: Getting Started with GPU Compute on Rackspace OpenStack Flex
authors:
  - cloudnull
description: >
  Getting Started with GPU Compute on Rackspace OpenStack Flex
categories:
  - OpenStack
  - GPU
  - NVIDIA
  - Compute
  - Instance
  - Setup

---

# Getting Started with GPU Compute on Rackspace OpenStack Flex

**Your instance is up, your GPU is attached, and now you're staring at a blank terminal wondering what's next. Time to get Clouding.**

Rackspace OpenStack Flex delivers GPU-enabled compute instances with real hardware acceleration, not the "GPU-adjacent" experience you might get elsewhere. Whether you're running inference workloads, training models, or just need raw parallel compute power, getting your instance properly configured is the difference between frustration and actual productivity.

<!-- more -->

This guide walks you through the complete setup process: from a fresh Ubuntu 24.04 instance to a fully optimized NVIDIA GPU environment ready for production workloads.

## Prerequisites

Before diving in, make sure you have:

- A Rackspace OpenStack Flex instance with an attached GPU (H100, A30, or similar)
- Ubuntu 24.04 LTS as your base image
- SSH access to your instance
- Root or sudo privileges

If you're provisioning a new instance, select a GPU-enabled flavor from the OpenStack Flex catalog. The platform's SR-IOV architecture provides direct hardware access to your GPU resources, delivering near-native performance without the virtualization overhead that plagues other cloud GPU offerings.

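If you prefer to provision from the OpenStack CLI, the sketch below shows the shape of the command; the flavor, image, network, and keypair names are placeholders, so substitute the GPU flavor name from your Flex catalog.

```bash
# Placeholder names: substitute the GPU flavor, image, network, and keypair from your environment
openstack server create \
  --flavor gpu.h100.example \
  --image "Ubuntu 24.04" \
  --network my-network \
  --key-name my-keypair \
  gpu-workbench-01
```
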
## Step 1: Update Your System

First things first: get your system current. This isn't just good hygiene; NVIDIA drivers have specific kernel dependencies, and starting with an outdated kernel is a recipe for cryptic error messages at 2 AM.

SSH into your instance and run:

```bash
sudo apt update
sudo apt upgrade -y
```

## Step 2: Configure GRUB for PCI Passthrough

This is the step that separates the tutorials that work from the ones that leave you debugging kernel panics. When GPUs are passed through to VMs via SR-IOV or PCI passthrough, the kernel's default PCI resource allocation can conflict with how the hypervisor has already mapped the device.

The fix is telling the kernel to be more flexible about PCI resource assignment. Edit your GRUB configuration to add the following parameters:

```bash
sudo sed -i 's/^GRUB_CMDLINE_LINUX="\(.*\)"/GRUB_CMDLINE_LINUX="\1 pci=realloc,assign-busses,nocrs"/' /etc/default/grub
```

Let's break down what these parameters do:

- **pci=realloc**: Allows the kernel to reallocate PCI resources if the BIOS allocation doesn't match what's needed
- **assign-busses**: Forces the kernel to assign PCI bus numbers rather than trusting BIOS assignments
- **nocrs**: Ignores PCI _CRS (Current Resource Settings) from ACPI, preventing conflicts with hypervisor mappings

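A quick way to confirm the sed edit landed before touching the bootloader is to print the modified line:

```bash
grep '^GRUB_CMDLINE_LINUX=' /etc/default/grub
# Expect something like: GRUB_CMDLINE_LINUX="... pci=realloc,assign-busses,nocrs"
```
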
Now regenerate your GRUB configuration and initramfs:

```bash
sudo update-grub2
sudo update-initramfs -u
```

Reboot to load the new kernel parameters:

```bash
sudo reboot
```

## Step 3: Install NVIDIA Drivers

Wait a minute or two for the instance to come back up, then SSH back in. You can verify your kernel version with `uname -r` to confirm the upgrade took effect.

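For example, you can confirm the new kernel and the Step 2 parameters in one go:

```bash
uname -r                              # running kernel version
grep -o 'pci=[^ ]*' /proc/cmdline     # should print pci=realloc,assign-busses,nocrs
```
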
Here's where things get real. NVIDIA provides drivers specifically optimized for datacenter GPUs like the H100 and A30. We're using the 580.x branch, which is the current production release for these cards. However, always check [NVIDIA's official driver page](https://www.nvidia.com/Download/index.aspx) for the latest recommendations.

Download the driver package:

```bash
wget https://us.download.nvidia.com/tesla/580.105.08/nvidia-driver-local-repo-ubuntu2404-580.105.08_1.0-1_amd64.deb
```

Install the local repository package:

```bash
sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-580.105.08_1.0-1_amd64.deb
```

Copy the GPG keyring to your system's trusted keys:

```bash
sudo cp /var/nvidia-driver-local-repo-ubuntu2404-580.105.08/nvidia-driver-local-207F658F-keyring.gpg /usr/share/keyrings/
```

Update your package lists and install the CUDA drivers:

```bash
sudo apt update
sudo apt install -y cuda-drivers-580
```

!!! note "Alternative: Open-Source NVIDIA Drivers"

    If you prefer the open-source kernel modules (which NVIDIA now provides for datacenter GPUs), you can install those instead:

    ```bash
    sudo apt install -y nvidia-open-580
    ```

    The open drivers provide identical functionality for compute workloads and are fully supported on H100/A30 hardware. The choice between proprietary and open modules is largely philosophical at this point; both work.

Once again, reboot to load the NVIDIA drivers:

```bash
sudo reboot
```

## Step 4: Verify Your Installation

After the reboot, SSH back in and run the moment of truth:

```bash
sudo nvidia-smi
```

You should see output showing your GPU(s), their temperature, memory usage, and driver version. Something like this:

```shell
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08             Driver Version: 580.105.08     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 NVL                On  |   00000000:06:00.0 Off |                  Off |
| N/A   33C    P0             64W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 NVL                On  |   00000000:07:00.0 Off |                  Off |
| N/A   32C    P0             62W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 NVL                On  |   00000000:08:00.0 Off |                  Off |
| N/A   35C    P0             65W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 NVL                On  |   00000000:09:00.0 Off |                  Off |
| N/A   32C    P0             61W /  400W |       0MiB /  95830MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                                    Usage |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

If you see your GPU listed with valid stats, congratulations, you have working GPU compute. If you see errors about the driver not being loaded, double-check that your GRUB parameters were applied correctly with `cat /proc/cmdline`.

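A few quick diagnostics cover the usual failure modes: missing kernel parameters, a lingering nouveau module, or driver load errors in the kernel log:

```bash
cat /proc/cmdline                          # kernel parameters actually in effect
lsmod | grep -e nvidia -e nouveau          # which GPU kernel modules are loaded
sudo dmesg | grep -i -e nvrm -e nvidia     # driver initialization messages and errors
```
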
## Step 5: Optimize for Production Workloads

Now that your GPU is working, let's tune it for actual compute work. These optimizations reduce latency, improve throughput, and ensure consistent performance under load.

### Enable Persistence Mode

By default, the NVIDIA driver unloads when no applications are using the GPU. This saves power but adds several seconds of initialization latency when you start a new workload. Persistence mode keeps the driver loaded:

```bash
sudo nvidia-smi -pm 1
```

### Set Compute-Exclusive Mode

If you're running a single application that needs full GPU access (most ML training and inference scenarios), exclusive process mode ensures no other processes can claim GPU resources:

```bash
sudo nvidia-smi -c EXCLUSIVE_PROCESS
```

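To confirm both of the settings above took effect, `nvidia-smi` can report them directly using its standard query fields:

```bash
# Each GPU should report something like: Enabled, Exclusive_Process
nvidia-smi --query-gpu=persistence_mode,compute_mode --format=csv
```
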
### Lock GPU Clocks

NVIDIA GPUs dynamically scale their clock frequencies based on load, temperature, and power limits. For benchmarking or latency-sensitive workloads, you may want consistent performance. Lock the graphics clock to maximum:

```bash
sudo nvidia-smi -lgc 1980,1980 # H100 max graphics clock
```

And lock the memory clock:

```bash
sudo nvidia-smi -lmc 2619,2619 # H100 HBM3 max
```

!!! note

    These values are for H100 GPUs. Check your GPU's supported clocks with `nvidia-smi -q -d SUPPORTED_CLOCKS`.

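If you later want the GPU to manage its own frequencies again, the locks can be released with the matching reset flags:

```bash
sudo nvidia-smi -rgc   # reset (unlock) the graphics clocks
sudo nvidia-smi -rmc   # reset (unlock) the memory clocks
```
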
### Optional: Disable ECC Memory

H100 and A30 GPUs have Error Correcting Code (ECC) memory enabled by default, which protects against bit flips but reduces available memory bandwidth by 5-10%. For workloads where you need maximum throughput and can tolerate occasional errors (training jobs with checkpointing, inference where you can retry failures), you can disable ECC:

```bash
sudo nvidia-smi -e 0
```

**Warning:** This requires a reboot to take effect, and you should carefully evaluate your data integrity requirements before disabling ECC. For financial, scientific, or medical applications, keep ECC enabled.

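To check the current and pending ECC state, or to turn it back on later:

```bash
nvidia-smi --query-gpu=ecc.mode.current,ecc.mode.pending --format=csv
sudo nvidia-smi -e 1   # re-enable ECC; like disabling, this takes effect after a reboot
```
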
## Making Settings Persistent

The `nvidia-smi` settings above won't survive a reboot. To make them persistent, create a systemd service:

```bash
sudo tee /etc/systemd/system/nvidia-persistenced.service << 'EOF'
[Unit]
Description=NVIDIA Persistence Daemon
After=network.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user root
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target
EOF
```

Enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable nvidia-persistenced
sudo systemctl start nvidia-persistenced
```

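A quick check confirms the daemon came up and will start on future boots:

```bash
systemctl is-enabled nvidia-persistenced   # should print "enabled"
systemctl is-active nvidia-persistenced    # should print "active"
```
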
For the clock settings and compute mode, create a systemd service that runs after the persistence daemon:

```bash
sudo tee /etc/systemd/system/nvidia-gpu-optimize.service << 'EOF'
[Unit]
Description=NVIDIA GPU Optimization Settings
After=nvidia-persistenced.service
Requires=nvidia-persistenced.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/nvidia-smi -c EXCLUSIVE_PROCESS
ExecStart=/usr/bin/nvidia-smi -lgc 1980,1980
ExecStart=/usr/bin/nvidia-smi -lmc 2619,2619

[Install]
WantedBy=multi-user.target
EOF
```

Enable the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable nvidia-gpu-optimize
```

The service will apply your optimization settings automatically on every boot, after the persistence daemon has initialized the GPU.

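After the next reboot, you can verify both the unit and the resulting clock locks (CLOCK is a standard `nvidia-smi -q` display section):

```bash
systemctl status nvidia-gpu-optimize --no-pager   # oneshot unit should show active (exited)
nvidia-smi -q -d CLOCK                            # confirm graphics and memory clocks are pinned
```
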
## What's Next?

With your GPU environment configured, you're ready to install your actual workload tooling:

- **CUDA Toolkit**: For compiling CUDA applications (`sudo apt install cuda-toolkit-12-8`)
- **cuDNN**: For deep learning frameworks, available from NVIDIA's developer portal
- **PyTorch/TensorFlow**: Most ML frameworks auto-detect and use CUDA when available
- **Container runtimes**: NVIDIA Container Toolkit for Docker/Podman GPU passthrough (see the sketch below)

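As a rough sketch of the container route, assuming the `nvidia-container-toolkit` package is already installed and Docker is running; the CUDA image tag is only an example, so pick one that matches your driver's CUDA version:

```bash
sudo nvidia-ctk runtime configure --runtime=docker   # register the NVIDIA runtime with Docker
sudo systemctl restart docker

# Example image tag; any CUDA image at or below your driver's CUDA version works
docker run --rm --gpus all nvidia/cuda:12.8.0-base-ubuntu24.04 nvidia-smi
```
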
The OpenStack Flex platform's accelerator-optimized architecture means your GPU workloads get direct hardware access without the overhead you'd experience on hyperscaler clouds. Combined with Rackspace's flexible pricing, you can run GPU compute at a fraction of public cloud costs while maintaining the control and performance characteristics of dedicated infrastructure.

---

*Running into issues? The most common problems are GRUB parameters not being applied (check `/proc/cmdline`), kernel module conflicts with nouveau (blacklist it in `/etc/modprobe.d/`), or mismatched driver versions. The NVIDIA forums and Rackspace support are both excellent resources when you hit the weird edge cases.*

For more information about GPU-enabled compute on Rackspace OpenStack Flex, see the [Genestack documentation](https://docs.rackspacecloud.com/) or contact your Rackspace account team. [Sign-up](https://cart.rackspace.com/en/cloud/openstack-flex/account) for Rackspace OpenStack today.
