This is a manual for system administrators.
1. First-time Setup of Cluster Nodes
4. Setup Supplementary Services
Notes on Supplementary Services
Our cluster is located in the core server room, E6-106.
We have been assigned the IP address range 10.0.1.64/27, with gateway 10.0.1.65 and address pool 10.0.1.66-94.
Our private 100GbE network uses the IP address range 192.168.233.0/24.
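The gateway and pool boundaries follow from the /27 prefix, which spans 32 addresses. As a minimal sketch (the function names `ip_to_int`, `int_to_ip`, and `cidr_layout` are illustrative and not part of any cluster tooling), the layout can be derived with plain shell arithmetic:

```shell
#!/usr/bin/env bash
# Sketch: derive the address layout of an IPv4 CIDR block.
# All function names here are hypothetical helpers for illustration.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Convert a 32-bit integer back to dotted-quad notation.
int_to_ip() {
  local n="$1"
  echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# Print network, broadcast, first/last usable host, and usable host count.
cidr_layout() {
  local base="${1%/*}"
  local prefix="${1#*/}"
  local size=$(( 1 << (32 - prefix) ))
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local network=$(( $(ip_to_int "$base") & mask ))
  local broadcast=$(( network + size - 1 ))
  echo "network:    $(int_to_ip "$network")"
  echo "broadcast:  $(int_to_ip "$broadcast")"
  echo "first host: $(int_to_ip $(( network + 1 )))"
  echo "last host:  $(int_to_ip $(( broadcast - 1 )))"
  echo "usable:     $(( size - 2 ))"
}

cidr_layout 10.0.1.64/27
```

For 10.0.1.64/27 this reports 10.0.1.65 as the first host (our gateway) and 10.0.1.94 as the last host (the end of our pool), with 10.0.1.95 reserved as the broadcast address.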
Current IP assignment:
| IP address (assigned range / private 100GbE) | Host | 
|---|---|
| 10.0.1.66 / 192.168.233.6 | core node on management server | 
| 10.0.1.67 / 192.168.233.10 | login node on management server | 
| 10.0.1.68 / 192.168.233.8 | supplementary service node on management server | 
| 10.0.1.69 | Peng's Synology NAS | 
| 10.0.1.70 / 192.168.233.233 | TrueNAS on storage server | 
| 10.0.1.71 / 192.168.233.11 | GPU Node 01 | 
| 10.0.1.72 / 192.168.233.12 | GPU Node 02 | 
| 10.0.1.73 / 192.168.233.13 | GPU Node 03 | 
| 10.0.1.74 / 192.168.233.14 | GPU Node 04 | 
| 10.0.1.75 / 192.168.233.15 | GPU Node 05 | 
| 10.0.1.76 / 192.168.233.16 | GPU Node 06 | 
| 10.0.1.77 / 192.168.233.17 | GPU Node 07 | 
| 10.0.1.78 / 192.168.233.18 | GPU Node 08 | 
| 10.0.1.79 / 192.168.233.2 | iKuai on management server | 
| 10.0.1.80 | BMC of storage server | 
| 10.0.1.81 | BMC of GPU Node 01 | 
| 10.0.1.82 | BMC of GPU Node 02 | 
| 10.0.1.83 | BMC of GPU Node 03 | 
| 10.0.1.84 | BMC of GPU Node 04 | 
| 10.0.1.85 | BMC of GPU Node 05 | 
| 10.0.1.86 | BMC of GPU Node 06 | 
| 10.0.1.87 | BMC of GPU Node 07 | 
| 10.0.1.88 | BMC of GPU Node 08 | 
| 10.0.1.89 / 192.168.233.101 | TrueNAS on storage server (2U, SSD-only) | 
| 10.0.1.90 | BMC of storage server (2U, SSD-only) | 
| 10.0.1.91 | NVIDIA Mellanox Spectrum SN2700 100GbE Switch MGMT | 
| 10.0.1.93 | ESXi on management server | 
| 10.0.1.94 | BMC of management server | 
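Before assigning an address to a new device, it can help to check which pool addresses are still free. The sketch below is not an existing cluster script: the `assigned` list is transcribed by hand from the table above (last octets only), and the function names are illustrative. It must be updated whenever the table changes.

```shell
#!/usr/bin/env bash
# Sketch: find unassigned and duplicated addresses in the 10.0.1.66-94 pool.
# "assigned" holds the last octets copied from the IP assignment table.
assigned="66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 93 94"

# Print every pool address that does not appear in the assigned list.
free_addresses() {
  local host
  for host in $(seq 66 94); do
    case " $assigned " in
      *" $host "*) ;;                 # already assigned, skip
      *) echo "10.0.1.$host" ;;       # not in the table, so free
    esac
  done
}

# Print any last octet that appears more than once (should print nothing).
duplicate_addresses() {
  printf '%s\n' $assigned | sort -n | uniq -d
}

free_addresses
duplicate_addresses
```

With the table as it stands, the only free address reported is 10.0.1.92, and no duplicates are found.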
System Topology:
┌───────────────────────────────────┐ ┌──────────────────────────────────┐
│             Login Node            │ │        NGINX Reverse Proxy       │
└─────────────┬─────────────────────┘ └────────┬────────┬────────────────┘
              │                                │        │
            Access      ┌────────Access────────┘      Access
              │         │                               │
┌─────────────▼─────────▼───────────┐ ┌─────────────────▼─────────────────┐
│     Determined AI GPU Cluster     │ │      Supplementary Services       │
├───────────────────────────────────┤ ├───────────────────────────────────┤
│                                   │ │                                   │
│ ┌──────┐ ┌────┐ ┌────┐ ┌────┐     │ │  ┌──────┐ ┌───────┐ ┌───────┐     │
│ │Master│ │GPU │ │GPU │ │GPU │     │ │  │      │ │       │ │       │     │
│ │      │ │    │ │    │ │    │ ... │ │  │Harbor│ │Grafana│ │ Other │ ... │
│ │ Node │ │Node│ │Node│ │Node│     │ │  │      │ │       │ │       │     │
│ └──────┘ └────┘ └────┘ └────┘     │ │  └──────┘ └───────┘ └───────┘     │
│                                   │ │                                   │
└───────────────────┬───────────────┘ └──────────┬────────────────────────┘
                    │                            │
                  Access                       Access
                    │                            │
┌───────────────────▼────────────────────────────▼────────────────────────┐
│                              TrueNAS - NFS                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│                              Storage Server                             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
The hardware specifications of the cluster nodes are as follows:
GPU Node 1:
| Name | Spec | 
|---|---|
| Model | Powerleader PR4908R (Supermicro 4124GS-TNR) | 
| CPU | AMD EPYC 7302 * 2 (32C/64T, 3.0-3.3GHz) | 
| RAM | Samsung M393A2K43DB2-CVF DDR4 256G (16G*16) 2933MT/s ECC REG | 
| GPU | MSI (0x1462) RTX 3090 Turbo * 8 | 
| SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 | 
| NIC | Intel I350-T2 1GbE Dual Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
| RAID | LSI MegaRAID SAS-3 3108 | 
GPU Node 2:
| Name | Spec | 
|---|---|
| Model | Powerleader PR4908R (Supermicro 4124GS-TNR) | 
| CPU | AMD EPYC 7402 * 2 (48C/96T, 2.8-3.35GHz) | 
| RAM | SK Hynix HMA84GR7DJR4N-XN DDR4 512G (32G*16) 3200MT/s ECC REG | 
| GPU | MANLI (NVIDIA/0x10DE) RTX 4090 * 8 | 
| SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 | 
| SSD | KIOXIA CD6 7.68TB (U.2 PCIe 4.0) * 1 | 
| NIC | Intel I350-T2 1GbE Dual Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
GPU Node 3, 4:
| Name | Spec | 
|---|---|
| Model | Powerleader PR4908R (Supermicro 4124GS-TNR) | 
| CPU | AMD EPYC 7402 * 2 (48C/96T, 2.8-3.35GHz) | 
| RAM | Samsung M393A4K40DB3-CWE DDR4 512G (32G*16) 3200MT/s ECC REG | 
| GPU | MSI (0x1462) RTX 3090 * 8 | 
| SSD | Intel P4510 2TB (U.2 PCIe 3.1) * 1 | 
| NIC | Intel I350-T2 1GbE Dual Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
GPU Node 5:
| Name | Spec | 
|---|---|
| Model | ASUS ESC8000A-E11 | 
| CPU | AMD EPYC 7543 * 2 (64C/128T, 2.8-3.7GHz) | 
| RAM | Samsung M393A4K40EB3-CWE DDR4 512G (32G*16) 3200MT/s ECC REG | 
| GPU | MANLI (NVIDIA/0x10DE) RTX 4090 * 8 | 
| SSD | Intel S4610 (SSDSC2KG96) 960G (SATA) (RAID 1) * 2 | 
| NIC | Intel I350-T4 1GbE Quad Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
| RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 | 
GPU Node 6, 7:
| Name | Spec | 
|---|---|
| Model | ASUS ESC8000A-E12 | 
| CPU | AMD EPYC 9554 * 2 (128C/256T, 3.1-3.75GHz) | 
| RAM | Samsung M321R8GA0BB0-CQKZJ / Micron MTC40F2046S1RC48BA1 DDR5 1536G (64G*24) 4800MT/s ECC REG | 
| GPU | MSI (NVIDIA/0x10DE) RTX 4090 * 8 | 
| SSD | Samsung PM9A3 1.92T (U.2 PCIe 4.0) * 1 | 
| NIC | Intel I350-AM2 1GbE Dual Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
GPU Node 8:
| Name | Spec | 
|---|---|
| Model | ASUS ESC8000A-E12 | 
| CPU | AMD EPYC 9554 * 2 (128C/256T, 3.1-3.75GHz) | 
| RAM | SK Hynix HMCG94AEBRA109N DDR5 1536G (64G*24) 4800MT/s ECC REG | 
| GPU | NVIDIA (0x10de) RTX 6000 Ada Generation 48G * 8 | 
| SSD | Samsung PM9A3 (MZQL21T9HCJR-00A07) 1.92TB 2.5" NVMe U.2 drive * 2 | 
| NIC | Mellanox ConnectX-6 VPI NIC; HDR100, EDR IB/100GbE; dual-port QSFP56; PCIe4.0 x16; (MCX653106A-ECAT) | 
| NIC | Intel I350-T2 1GbE Dual Port | 
Storage Server:
| Name | Spec | 
|---|---|
| Model | Powerleader PR4224AK (Supermicro H11SSL) | 
| CPU | AMD EPYC 7302 (16C/32T, 3.0-3.3GHz) | 
| RAM | Samsung M393A4K40DB2-CWE DDR4 256G (32G*8) 2933MT/s ECC REG | 
| SSD | Intel 760p (SSDPEKKW256G8) 256G (M.2 PCIe 3.0) * 1 | 
| SSD | Intel S4510 1.92TB (SATA) * 2 | 
| SSD | WD Ultrastar DC SN640 (WUS4BB076D7P3E3) 7.68TB (U.2 PCIe 3.0) * 12 (2 corrupted) | 
| HDD | Seagate Exos X18 18TB * 24 | 
| NIC | Intel I210 1GbE * 2 | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
| RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 | 
Storage Server (2U, SSD-only):
| Name | Spec | 
|---|---|
| Model | Dell PowerEdge R7525 | 
| CPU | AMD EPYC 7313 * 2 (32C/64T, 3.0-3.7GHz) | 
| RAM | Samsung DDR4 512G (32G*16) 3200MT/s ECC REG | 
| SSD | KIOXIA DC CD7 RI 960GB 2.5" NVMe U.2 drive * 1 | 
| SSD | Seagate Nytro 5350H 7.68 TB 2.5" NVMe U.3 drive * 8 (replaced) | 
| SSD | Samsung BM1743 30.72 TB 2.5" NVMe U.3 drive * 6 | 
| SSD | KIOXIA CM6 30.72 TB 2.5" NVMe U.3 drive * 2 | 
| NIC | Broadcom BCM5720 Gigabit Ethernet * 2 | 
| NIC | Mellanox ConnectX-6 VPI NIC; HDR100, EDR IB/100GbE; dual-port QSFP56; PCIe4.0 x16; (MCX653106A-ECAT) | 
Management Server:
| Name | Spec | 
|---|---|
| Model | ASUS RS520-E9-RS8 V2 | 
| CPU | Intel Xeon Silver 4210R * 2 (20C/40T, 2.4-3.2GHz) | 
| RAM | Samsung M393A4K40EB3-CWE DDR4 64G (32G*2) 3200MT/s @ 2400MT/s ECC REG | 
| SSD | Intel S4610 (SSDSC2KG96) 960G * 2 (SATA) (RAID 1) | 
| NIC | Intel I350-AM2 1GbE Dual Port | 
| NIC | Mellanox ConnectX-4 VPI EDR QSFP28 MCX455A-ECAT 100Gb ETH/IB Single Port | 
| RAID | LSI SAS3008 PCI-Express Fusion-MPT SAS-3 | 
Switch:
| Brand | Model & Spec | 
|---|---|
| NVIDIA Mellanox | Spectrum SN2700 100GbE 1U Open Ethernet Switch with NVIDIA Onyx, 32 QSFP28 ports, 2 PSU, x86 CPU, Standard depth | 
Note:
hwinfo:
sudo apt install hwinfo && sudo hwinfo > hwinfo.log
Check memory model:
sudo dmidecode -t memory
Check GPU model:
lspci -vnn | grep VGA -A 12
for NVIDIA GeForce GTX/RTX graphics cards,
lspci -vnn | grep "3D controller" -A 12
for NVIDIA Quadro/Tesla/"" acceleration cards.
Disable NVIDIA GPU's ECC:
sudo nvidia-smi -e 0




