A Prometheus exporter for realtime job monitoring of PBS Professional HPC clusters. Gathers metrics from PBS job cgroups along with job metadata and node metrics.
This exporter collects:
- Node Metrics: Cluster-wide node status and attributes from pbsnodes.
- Job Metrics: Job submission information for each PBS job.
- Cgroup Metrics: Realtime CPU and memory usage for each job via cgroups. Supports both cgroup v1 and v2.
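Because the cgroup collector reads from either hierarchy, it can help to know which one a compute node mounts. The sketch below is one way to check. It is not taken from the exporter's own detection logic, the `cgroup_version` helper name is hypothetical, and it assumes a `stat` that supports `-f -c %T` (GNU coreutils).

```shell
#!/bin/sh
# Sketch: report which cgroup hierarchy is mounted at a cgroup root.
# Assumes a `stat` that supports `-f -c %T` (GNU coreutils).
cgroup_version() {
    root="${1:-/sys/fs/cgroup}"   # same default as --cgroup.root
    fstype="$(stat -fc %T "$root" 2>/dev/null || true)"
    case "$fstype" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy
        tmpfs)     echo "v1" ;;       # per-controller mounts (also hybrid setups)
        *)         echo "unknown" ;;  # missing path or unexpected filesystem
    esac
}

cgroup_version /sys/fs/cgroup
```

The check only makes sense for a cgroup root: any other tmpfs path would also report `v1`.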
Configuration is managed with command-line flags. View command help:
pbs_exporter --help
usage: pbs_exporter [<flags>]
Flags:
--[no-]help Show context-sensitive help (also try --help-long and --help-man).
--[no-]cgroup.enabled Enable cgroup collector.
--cgroup.root="/sys/fs/cgroup" Root path of cgroup filesystem hierarchy.
--[no-]job.enabled Enable job collector.
--web.listen-address=":9307" Address to listen on for web interface and telemetry.
--[no-]node.enabled Enable node collector.
--job.pbs_home="/var/spool/pbs" PBS home directory.
--scrape.timeout=5 Per-scrape timeout in seconds.
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
--[no-]version Show application version.
The exporter is designed to run in two modes: on compute nodes to gather job-specific data, and on a single node to gather cluster-wide metrics.
Run the exporter on all compute nodes to collect job and cgroup metrics:
pbs_exporter
PBS node metrics will be the same from every node and should be collected once or deduplicated. Run the exporter for only node metrics:
pbs_exporter --node.enabled --no-cgroup.enabled --no-job.enabled
Binaries can be downloaded from the GitHub releases page.
To build from source instead, clone the repository and run make. This requires Go and make:
git clone https://github.com/0nebody/pbs_exporter.git
cd pbs_exporter
make pbs_exporter
Pre-built Grafana dashboards can be downloaded along with the exporter from the GitHub releases page. Basic dashboard modifications can be made by editing the configuration file and rebuilding the dashboards from Jsonnet:
make dashboards
Dashboards are split into public and private dashboards. Public dashboards filter metrics to display only jobs launched by the logged-in user. This assumes that HPC and Grafana users share usernames through a common authentication backend.
This exporter does not collect GPU metrics, but it integrates with the NVIDIA DCGM exporter. The DCGM exporter requires a PBS hook to map job IDs to their assigned GPUs. Configuring the DCGM exporter for HPC jobs is documented in the NVIDIA DCGM repository.
An example Prometheus configuration is available in the repository to help you get started with scraping the exporter.
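As a rough illustration of how the two deployment modes could map onto scrape jobs, the sketch below prints a minimal Prometheus configuration. It is not the repository's example file. The hostnames, job names, and the `pbs_scrape_config` helper are hypothetical. Port 9307 is the exporter's default listen address, and the exporter is assumed to serve metrics at Prometheus's default `/metrics` path.

```shell
#!/bin/sh
# Sketch: print a Prometheus scrape configuration for the two deployment modes.
# Hostnames and job names are hypothetical; 9307 is the exporter's default port.
pbs_scrape_config() {
    cat <<'EOF'
scrape_configs:
  # Job and cgroup metrics: one target per compute node.
  - job_name: pbs_jobs
    static_configs:
      - targets: ['node001:9307', 'node002:9307']
  # Node metrics: a single exporter instance, so they are collected once.
  - job_name: pbs_nodes
    static_configs:
      - targets: ['pbs-head:9307']
EOF
}

pbs_scrape_config
```

Redirecting the output to a file and running `promtool check config` on it is a quick way to validate the result before reloading Prometheus.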
Reading $PBS_HOME/mom_priv/jobs requires elevated privileges. The exporter needs this access whenever the job collector is enabled (--job.enabled).
Use setcap 'cap_dac_read_search=ep' pbs_exporter to run with minimal elevated privileges.
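A quick preflight check can show whether the account that will run the exporter already has access to the job directory. This is a minimal sketch: the `can_read_jobs` helper is hypothetical, and `/var/spool/pbs` is used as a fallback only because it is the default of `--job.pbs_home`. The shell test reflects the calling user's permissions, not any file capabilities set on the exporter binary.

```shell
#!/bin/sh
# Sketch: check whether the current user can list the PBS job directory.
# The helper name is hypothetical; /var/spool/pbs is the default of --job.pbs_home.
can_read_jobs() {
    jobs_dir="${1:-/var/spool/pbs}/mom_priv/jobs"
    # Listing a directory needs both read and search (execute) permission.
    [ -r "$jobs_dir" ] && [ -x "$jobs_dir" ]
}

if can_read_jobs "${PBS_HOME:-/var/spool/pbs}"; then
    echo "job directory is readable without extra privileges"
else
    echo "job directory is missing or not readable: the exporter needs elevated privileges"
fi
```

If the check fails for the service account, apply the capability with setcap as described above. Running `getcap pbs_exporter` afterwards should list cap_dac_read_search on the binary. File capabilities are stored on the file itself, so they need to be re-applied whenever the binary is replaced.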