Published in PEARC ’23: Jobstats: A Slurm-Compatible Job Monitoring Platform for CPU and GPU Clusters

Josko Plazonic, Jonathan Halverson, and Troy Comi

Job monitoring on high-performance computing clusters is important for evaluating hardware performance, troubleshooting failed jobs, identifying inefficient jobs and more. The combination of the Prometheus monitoring framework and the Grafana visualization toolkit has proven successful in recent years. This work shows how four Prometheus exporters can be configured for a Slurm cluster to provide detailed job-level information on CPU/GPU efficiencies and CPU/GPU memory usage as well as node-level Network File System (NFS) statistics and cluster-level General Parallel File System (GPFS) activity. A novel approach was devised to efficiently store a summary of this data in the Slurm database for each completed job. The open-source job monitoring platform introduced here can be used for batch, interactive and Open OnDemand jobs. Several tools are presented that use the Prometheus and Slurm databases to create dashboards, utilization reports and alerts.

Read the paper: https://doi.org/10.1145/3569951.3604396

The Princeton Research Software Engineering Group Blog

Princeton University
