Josko Plazonic, Jonathan Halverson, and Troy Comi
Job monitoring on high-performance computing clusters is important for evaluating hardware performance, troubleshooting failed jobs, identifying inefficient jobs and more. The combination of the Prometheus monitoring framework and the Grafana visualization toolkit has proven successful for this purpose in recent years. This work shows how four Prometheus exporters can be configured for a Slurm cluster to provide detailed job-level information on CPU/GPU efficiencies and CPU/GPU memory usage as well as node-level Network File System (NFS) statistics and cluster-level General Parallel File System (GPFS) activity. A novel approach was devised to efficiently store a summary of this data in the Slurm database for each completed job. The open-source job monitoring platform introduced here can be used for batch, interactive and Open OnDemand jobs. Several tools are presented that use the Prometheus and Slurm databases to create dashboards, utilization reports and alerts.
Read the paper: https://doi.org/10.1145/3569951.3604396