Friday, 28 March 2025, 15:00 CET
High Performance Computing (HPC) systems are large, heterogeneous, and sophisticated, which makes them difficult to use efficiently. HPC users are allocated a finite amount of compute time on these systems, yet they have no portable utility to confirm that they are using that allocation effectively.
ZeroSum addresses these problems. It is a user-space library that is launched within the process space of the HPC application. For each application process, it monitors the application threads, MPI communication, and the hardware resources assigned to them, including CPU cores and/or hardware threads, memory usage, and GPU utilization. Supported systems include Linux-based operating systems, as well as GPUs from NVIDIA (using the NVML library), AMD (using the ROCm-SMI library), and Intel (using the SYCL API).
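As a rough illustration of what "hardware resources assigned to a process" looks like on Linux, the sketch below parses the text of /proc/<pid>/status for the allowed CPU list, the thread count, and the resident memory. It is not ZeroSum's code: the function names parse_cpu_list and parse_status are hypothetical, and the only thing assumed from the kernel is the standard field names Cpus_allowed_list, Threads, and VmRSS.

```python
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A single CPU ('8') has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def parse_status(status_text):
    """Pick a few resource fields out of /proc/<pid>/status contents."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "allowed_cpus": parse_cpu_list(fields.get("Cpus_allowed_list", "")),
        "threads": int(fields.get("Threads", "0")),
        # VmRSS is reported as '<number> kB'; default to 0 if it is missing.
        "rss_kib": int(fields.get("VmRSS", "0 kB").split()[0]),
    }


if os.path.exists("/proc/self/status"):  # only present on Linux
    with open("/proc/self/status") as f:
        print(parse_status(f.read()))
```

Comparing the allowed CPU list against the number of threads is one simple way to spot a process that has been given fewer cores than it has threads to run.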
Host-side monitoring uses the virtual /proc filesystem and is therefore portable to all Linux systems. When ZeroSum is integrated with the hwloc library, the included Python post-processing scripts can generate visualizations of the utilization data. Automatic deadlock detection is available, and ZeroSum will generate call stacks from all ranks, merge them, and visualize the merged call stacks to help diagnose where expected behavior diverged (similar to STAT/Cray-STAT). Monitoring overhead is less than 0.5%.
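The call-stack merging described above can be pictured as building a prefix tree: stacks that share outer frames share a path, and each node records which ranks reached it. The sketch below is a minimal illustration under that assumption, not ZeroSum's or STAT's implementation; merge_stacks, render, and the sample frame names are hypothetical.

```python
def merge_stacks(stacks_by_rank):
    """Merge per-rank call stacks (outermost frame first) into a prefix tree.

    Each node is a dict: {"ranks": set of ranks, "children": {frame: node}}.
    """
    root = {"ranks": set(), "children": {}}
    for rank, frames in stacks_by_rank.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            node = node["children"].setdefault(
                frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)
    return root


def render(node, depth=0):
    """Return indented lines, one per frame, annotated with the rank set."""
    lines = []
    for frame, child in sorted(node["children"].items()):
        ranks = ",".join(str(r) for r in sorted(child["ranks"]))
        lines.append(f"{'  ' * depth}{frame} [ranks {ranks}]")
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical stacks: rank 2 is waiting in a different call than ranks 0 and 1.
stacks = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "exchange_halo", "MPI_Recv"],
}
print("\n".join(render(merge_stacks(stacks))))
```

In the rendered tree, the frame where the rank sets split (here, below main) marks the point at which the ranks stopped following the same path.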
About the Presenter
Kevin Huck is a Senior Research Associate in the Oregon Advanced Computing Institute for Science and Society (OACISS) at the University of Oregon. He is investigating the challenges of performance analysis of large HPC applications as well as automated methods for diagnosing and treating performance problems both offline and with runtime controls.
His MS and PhD degrees in Computer and Information Science are from the University of Oregon, USA, and his BS in Computer Science is from the University of Cincinnati, Ohio, USA.