Beyond Benchmarks: Comparing Parallel Programming Models in Real-World Scenarios


Date
Jun 12, 2024 11:00 AM — 11:30 AM
Location
Grundlsee, Austria

The HPC landscape offers many parallel programming models. Some place high flexibility and responsibility in the hands of the programmer (e.g. MPI), while others advocate a separation of concerns and offer a domain-specific-language-like experience. The latter let application developers express algorithms at a high level of abstraction and shift performance optimization to the system developer and toolchain (e.g. Celerity). Regardless of their focus, these models usually offer functional portability but still face the problems of performance portability and programmer productivity. Many works study this topic, but most rely on benchmark-based evaluation, e.g. the NAS Parallel Benchmarks (NPB), BLAS, STREAM, other proxy apps, or even textbook sample codes. Real-world applications, however, especially with their engineering requirements, are often much more problematic. The few works that do consider real-world applications are usually limited, because of the effort involved, to comparing one established implementation (MPI) with a new one meant to replace it (e.g. MPI+CUDA).

In this talk, we will discuss our aim to provide a qualitative study of performance portability and programmer productivity by porting Cronos, a real-world application with interesting properties, to multiple parallel programming models. A key idea is to employ computer science students who have some parallel programming background but are not experts. This creates a case study that resembles real-life circumstances and enables the analysis of aspects such as ease of use, productivity, and documentation quality.

Cronos is an astrophysics simulation on a 3D structured grid, developed and maintained at the University of Innsbruck, with common but also challenging characteristics. While suitable for execution on GPUs, Cronos uses features that many programming models struggle with, including dynamic, data-dependent time step length calculation, which poses a latency bottleneck, and parallel I/O using HDF5. Finally, we aim to maintain a high standard of software engineering, which needs to be considered when integrating with parallel programming models. Cronos has been successfully used in a PRACE project on up to 16,000 cores on Joliot-Curie Rome at GENCI.

This project has been partially funded by EuroHPC-JU grant 956137 and the Vienna Scientific Cluster (VSC).