Bootcamp Performance-oriented Software Engineering: Experiences from Working with Real-world Problems and Developers


Date
May 21, 2025 3:15 PM — 3:45 PM
Location
Rimske Terme, Slovenia

Highly efficient, parallel software is mandatory today in both industry and academia, driven by the increasing parallelism of modern hardware and ever-growing problem sizes and complexity. Practical experience shows that a lack of parallelism and software optimization is usually compensated for by increased investment in hardware, that is, by overprovisioning. This may lead not only to direct economic cost disadvantages, but often also to further losses such as increased energy consumption or low efficiency. However, developing fast and efficient software is often tedious and requires a deep understanding of programming models, performance, and hardware. In this talk we present our experience with the FFG-funded Bootcamp Performance-oriented Software Engineering, a qualification measure in which 17 employees from 10 companies were trained to become “digital professionals” in topics including parallel programming, HPC, accelerator computing, code quality, and productivity. These digital professionals deal comprehensively with in-house IT projects and were trained to improve their understanding of the interaction between software and hardware and to increase the productivity of both developers and their software.

We will provide insights into the current state of the industry with regard to parallel programming and HPC use cases, along with selected examples of performance improvements we were able to achieve within just a few weeks of training and development. One example use case originates from the processing of 3D model data in the cabinet-making and furniture industry. The software was already written in C++ for performance reasons, but lacked any parallelization because it relied on non-thread-safe third-party libraries to process proprietary data containers. We analyzed the requirements of this use case and identified parallelization potential using OpenMP tasks, along with careful pipelining and synchronization to ensure thread-safe interaction with the third-party library. Preliminary results showed a speedup of 4x to 5x depending on the input data.
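The pattern described above can be illustrated with a minimal sketch. It is not the code from the use case: `LegacyReader`, `process`, and `run_pipeline` are hypothetical names, and the reader merely stands in for a non-thread-safe third-party library. The assumption is that only one thread may call into the library, while the compute-heavy processing of data already copied out of it can run as independent OpenMP tasks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a non-thread-safe third-party reader.
struct LegacyReader {
    int calls = 0;  // unsynchronized internal state, unsafe to touch concurrently

    std::vector<double> load(int part_id) {
        ++calls;
        return std::vector<double>(1000, static_cast<double>(part_id));
    }
};

// Compute-only step that works on a private copy, so it is safe to run in parallel.
double process(const std::vector<double>& mesh) {
    double sum = 0.0;
    for (double v : mesh) sum += v * v;
    return sum;
}

// Pipeline: one thread loads parts sequentially, worker threads process them as tasks.
std::vector<double> run_pipeline(LegacyReader& reader, int num_parts) {
    assert(num_parts >= 0);
    std::vector<double> results(num_parts, 0.0);

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < num_parts; ++i) {
            // Only the thread executing the single region calls the reader,
            // so all library calls are serialized without extra locking.
            std::vector<double> mesh = reader.load(i);

            // Each task receives its own copy of mesh and i (firstprivate)
            // and writes to a distinct slot of results, so tasks do not race.
            #pragma omp task firstprivate(mesh, i) shared(results)
            {
                results[i] = process(mesh);
            }
        }
        // Wait until all processing tasks have finished.
        #pragma omp taskwait
    }
    return results;
}
```

Calling `run_pipeline(reader, 8)` on a fresh `LegacyReader` returns one result per part while the reader is only ever used by a single thread. With GCC or Clang the pragmas take effect when compiling with `-fopenmp`; without that flag the same code runs serially. If the library also had to be called from inside a task, that call would need additional protection, for example an `omp critical` section.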

Another use case originates from materials science, simulating the generation and propagation of X-rays in computed tomography. The software was written in Python for execution on CPUs and, while sensible math packages were used, there was little further optimization in the code. We analyzed the requirements of the use case, ported the Python code to GPUs using Numba and CUDA, and identified several software components that are used for debugging only and could hence be removed from production runs. A benchmark run originally taking approximately 250 seconds on the reference CPU was reduced to just 45 ms of execution time on a GPU, resulting in a speedup of approximately 5000x to 6000x (depending on the input data).