From January through July 2021, members of POP contributed to a series of seven workshops on application performance analysis and optimisation organised for the UK's ExCALIBUR programme by Durham University, the DiRAC consortium and N8 CIR. The series addressed inter-node, intra-node and core-level runtime efficiency, as well as correctness of MPI and OpenMP/multithreaded application codes, and featured invited presentations of a variety of tools from JSC, RWTH Aachen, UVSQ and others.
In contrast to established training workshops for individual participants, held over several days within a single week, this series of full-day virtual workshop sessions took place once per month and was aimed at teams of application developers and analysts. In the mornings, one or two new tools were introduced in plenary presentations and demonstrated in hands-on exercises with small example codes; in the afternoons, breakout rooms were used to assist the teams in applying the tools to their own applications. Along with access to Slack for collaboration, Durham provided participants priority access to their DINE cluster, with 16 dual 16-core AMD EPYC 7302 compute nodes, for the duration of the workshop series, where the presented tools were installed ready to use.
Almost 80 registered participants formed 13 teams around different scientific research codes, with typically 20 or so actively engaged in each session. Teams working with applications that they had not themselves developed most valued tools that provided a high-level overview and visualisation of execution performance.
Those who had developed their own application code preferred tools that provided more detailed insight into CPU, OpenMP and MPI performance. Some tools were praised for better support for performance analysis of Python (which was used extensively in a few cases), while others struggled with heavily templated C++ code or with parallel programming models combining OpenMP tasks and MPI used concurrently by multiple threads. While the comprehensive performance information obtainable from the tools was generally welcomed, managing its volume via event selection/filtering, and annotating/recording important phases, was often a challenge with the different tools.
At the start of each session, teams were encouraged to informally share their significant insights and performance improvements. In one case, Scalasca/Vampir analyses identified copious small MPI messages that degraded performance; consolidating them into fewer, larger messages and replacing individual waits with MPI_Waitall improved matters. In another case, MAQAO identified a few loops in an expensive data initialisation phase that could be reordered and simplified to deliver an 11-fold speed-up. Correctness checks provided by MUST and Archer fortunately uncovered only minor issues with the participants' application codes.
Participant feedback collected throughout and at the end of the workshop series showed that it was greatly appreciated, and that the format worked well. The extended workshop sessions, and the periods to follow up between sessions, facilitated deeper interactions and more advanced analyses than are typically possible in compact training events, while reducing overload on participants. Having teams formed around application codes of common interest encouraged everyone to work together and share the benefits during the workshop, and set them up to continue using their preferred tools in their ongoing development activities.
Almost all participants would like to see the workshop series repeated, and there were also many suggestions for improving our tools and for additional workshop topics, which will be investigated.
An experience report produced by the organisers covering the key outcomes, findings and impressions of the workshop series is available from https://zenodo.org/record/5155503.
Recordings of the workshop presentations and associated slides are available on the workshop page at https://tinyurl.com/performanceanalysis2021.