POP audit helps developers double their code performance

Wednesday, May 24, 2017

k-Wave is an open-source toolbox for time domain acoustic and ultrasound simulations in complex and tissue-realistic media. Simulation functions are based on the k-space pseudospectral method.

POP was asked by developers from Brno University of Technology to audit the C++ version of k-Wave, parallelised with MPI+OpenMP and running on the Salomon supercomputer hosted by IT4Innovations in the Czech Republic. The audited configuration used 32 dual-processor Intel Xeon compute nodes running 64 MPI processes, each with 12 OpenMP threads. The 3D domain decomposition employed (a 4x4x4 process arrangement) was found to perform poorly, with large amounts of both MPI and OpenMP synchronization time arising from major load imbalance.
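The arithmetic behind this configuration can be checked with a short sketch. The node, process, thread and grid figures come from the audit description above. The row-major rank ordering and the helper names `rank_to_coords` and `on_edge` are assumptions made for illustration only, not k-Wave's actual rank mapping.

```python
# Figures taken from the article.
NODES = 32
PROCESSORS_PER_NODE = 2      # dual-processor nodes
THREADS_PER_PROCESS = 12     # OpenMP threads per MPI process
GRID = (4, 4, 4)             # 3D process arrangement


def rank_to_coords(rank, grid=GRID):
    """Map a rank to (x, y, z) grid coordinates.

    ASSUMPTION: row-major ordering with the last axis varying fastest.
    The real code may order its ranks differently.
    """
    nx, ny, nz = grid
    return (rank // (ny * nz), (rank // nz) % ny, rank % nz)


def on_edge(rank, axis, grid=GRID):
    """Return True if the rank sits at either end of the grid along `axis`."""
    coord = rank_to_coords(rank, grid)[axis]
    return coord == 0 or coord == grid[axis] - 1


# 64 processes on 32 dual-processor nodes is consistent with one per processor.
mpi_processes = NODES * PROCESSORS_PER_NODE
total_threads = mpi_processes * THREADS_PER_PROCESS

# The process count must match the 4x4x4 arrangement.
assert mpi_processes == GRID[0] * GRID[1] * GRID[2]

# Classify the first four ranks along the fastest-varying axis.
first_four = {
    rank: "exterior" if on_edge(rank, axis=2) else "interior"
    for rank in range(4)
}

print(mpi_processes, total_threads)
print(first_four)
```

Under the assumed ordering, the first four ranks form one line through the process grid, with its two end ranks on the domain boundary and the two middle ranks between them.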

The figure shows an extract of the time-line visualization covering the three FFTW phases of one timestep for the first four MPI processes. In the original version (top, white background), the interior processes (ranks 1 and 2) wait in MPI communication (red) for the much slower exterior processes (ranks 0 and 3), whose many more small, poorly balanced parallel loops incur substantial OpenMP synchronization time (cyan). Although the exterior MPI processes have fewer grid cells, the OpenMP-parallelized FFTs from the FFTW library run much less efficiently on them because they have a larger FFT base.
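One common way to put a number on this kind of imbalance is the ratio of the average to the maximum per-process computation time, where 1.0 means every process is equally busy. The sketch below is illustrative only: the function name `load_balance` is hypothetical and the timings are invented, not measurements from the audit.

```python
def load_balance(compute_times):
    """Return mean / max of per-process compute times (1.0 = perfectly balanced)."""
    if not compute_times:
        raise ValueError("need at least one process")
    peak = max(compute_times)
    if peak <= 0:
        raise ValueError("compute times must be positive")
    return sum(compute_times) / len(compute_times) / peak


# HYPOTHETICAL per-rank compute times in seconds for ranks 0..3.
# These are made-up values, NOT data from the POP audit.
imbalanced = [4.0, 2.0, 2.0, 4.0]   # slow exterior ranks, fast interior ranks
balanced = [2.0, 2.0, 2.0, 2.0]     # every rank does the same amount of work

print(load_balance(imbalanced))  # 0.75
print(load_balance(balanced))    # 1.0
```

In the invented imbalanced case, a quarter of the available compute time is spent waiting for the slowest ranks, which is the sort of loss that shows up as MPI wait time in a time-line view.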

With this insight, the developers were quickly able to apply a periodic domain with identical halo zones for each MPI rank (lower time-line, lilac background), and the code now runs more than twice as fast. Both versions of the code are compared in the POP performance audit.