Lightweight MPI Profiling with mpiP

mpiP is an open-source library that provides lightweight profiling of MPI applications [1]. It records aggregate statistics about MPI calls rather than a full event trace, so it provides less detail than tracing tools, but it has low overhead and its output files are much smaller, particularly for runs with very large numbers of MPI processes. No code changes are required to use mpiP, but the application must be re-linked. The POP performance metrics can be calculated from the performance data that mpiP provides. Detailed information on the POP metrics can be found in reference [2].

The code should be compiled with the -g flag and linked with the mpiP library. The link line is shown below:

-o app.exe -L<path to mpiP>/lib -lmpiP -lm -lbfd -liberty -lunwind

Note that the libraries in the above link line must appear at the far right of the actual link line. The code is then executed as normal with mpirun, and the performance report is saved to a file whose name is printed at the end of the application run. The performance report contains the following sections:

  1. The percentage of time each rank is spending in MPI (which includes MPI-IO) and non-MPI;
  2. Call sites, which are the locations in the code that contain MPI calls;
  3. The top 20 call sites that spend the most time in MPI;
  4. The top 20 call sites that send the most data;
  5. MPI call site statistics which include number of times called, average/max/min time spent, and percentage of time in code and MPI;
  6. MPI call site statistics which include number of bytes sent, and average/max/min/total bytes sent.

Profiling can be switched on and off around regions of interest in the code by calling MPI_PCONTROL. The example code below shows how to control profiling of a region of interest:

call MPI_INIT( ierr )
call MPI_PCONTROL( 0 )      ! 1. disable profiling as it is enabled by default
[ ... ]                     ! 2. some computation
call MPI_PCONTROL( 1 )      ! 3. enable profiling
do i = 1, Ni                ! 4. region of interest that is doing
   [ ... ]                  ! computation and communication
end do
call MPI_PCONTROL( 0 )      ! 5. disable profiling
[ ... ]                     ! 6. some other computation
call MPI_FINALIZE( ierr )

In C and C++, the equivalent function is MPI_Pcontrol( level ), which takes the same integer level argument. For further information on mpiP, please see reference [1].

Below is an example of the first section of the profiling report, which shows how much time the application spends in MPI and in non-MPI code. Note that MPI time also includes MPI-IO subroutine calls (including parallel NetCDF and parallel HDF5) but not POSIX I/O. The AppTime field includes MPI time, so user-code time is calculated by subtracting MPITime from AppTime. Values that mpiP does not produce itself, such as the User code column, have been calculated manually.

@ Command : ./bt.C.9.mpi_io_full
@ Start time    : 2017 11 10 12:08:28
@ Stop time     : 2017 11 10 12:08:36
@ Run time      : 00:00:08

@--- MPI Time (seconds) ---------------------------------------------------

Task      AppTime    MPITime     MPI%    User code
0            7.75       3.00    38.72         4.75
1            7.75       3.48    44.94         4.27
2            7.75       3.55    45.73         4.20
3            7.75       3.46    44.69         4.29
4            7.75       3.55    45.76         4.20
5            7.75       3.55    45.82         4.20
6            7.75       3.52    45.41         4.23
7            7.75       3.50    45.14         4.25
8            7.75       3.53    45.51         4.22
total       69.80      31.10    44.63        38.61
max          7.75       3.55    45.82         4.75
min          7.75       3.00    38.72         4.20
avg          7.75       3.46    44.64         4.29
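
The subtraction of MPITime from AppTime can be automated. The following is a minimal Python sketch, assuming the per-rank rows have the four-column layout shown above (Task, AppTime, MPITime, MPI%). The embedded report excerpt, the regular expression and the function name user_code_times are illustrative assumptions and are not part of mpiP:

```python
import re

# Excerpt of the "MPI Time" section in the layout shown above
# (Task, AppTime, MPITime, MPI%); only three ranks are included for brevity.
REPORT = """\
@--- MPI Time (seconds) ---
Task    AppTime    MPITime     MPI%
   0       7.75       3.00    38.72
   1       7.75       3.48    44.94
   2       7.75       3.55    45.73
"""

# A per-rank row is an integer rank followed by three numeric fields.
ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$")


def user_code_times(report_text):
    """Return {rank: AppTime - MPITime} for every per-rank row."""
    times = {}
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match:  # header lines, '@' lines and summary rows do not match
            rank = int(match.group(1))
            app_time = float(match.group(2))
            mpi_time = float(match.group(3))
            times[rank] = app_time - mpi_time
    return times


for rank, seconds in sorted(user_code_times(REPORT).items()):
    print(f"rank {rank}: {seconds:.2f} s in user code")
```

In practice the report text would be read from the file that mpiP names at the end of the run, and the row pattern may need adjusting if the report layout differs.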

POP metrics are expressed as percentages, where 100% is ideal and 0% is the worst possible value; a typical threshold for acceptable performance is 80%. The runtime is calculated from the start and stop times and is 8 seconds for this example. The load balance (LB) metric is the average user-code time (averaged over all MPI processes) divided by the maximum user-code time. For the above example, this is:

LB = 4.29 / 4.75 * 100 = 90.32%

The communication efficiency (CommE) is the maximum user-code time divided by the runtime:

CommE = 4.75 / 8.00 * 100 = 59.38%

The parallel efficiency (PE) is the product of the load balance and the communication efficiency:

PE = 90.32 * 59.38 / 100 = 53.63%
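
The three calculations above can be reproduced with a few lines of Python. This is a minimal sketch that uses the per-rank user-code times from the table and the 8-second runtime; the function name pop_metrics is an illustrative assumption. The printed values should agree with the hand calculation to within rounding.

```python
# Per-rank user-code times (seconds) from the example table, and the
# runtime taken from the report's start and stop times.
user_time = [4.75, 4.27, 4.20, 4.29, 4.20, 4.20, 4.23, 4.25, 4.22]
runtime = 8.00


def pop_metrics(user_time, runtime):
    """Return (load balance, communication efficiency, parallel efficiency)
    as fractions between 0 and 1."""
    avg_user = sum(user_time) / len(user_time)
    max_user = max(user_time)
    load_balance = avg_user / max_user       # LB = average / maximum
    comm_eff = max_user / runtime            # CommE = maximum / runtime
    parallel_eff = load_balance * comm_eff   # PE = LB * CommE
    return load_balance, comm_eff, parallel_eff


lb, comm_e, pe = pop_metrics(user_time, runtime)
print(f"LB    = {100 * lb:.2f}%")
print(f"CommE = {100 * comm_e:.2f}%")
print(f"PE    = {100 * pe:.2f}%")
```

Because the maximum user-code time cancels in the product, PE is simply the average user-code time divided by the runtime, which is a quick way to cross-check the result.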

The above metrics should be obtained for several MPI process counts, e.g. 2, 10, 60, etc., to see how they scale. The metrics for the lowest number of MPI processes are referred to as the reference values. The computational efficiency is the reference total user time (e.g. the value for a run with 1 or 2 MPI processes) divided by the total user time. For the above example, the total user time is 38.61 seconds. The computational efficiency is then multiplied by the parallel efficiency (PE) to give the global efficiency.
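
As an illustration of these last two metrics, the Python sketch below combines the total user time of the example with a reference total user time. The reference value of 37.00 seconds is purely hypothetical, since no reference run is reported here, and the function names are illustrative assumptions.

```python
def computational_efficiency(ref_total_user_time, total_user_time):
    """Reference total user time divided by the total user time of this run."""
    return ref_total_user_time / total_user_time


def global_efficiency(comp_eff, parallel_eff):
    """Computational efficiency multiplied by parallel efficiency."""
    return comp_eff * parallel_eff


total_user_time = 38.61        # seconds, 9-process example above
parallel_eff = 4.29 / 8.00     # PE of the example, as a fraction
ref_total_user_time = 37.00    # HYPOTHETICAL value for a reference run

comp_e = computational_efficiency(ref_total_user_time, total_user_time)
glob_e = global_efficiency(comp_e, parallel_eff)
print(f"Computational efficiency = {100 * comp_e:.2f}%")
print(f"Global efficiency        = {100 * glob_e:.2f}%")
```

With a measured reference run, only ref_total_user_time needs to be replaced by the total user time of that run.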

For poor load balance, investigate either the parallel decomposition or the instructions per cycle (IPC) of each MPI process. For poor communication efficiency (CommE), investigate the MPI aspects of the code, e.g. message sizes or the number of MPI subroutine calls. For example, if a large number of MPI calls send small messages, the data could potentially be aggregated and sent in fewer MPI calls. For poor computational efficiency, investigate whether computation is being duplicated or whether the overall IPC is decreasing. The Paraver/Extrae [3] profiling tools can be used to calculate IPC.

The mpiP output then lists the call sites, which are the locations in the code that call MPI subroutines. For further information on the MPI data reported for call sites, please see reference [1].