Lightweight MPI Profiling with Intel MPI

The Intel MPI implementation [1] has built-in profiling capabilities that cover MPI communication and MPI-IO, but not POSIX I/O. No code changes are required to use this feature, unless you wish to specify particular regions of interest to profile. If regions of interest are labelled, profiling data are presented for the entire code, for each labelled region, and for the non-labelled remainder. The POP performance metrics can then be calculated from the performance data provided by the tool; detailed information on the POP metrics can be found in reference [2].

To label a region, use the following Fortran code:

call MPI_INIT( ierr )
[ ... ]                           ! 1. some computation
call MPI_PCONTROL( 1, "label" )   ! 2. open a region
do i = 1, Ni                      ! 3. region of interest that is
[ ... ]                           ! doing computation and
end do                            ! communication
call MPI_PCONTROL( -1, "label" )  ! 4. close a region
[ ... ]                           ! 5. some other computation
call MPI_FINALIZE( ierr )

For C and C++, the equivalent is the MPI_Pcontrol function, which takes the same arguments.
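
A minimal C sketch of the same pattern might look as follows (the label name, loop bound and placeholder comments are illustrative only):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* ... some computation ... */

    MPI_Pcontrol(1, "label");         /* open a region             */
    for (int i = 0; i < 100; i++) {   /* region of interest that   */
        /* ... */                     /* does computation and      */
    }                                 /* communication             */
    MPI_Pcontrol(-1, "label");        /* close the region          */

    /* ... some other computation ... */

    MPI_Finalize();
    return 0;
}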

IPM Interface

Intel MPI can collect two types of performance data: native data and IPM data. The IPM feature stores high-level profiling data, such as how much time is spent in MPI and in which MPI subroutines. To get a high-level overview of your MPI code, set the following environment variables:

export I_MPI_STATS=ipm
export I_MPI_STATS_FILE=prof.dat

The first variable tells Intel MPI to use the IPM module and the second sets the filename in which to store the profiling data. The output file will contain three kinds of section, each prefixed by the text region:

  1. The * region, which gives performance data for the entire code;
  2. One section for each labelled region;
  3. A section for anything else that is not labelled.
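
With these variables set, run the application as usual under mpirun; for example, for the 9-process run shown in the sample output below (the binary path is taken from that sample and is only illustrative):

mpirun -np 9 ./bin/bt.A.9

The statistics are written to prof.dat once the run completes.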

Below is sample output for a region labelled time-step from a run with 9 MPI processes:

# command : ./bin/bt.A.9 (completed)
# host    : nash.nag.co.uk/x86_64_Linux     mpi_tasks : 9 on 1 nodes
# start   : 11/09/17/16:50:21               wallclock : 7.585738 sec
# stop    : 11/09/17/16:50:29               %comm     : 8.07
# region  : time-step   [ntasks] = 9
#
#                  [total]      <avg>          min          max
# entries           9            1             1            1            
# wallclock        67.4482       7.49425       7.49325      7.49485       
# user             66.5771       7.39745       7.38408      7.40627      
# system            0.903591     0.100399      0.091887     0.112871     
# mpi               5.21836      0.579817      0.349588     0.789046     
# %comm             7.73683      4.66442      10.5283      
# gflop/sec        NA           NA            NA           NA
# gbytes            0            0             0            0             
#
#
#               [time]        [calls]     <%mpi>    <%wall>
# MPI_Wait      3.79178        43200       72.66     5.62         
# MPI_Waitall   1.21447         1800       23.27     1.80         
# MPI_Isend     0.137987       32400        2.64     0.20         
# MPI_Irecv     0.0741241      32400        1.42     0.11         
# MPI_TOTAL     5.21836       109800      100.00     7.74

The user figures give the time spent in the user code, which includes time spent in MPI. For the POP metrics we require the time spent only in computation, i.e. not in MPI, and we calculate this from the above figures as the user time minus the MPI time. For the POP metrics, 100% is perfect, 0% is the worst possible, and a typical cut-off point for good enough performance is 80%. The load balance (LB) metric is the average computation time divided by the maximum computation time. For the above example, this is:

LB = (7.39745 - 0.579817) / (7.40627 - 0.349588) * 100 = 96.61%

Note that the maximum computation time is approximated by the maximum user time minus the minimum MPI time.

The communication efficiency (CommE) is the maximum computation time divided by the runtime (the wallclock value in the header above):

CommE = (7.40627 - 0.349588) / 7.585738 * 100 = 93.03%

The parallel efficiency (PE) is calculated as the product of the load balance and communication efficiency. For the above example:

PE = 96.61 * 93.03 / 100 = 89.88%
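
For convenience, these three calculations can be scripted. The following minimal C sketch hard-codes the figures from the sample report above; in practice they would be read from the I_MPI_STATS_FILE output:

#include <stdio.h>

int main(void)
{
    /* Figures taken from the sample IPM region report above */
    double wallclock = 7.585738;   /* total runtime (sec) */
    double user_avg  = 7.39745;    /* average user time   */
    double user_max  = 7.40627;    /* maximum user time   */
    double mpi_avg   = 0.579817;   /* average MPI time    */
    double mpi_min   = 0.349588;   /* minimum MPI time    */

    /* Computation time = user time minus MPI time */
    double comp_avg = user_avg - mpi_avg;
    double comp_max = user_max - mpi_min;  /* approximation, as noted above */

    double lb    = comp_avg / comp_max  * 100.0;  /* load balance             */
    double comme = comp_max / wallclock * 100.0;  /* communication efficiency */
    double pe    = lb * comme / 100.0;            /* parallel efficiency      */

    printf("LB    = %.2f%%\n", lb);
    printf("CommE = %.2f%%\n", comme);
    printf("PE    = %.2f%%\n", pe);
    return 0;
}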

The above metrics should be obtained for a range of MPI process counts, e.g. 2, 10, 60, etc., to see how they scale. The metrics for the lowest process count are referred to as the reference values. The computation efficiency (CompE) is the reference total user time (e.g. the value from a 1 or 2 MPI process run) divided by the total user time; for the above example, the total user time is 66.5771. The computation efficiency is then multiplied by the parallel efficiency to give the global efficiency (GE).
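
As an illustration only, using a hypothetical reference value rather than figures from an actual run: if a 2 MPI process reference run had a total user time of 60.0, then

CompE = 60.0 / 66.5771 * 100 = 90.12%
GE = 89.88 * 90.12 / 100 = 81.00%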

For poor load balance, either the parallel decomposition or the instructions per cycle (IPC) of each MPI process should be investigated further. For poor communication efficiency, investigate the MPI aspects of the code, e.g. the message sizes or the number of MPI subroutine calls. For example, if a large number of MPI calls are being made that send small messages, the data could potentially be aggregated and sent in fewer calls. See the Native Interface section on how to obtain details of MPI performance. For poor computational efficiency, investigate whether computation is being duplicated across the MPI processes or whether the overall IPC is decreasing. The Paraver/Extrae [3] profiling tools can be used to measure IPC.

Native Interface

The IPM module gives a high-level overview of application performance. The native statistics provide more in-depth MPI performance data, e.g. data transfer metrics, which can help identify performance problems.

To capture native performance data, set the I_MPI_STATS environment variable to one of the values 1, 2, 3, 4, 10 or 20; the higher the value, the more information is provided. Also set the filename for the performance report via the I_MPI_STATS_FILE environment variable, e.g. prof.dat. Then execute the parallel code with mpirun and the profiling data will be stored in that file.
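
For example (the statistics level of 10 and the reuse of the benchmark binary from the IPM example are illustrative choices):

export I_MPI_STATS=10
export I_MPI_STATS_FILE=prof.dat
mpirun -np 9 ./bin/bt.A.9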

The output file contains the following information:

  1. Amount of data transferred between MPI processes;
  2. MPI subroutine statistics, e.g. data transferred and number of calls;
  3. Amount of data transferred between MPI processes and by which MPI subroutine;
  4. Performance details of MPI collective subroutines.

For further information on Intel MPI’s profiling capabilities, please see reference [1].

References

[1] https://software.intel.com/en-us/mpi-developer-reference-windows-statistics-gathering-mode

[2] https://pop-coe.eu/node/69

[3] https://tools.bsc.es/paraver