Tool Time: Measurement of External Libraries with Score-P

Thursday, February 6, 2020

Scientific applications very often rely on third-party libraries such as MKL, FFTW, PETSc, or HDF5 for basic functionality. Such libraries usually simplify the implementation and make the application faster because they are finely tuned for the compilers and CPUs in use. We usually treat these libraries as black boxes and assume that their implementation is good enough for the user’s needs. Since using optimized external libraries is in most cases preferable to re-implementing basic functions, this is nothing we want to discourage. However, when external libraries are used extensively, users may need to know how much time the application spends in external library calls in order to understand its behavior. Here we show how this can be done with Score-P.

Consider the following simple example, which uses the FFTW library:

#include <stdlib.h>
#include <time.h>
#include <fftw3.h>

int main( void ) {
   srand( time( NULL ) );
   int N = 10000;
   fftw_complex *in, *out;
   fftw_plan my_plan;

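   /* Allocate input and output arrays with FFTW's aligned allocator. */
   in = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * N );
   out = ( fftw_complex* ) fftw_malloc( sizeof( fftw_complex ) * N );
   /* Fill the input with random values in [0, 1]; the cast avoids
      integer division, which would yield 0 almost always. */
   for ( int i = 0; i < N; i++ ) {
      in[i][0] = ( double ) rand() / RAND_MAX;
      in[i][1] = ( double ) rand() / RAND_MAX;
   }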
   my_plan = fftw_plan_dft_1d( N, in, out, FFTW_FORWARD,
                               FFTW_ESTIMATE );
   fftw_execute( my_plan );
   fftw_destroy_plan( my_plan );
   fftw_free( in );
   fftw_free( out );

   return 0;
}

After instrumenting and running it with Score-P

% scorep icc libwrap_example.c \
         -I<path-to-fftw-include-directory> \
         -L<path-to-fftw-lib-directory> \
         -lfftw3 -o libwrap_example_nowrap

we will see the following in the CUBE browser:

As we can see, there are no library calls in the call tree, and all the execution time is attributed to ‘main’.

To separate the time spent in library calls from the time spent in user code, we have three potential solutions:

  • Manual instrumentation. In this case the user manually wraps library calls in the source code.
    • Advantages
      • Great flexibility for the developer, i.e. which function’s calls to wrap, at which granularity
      • Quick solution for applications with only a small number of external library calls
    • Disadvantages
      • Does not provide internals of the library’s function calls
      • Provides no information about internal OpenMP parallelization, e.g. whether OpenMP threads were used internally, as Score-P uses source-code instrumentation to capture OpenMP events. [This can change in the future once Score-P is adapted to use the new OpenMP 5.0 OMPT measurement interface.]
      • Requires knowledge of the application
      • Requires some coding work
  • Full instrumentation of the library. Typically automatic compiler instrumentation will be sufficient.
    • Advantages
      • Provides complete internals of the library
      • Provides information about internal parallelization
    • Disadvantages
      • Not applicable for external libraries where source code is not available
      • Can be too detailed and potentially requires filtering
      • Requires an additional installation of the library, which cannot be used in production due to instrumentation overhead
      • Can be time consuming
  • Library wrapping mechanism. This is a relatively new feature of Score-P (available since version 4.0) that allows semi-automatic instrumentation of library calls.
    • Advantages
      • Configurable, i.e. can provide public and internal calls
      • Can also be applied to external libraries where source code is not available
      • Score-P can switch its measurement on/off at link time
      • Can to some extent be reused for future installations
    • Disadvantages
      • Provides no information about internal OpenMP parallelization, e.g. whether OpenMP threads were used internally
      • Preparation can be time consuming
      • Requires an LLVM compiler infrastructure, i.e. "llvm-config" and "libclang" together with its developer packages, during configuration
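
The manual instrumentation option listed above can be sketched with the user region macros from Score-P's user instrumentation header. This is only an illustration under stated assumptions: "external_transform" is a hypothetical stand-in for a real library routine such as fftw_execute, "wrapped_external_transform" and the region name are made up for this example, and the no-op fallback macros are added here only so that the file also compiles without Score-P.

```c
#include <stddef.h>

/* Use the Score-P user macros when user instrumentation is enabled;
   otherwise fall back to no-op macros (an assumption of this sketch,
   so that the file builds without Score-P as well). */
#ifdef SCOREP_USER_ENABLE
#include <scorep/SCOREP_User.h>
#else
#define SCOREP_USER_REGION_DEFINE( handle )
#define SCOREP_USER_REGION_BEGIN( handle, name, type )
#define SCOREP_USER_REGION_END( handle )
#endif

/* Hypothetical stand-in for an external library routine such as
   fftw_execute(); it only sums the squares of the input. */
static double external_transform( const double* data, size_t n )
{
    double sum = 0.0;
    for ( size_t i = 0; i < n; i++ )
    {
        sum += data[ i ] * data[ i ];
    }
    return sum;
}

/* Manual wrapper: the user region lets the library call appear as its
   own node in the call tree instead of being attributed to the caller. */
double wrapped_external_transform( const double* data, size_t n )
{
    SCOREP_USER_REGION_DEFINE( transform_region )
    SCOREP_USER_REGION_BEGIN( transform_region, "external_transform",
                              SCOREP_USER_REGION_TYPE_COMMON )
    double result = external_transform( data, n );
    SCOREP_USER_REGION_END( transform_region )
    return result;
}
```

In an application, calls to the library routine would be replaced by calls to such a wrapper, and the code would be built with user instrumentation enabled, e.g. by passing "--user" to the scorep command. As noted in the list, this only records the call itself and shows nothing of what happens inside the library.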

As full and manual instrumentation are well known and straightforward, we concentrate on the new library wrapping mechanism. Detailed instructions on how to create and install the FFTW3 wrapper library are given in this article. Further information can be found in the Score-P documentation.

Once we have installed the FFTW wrapper library in the Score-P installation, we can use it for instrumentation via the scorep option "--libwrap=<libname>":

% scorep --libwrap=fftw3 icc libwrap_example.c \
         -I<path-to-fftw-include-directory> \
         -L<path-to-fftw-lib-directory> \
         -lfftw3 -o libwrap_example_wrap

The picture below shows the results of library wrapping for the example above.

Now we can see that most of the time was spent in the "fftw_execute" call, and if necessary we can adjust or tune the library calls. The Time metric in the Metric tree now contains a subdivision for wrapped libraries, which makes it possible, in more complex codes, to separate the time spent in user code from the time spent in external libraries. As external libraries can be a given dependency over which the user has no influence, this also provides an option to focus analysis and optimization efforts on the code the user can change.