The performance tools (PT) team at UVSQ (LI-PaRAD Laboratory) has been working on performance optimization tools and methodologies since 2004, starting with Itanium-based clusters.
MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a performance analysis and optimization tool suite operating at binary level. It can run both single-node and multi-node parallel applications. Compared to other tools, it provides an advanced focus on core/node performance.
Its main goal is to guide application developers along the optimization process through synthetic reports and hints. The tool combines dynamic and static analyses, building on its ability to reconstruct high-level structures such as functions and loops from an application binary. Since MAQAO operates at binary level, it is agnostic regarding the language used in the source code and does not require recompiling the application to perform analyses. Another key feature of MAQAO is its extensibility. Users can easily write their own modules using the Lua scripting language, allowing fast prototyping of new MAQAO modules. MAQAO has also been designed to concurrently support multiple architectures. At the moment, the Intel64, Xeon Phi and ARM architectures are implemented.
The main modules of MAQAO are:
- ONE View, a supervising module responsible for invoking the other modules and aggregating their results, which are then presented as synthetic reports in HTML or XLS format.
- LProf (Lightweight Profiler), a sampling-based lightweight profiler offering results at both function and loop levels and capable of categorizing its results depending on their source. LProf is agnostic with regard to the MPI or OpenMP runtime used by the application.
- CQA (Code Quality Analyzer), a static analyzer assessing the quality of the code generated by the compiler and producing a set of reports describing potential issues, estimates of the gain if they are fixed, and hints on how to achieve this through modifications to the source code or the compilation chain. Since it relies on static analysis, CQA can also provide projections of code performance on different architectures (cross-analysis).
- VProf, a value profiler relying on instrumentation through binary rewriting. For example, it can be used to retrieve the number of iterations and the execution time of loops, and to group the results by loop instances with similar behavior. It can also be used to gather typical parameter values for subroutine arguments.
- DECAN (DECremental ANalyzer), an advanced MAQAO module using differential analysis on innermost loops to locate performance issues. To achieve this, DECAN generates several variants of an assembly loop through binary rewriting, by removing or modifying groups of instructions and adding probes to measure time or read hardware counters for the modified loop. Comparing the results obtained after executing the different variants then makes it possible to infer the impact of the modified instructions on the overall loop performance. This also allows DECAN to predict the application behavior if certain conditions are met (e.g. all data in L1 cache) or some transformations have been performed.
- ASSIST, a prototype code restructuring tool implementing advanced Profile-Guided Optimization techniques: it first uses the more refined performance analysis provided by MAQAO modules, then performs complex code restructuring such as specialization.
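
The differential analysis idea behind DECAN can be illustrated with a small, purely hypothetical Python sketch. It does not use or reproduce any MAQAO interface: the variant names, the cycle counts and the helper functions below are all assumptions made for illustration. Given a timing for the original loop and timings for variants in which one group of instructions has been removed, the relative drop in time gives a rough estimate of how much that group weighs on the loop.

```python
from dataclasses import dataclass


@dataclass
class VariantTiming:
    """Timing of one loop variant (hypothetical data, not MAQAO output)."""
    name: str      # label of the instruction group that was removed
    cycles: float  # measured cycles per iteration for this variant


def estimate_impact(reference_cycles, variants):
    """Return, per variant, the fraction of the reference time saved
    when the corresponding instruction group is removed."""
    if reference_cycles <= 0:
        raise ValueError("reference_cycles must be positive")
    return {
        v.name: (reference_cycles - v.cycles) / reference_cycles
        for v in variants
    }


def dominant_group(impacts):
    """Return the name of the instruction group with the largest impact."""
    return max(impacts, key=impacts.get)


# Made-up example: the original loop takes 100 cycles per iteration.
reference = 100.0
variants = [
    VariantTiming("memory_ops_removed", 40.0),
    VariantTiming("fp_ops_removed", 90.0),
]

impacts = estimate_impact(reference, variants)
bottleneck = dominant_group(impacts)
```

With these made-up numbers, removing memory operations saves far more time than removing floating-point operations, which would suggest a loop limited by data access rather than by computation. A real analysis would also need to account for measurement noise and for interactions between instruction groups, which this sketch ignores.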