3x Speed Improvement for Zenotech's zCFD Computational Fluid Dynamics Solver

Wednesday, November 22, 2017

zCFD by Zenotech is a density-based finite volume and Discontinuous Galerkin (DG) computational fluid dynamics (CFD) solver for steady-state or time-dependent flow simulation. It decomposes domains using unstructured meshes. It is written in Python and C++ and parallelised with OpenMP and MPI.

POP conducted a Performance Audit to identify potential areas for improvement. This identified that the code was spending a surprisingly large amount of time executing in serial and that one particular OpenMP loop was suffering from load imbalance. POP also noted that the CPU frequency was being lowered when the code was run on the maximum number of threads (12 for the machine used in the Audit).

As a result, Zenotech made a number of changes to the code:

  • Parallelising serial portions of code. Although the code contained the correct OpenMP pragmas, the compiler found one particular region too complex to analyse, so it neither optimised that region nor applied the OpenMP pragmas, and the region ran serially. This was solved by removing an inline keyword, which resulted in smaller blocks of code that the compiler was able to optimise and to which it applied the OpenMP pragmas.
  • Improving load balance. The main load imbalance occurred when computing the far-field boundary conditions and was due to a call to pow() hitting a slow code-path when both base and exponent were close to 1. This was resolved by scaling the base, raising it to the power, and then undoing the scaling. Zenotech also found that switching to dynamic OpenMP loop scheduling improved load balance.
  • Removing OpenMP regions that were being created on multiple threads. The original code used two worker threads, both of which created OpenMP regions even though the work being done on the second thread was minimal. This meant that only half the CPU cores were actually doing useful computational work. Zenotech altered the code to remove the OpenMP region from the second worker thread, and this made all threads available to perform useful computation.
  • Memory management modifications. The code was spending a lot of time allocating and deallocating the additional arrays needed to perform MKL batch BLAS calls. Zenotech re-engineered zCFD to call optimised small-matrix kernels that skip the error checks MKL usually performs, thus removing the need to use batch BLAS. This was achieved using the MKL_DIRECT_CALL preprocessor macro.
  • Changing execution environment settings to boost CPU performance. The CPU frequency governor was set to ondemand by default on the machine used for the Audit, which meant that the frequency reduced when all 12 threads were active. Adding --cpu-freq=performance to the Slurm job submission commands resolved the issue.
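The load-balance fix in the list above can be illustrated with a minimal C++ sketch. It is not taken from zCFD: the function names scaled_pow and pow_all, the scale factor of 2.0 and the chunk size of 64 are all assumptions chosen for illustration. The sketch relies on the identity x^y = (s·x)^y / s^y, so that pow() is never called with a base close to 1, and it uses dynamic OpenMP scheduling so that threads which finish early pick up further chunks of the loop.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: computes base^exponent via (scale*base)^exponent / scale^exponent,
// so that std::pow is not evaluated with a base close to 1.
// The scale value is an assumption; the article does not state the factor used.
double scaled_pow(double base, double exponent, double scale) {
    return std::pow(base * scale, exponent) / std::pow(scale, exponent);
}

// Hypothetical loop over many bases sharing one exponent.
// schedule(dynamic, 64) hands out chunks of 64 iterations on demand,
// which helps when some iterations are much slower than others.
// Without -fopenmp the pragma is ignored and the loop runs serially.
std::vector<double> pow_all(const std::vector<double>& bases,
                            double exponent,
                            double scale = 2.0) {
    std::vector<double> out(bases.size());
    // Undo-the-scaling factor is loop invariant, so compute it once.
    const double inv_scale_pow = 1.0 / std::pow(scale, exponent);
    const long n = static_cast<long>(bases.size());

    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < n; ++i) {
        out[i] = std::pow(bases[i] * scale, exponent) * inv_scale_pow;
    }
    return out;
}
```

Calling pow_all({1.0001, 0.9999}, 0.9999) should agree with std::pow to within rounding error. Whether the scaled form is actually faster depends on the maths library in use, so it is worth timing both variants before adopting it.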

For the test case used in the study, these improvements meant the code ran 1.65x faster on 12 threads. When Zenotech applied the modified code to a test case that was 100x larger, they observed a 3x performance improvement over the old code on 12 threads. The average cycle time fell from 3,253 ms to 1,185 ms, which corresponds to going from 10.4 GFlop/s to 30.6 GFlop/s for a single Broadwell socket.