In addition to performance assessments, POP also helps implement recommendations that require a level of technical expertise unavailable to the customer. The aim in these cases is to demonstrate the advanced techniques rather than to deliver a full implementation. Recently, we carried out such a proof of concept, in collaboration with the ChEESE project, on one of their flagship codes, ASHEE.
The customer foresees a future use case that will require dumping simulation results at high frequency. This seems feasible only with an asynchronous I/O scheme, in which the application continues computation in the foreground while I/O proceeds concurrently in the background. The code does not calculate just a single quantity, but a range of roughly 10–15 three-dimensional fields, each governed by a different but interdependent equation.
Our first task was to identify the point at which a given field is fully computed, as opposed to merely updated: the data must be dumped after the field is fully computed and before it is updated in the next timestep. We did this by inspecting the code and understanding its algorithm. Next, we had a choice: make a copy of the field data and pass that to the I/O operation, or pass a pointer to the original data. The first option is apparently simpler, but it doubles the memory requirement of the application. Moreover, as we realized later, deep-copying the field data is actually not trivial because of the way the fields are realized in C++.
We therefore decided to go with the more interesting approach of keeping the data in place. To protect against race conditions, we needed to coordinate I/O operations with compute operations on each field. Using standard C++ threading techniques, such as semaphores and condition variables, we implemented a writer queue, which receives fields ready for dumping as soon as they are fully computed. The actual writing is done by a background thread that reads from the other end of the queue. It is possible for a field to be ready for update before its I/O has completed. Therefore, a particular field is updated only after the background thread signals completion of the corresponding write operation through a condition variable.
We tested our asynchronous I/O scheme on a proxy code, openfoam/ASHEE, which was provided to us in place of the full application. We benchmarked the code for two problem sizes and a range of node counts on two HPC systems with different parallel filesystems. Our experiments showed that we could overlap at least half of the I/O with computation, and in some relevant cases fully overlap I/O with computation.
The customer is now considering implementing this asynchronous I/O scheme in the full application. We look forward to hearing which challenges they encounter and how the scheme performs in production environments.
-- José Gracia (HLRS)