This article discusses the implementation of cache blocking for the Navier-Stokes equations in PyFR on CPUs. Cache blocking serves as an alternative to kernel fusion for minimizing unnecessary data movement between kernels at the main-memory level.
Cache Blocking to Reduce Data Movements
The main idea behind cache blocking is to group together kernels that exchange data and execute them on small sub-regions of the domain that fit in the per-core private data cache. Intermediate data passed from one kernel to the next then stays in cache instead of making a round trip through main memory, which improves performance.
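The idea can be illustrated with a minimal NumPy sketch. It is not PyFR's implementation: the two "kernels" are stand-in matrix multiplications, and the names `run_unblocked`, `run_blocked`, `op_a`, `op_b`, and `block_size` are hypothetical. The unblocked version materialises a full-size intermediate array between the kernels, while the blocked version runs both kernels back to back on a small group of elements and reuses one small scratch buffer.

```python
import numpy as np


def run_unblocked(op_a, op_b, u):
    """Run two dependent kernels over the whole domain, one after the other.

    u has shape (n_elems, n_in); the intermediate has shape (n_elems, n_mid)
    and, for a large domain, is written to and read back from main memory.
    """
    tmp = u @ op_a.T      # kernel 1: full-size intermediate
    return tmp @ op_b.T   # kernel 2: consumes the full intermediate


def run_blocked(op_a, op_b, u, block_size=64):
    """Run the same two kernels block by block.

    block_size is an illustrative tuning parameter; in practice it would be
    chosen so that one block's working set fits in a core's private cache.
    """
    n_elems = u.shape[0]
    out = np.empty((n_elems, op_b.shape[0]))
    # Small scratch buffer reused by every block.
    tmp = np.empty((block_size, op_a.shape[0]))

    for start in range(0, n_elems, block_size):
        stop = min(start + block_size, n_elems)
        width = stop - start
        # Kernel 1 on this block only.
        np.matmul(u[start:stop], op_a.T, out=tmp[:width])
        # Kernel 2 immediately consumes the block-local intermediate.
        np.matmul(tmp[:width], op_b.T, out=out[start:stop])
    return out
```

Both functions compute the same result; only the order of work and the size of the intermediate storage differ. Whether the scratch buffer really stays in cache depends on the hardware and on `block_size`, which this sketch does not attempt to tune.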
Cache blocking is particularly useful for the Navier-Stokes equations with anti-aliasing support on mixed grids. It allows an efficient implementation of a tensor product factorization of the interpolation operators associated with anti-aliasing: the intermediate results of the factorized operators are stored in the per-core private data cache, which avoids significant data movement to and from main memory.
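To make the tensor product factorization concrete, here is a minimal sketch for a single hexahedral element. It assumes the 3D interpolation operator is the Kronecker product of one 1D operator `m1d` with itself three times; the function names and the row-major point ordering are assumptions made for illustration, not PyFR's API or data layout. With n points per direction in the source and q in the target, the dense operator costs q³n³ multiply-adds per element, while the factorized form costs qn³ + q²n² + q³n at the price of two small intermediate arrays.

```python
import numpy as np


def interp_full(m1d, u):
    """Apply the dense 3D interpolation operator to one element.

    m1d has shape (q, n); u holds n**3 nodal values in row-major order.
    """
    m3d = np.kron(np.kron(m1d, m1d), m1d)  # shape (q**3, n**3)
    return m3d @ u


def interp_factored(m1d, u):
    """Apply the same operator one direction at a time (sum factorization)."""
    q, n = m1d.shape
    t = u.reshape(n, n, n)
    # Each step interpolates along one axis and yields a small intermediate.
    t = np.einsum('ai,ijk->ajk', m1d, t)  # shape (q, n, n)
    t = np.einsum('bj,ajk->abk', m1d, t)  # shape (q, q, n)
    t = np.einsum('ck,abk->abc', m1d, t)  # shape (q, q, q)
    return t.reshape(q**3)
```

The intermediates of shape (q, n, n) and (q, q, n) are the kind of short-lived data that benefits from staying in cache: if they were written out for every element of a large grid, they would add main-memory traffic that the dense operator does not have.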
Assessing Performance Gains
To evaluate the effectiveness of cache blocking, a theoretical model is developed that predicts the performance gains expected from the implementation. The model predicts speedups ranging from a factor of 1.99 to a factor of 2.62.
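The article's model is not reproduced here. As a rough illustration of how such a prediction can be formed, the sketch below assumes a purely bandwidth-bound code whose run time is proportional to the bytes exchanged with main memory. The function name, its parameters, and the example numbers are all hypothetical.

```python
def predicted_speedup(io_bytes, intermediate_bytes):
    """Estimate the speedup from cache blocking for a bandwidth-bound code.

    io_bytes: bytes per element that must touch main memory either way
        (initial inputs and final outputs).
    intermediate_bytes: bytes per element of intermediate data that, without
        blocking, is written to main memory by one kernel and read back by
        the next.

    Assumption: with blocking, intermediates never leave cache.
    """
    if io_bytes <= 0 or intermediate_bytes < 0:
        raise ValueError("io_bytes must be positive, intermediate_bytes non-negative")
    baseline = io_bytes + 2.0 * intermediate_bytes  # one write plus one read
    blocked = float(io_bytes)
    return baseline / blocked


# Hypothetical numbers: 100 bytes of unavoidable traffic, 50 bytes of intermediates.
example = predicted_speedup(100, 50)  # 2.0
```

The 1.99 to 2.62 range quoted above comes from the article's own model, not from this simplified estimate.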
To validate these theoretical predictions, benchmarks are run using a compressible 3D Taylor-Green vortex test case on both hexahedral and prismatic grids, with third- and fourth-order solution polynomials.
Real-world Performance Improvements
The speedups measured in practice range from a factor of 1.67 to a factor of 3.67 relative to PyFR v1.11.0. These results show that cache blocking is an effective technique for optimizing numerical simulations of the Navier-Stokes equations.
Overall, cache blocking in PyFR on CPUs improves the performance of Navier-Stokes simulations with anti-aliasing support on mixed grids. By reducing data movement and making efficient use of the per-core private data cache, the technique delivers significant gains in both the theoretical predictions and the measured benchmarks.