As we approach the Exascale era, computer architectures are evolving ever-greater vector and matrix acceleration units: NVIDIA’s Ampere Tensor Cores, Intel’s AMX, and Arm’s SVE vector instruction set are just three recent examples [1, 2, 10]. To exploit these, optimised math libraries, such as those for dense and sparse linear algebra, are expected to play an increasing role in achieving optimal performance. It is therefore useful to understand which of these functions dominate an application’s runtime and, in particular, how this changes with increasing scale. This work aims to provide a contemporary dataset on how much dense linear algebra (BLAS) is used in HPC codes at scale. We have analysed several science codes widely used on the UK HPC service, ARCHER (https://www.archer.ac.uk), including CASTEP, CP2K, QuantumESPRESSO, and Nektar++. To capture demands from the AI community, we have additionally traced the training stage of the Convolutional Neural Network (CNN) AlexNet. HPLinpack is also included as a reference, as it exhibits a well-understood BLAS usage pattern. Results across all the codes show that, unlike in HPLinpack, BLAS usage never exceeds 25% of the total runtime, even when running at a modest scale (32 nodes of the Arm-based supercomputer, Isambard). Under Amdahl’s law, this limits the achievable speedup, which suggests that application developers may need to adjust their algorithms to spend more time in optimised BLAS libraries to capitalise on new architectures and accelerators.
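The Amdahl's law argument in the abstract can be made concrete with a small calculation. The following is a minimal sketch, not taken from the paper: the function names `amdahl_speedup` and `amdahl_limit` are hypothetical, and the 10x BLAS acceleration factor is an assumed value chosen purely for illustration. Only the 25% BLAS fraction comes from the abstract.

```python
import math


def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when a fraction of the runtime is sped up by `factor`.

    Amdahl's law: S = 1 / ((1 - f) + f / k), where f is the fraction of
    runtime that benefits and k is the speedup of that fraction.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on overall speedup as the accelerated part becomes free."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if accelerated_fraction == 1.0:
        return math.inf
    return 1.0 / (1.0 - accelerated_fraction)


# 25% is the upper bound on BLAS runtime reported in the abstract.
blas_fraction = 0.25
# Assumed, illustrative acceleration of the BLAS portion (not from the paper).
assumed_factor = 10.0

speedup_at_10x = amdahl_speedup(blas_fraction, assumed_factor)
best_case = amdahl_limit(blas_fraction)
print(f"overall speedup with 10x faster BLAS: {speedup_at_10x:.2f}x")
print(f"upper bound with infinitely fast BLAS: {best_case:.2f}x")
```

With at most 25% of runtime in BLAS, the bound is 1 / (1 - 0.25), roughly 1.33x, however fast the BLAS routines become. This is the sense in which the speedup opportunity is limited.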
Title of host publication: SMC 2020: Driving Scientific and Engineering Discoveries Through the Convergence of HPC, Big Data and AI
Published: 18 Dec 2020