Abstract
With the ever-growing interest in AI, Machine Learning, and Deep Learning, new acceleration techniques are being devised to improve performance. One such technique is the use of matrix engines within CPUs to help bridge the gap between CPU and GPU performance. Whilst GPUs typically dominate these kinds of workloads due to their inherent SIMD nature, they can come at a cost, namely power and data-offload overheads. As such, having matrix engines close to, or inside of, the CPU itself can provide additional performance whilst reducing these overheads. Some recent CPU offerings from Intel, Apple, and IBM each include their own version of a matrix engine, implemented slightly differently but achieving the same goal of improved matrix multiplication performance. Although no hardware is currently available, Arm have also specified their own CPU matrix ISA extension, called the Scalable Matrix Extension (SME). Building on their Scalable Vector Extension (SVE), SME introduces new outer-product instructions and a 2-D matrix register to accelerate level 3 BLAS operations. A more recent version of the extension, SME2, adds support for inner-product and multi-vector instructions to accelerate level 2 BLAS operations.
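To make the outer-product formulation concrete, the sketch below (our illustration, not SME assembly or intrinsics) expresses SGEMM as a sequence of rank-1 updates, the computation pattern that SME's outer-product instructions accelerate by accumulating each update into the 2-D matrix (ZA) register:

```python
import numpy as np

def gemm_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    Hypothetical illustration of the outer-product GEMM scheme: on SME,
    each iteration of the k loop would map to an outer-product
    accumulate into a tile of the 2-D matrix register.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        # One rank-1 update per step: column k of A times row k of B.
        C += np.outer(A[:, k], B[k, :])
    return C
```

This restructures the classic triply nested GEMM loop so that the innermost work is a full M-by-N outer product rather than a dot product, which is why a 2-D accumulator register is a natural fit.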
Due to the lack of available hardware, we utilise The Simulation Engine (SimEng) from the University of Bristol's High Performance Computing Group, along with the Structural Simulation Toolkit (SST) from Sandia National Laboratories, to simulate a hypothetical core design with an integrated SME matrix engine. This enables us to build on previous work from Wilkinson et al., which compared the performance of SVE and SME SGEMM. By widening the scope to both SGEMM and DGEMM, we can more comprehensively evaluate the advantages of SME compared to like-for-like NEON (a 128-bit SIMD extension) and SVE implementations, and how this may translate to performance gains in real workloads.
| Original language | English |
| --- | --- |
| Publication status | Published - 10 Aug 2023 |
| Event | Workshop on Modeling & Simulation of Systems and Applications - Seattle, United States. Duration: 9 Aug 2023 → 11 Aug 2023. Conference number: 2023. https://www.bnl.gov/modsim/index.php |
Workshop
| Workshop | Workshop on Modeling & Simulation of Systems and Applications |
| --- | --- |
| Abbreviated title | ModSim |
| Country/Territory | United States |
| City | Seattle |
| Period | 9/08/23 → 11/08/23 |
| Internet address | https://www.bnl.gov/modsim/index.php |
Keywords
- SimEng
- Matrix
- Simulation
- Micro-Architecture
- Arm SME
- GEMM
- CPU
- High-performance computing