Leveraging Arm’s Scalable Matrix Extension to Accelerate Matrix Multiplication Kernels

Research output: Contribution to conference (Conference Poster)


Abstract

With the ever-growing interest in AI, Machine Learning, and Deep Learning, new acceleration techniques are being devised to improve performance. One such technique is the use of matrix engines within CPUs to help bridge the gap between CPU and GPU performance. Whilst GPUs typically dominate these workloads due to their inherently SIMD nature, they come at a cost, namely power consumption and data-offload overheads. Placing matrix engines close to, or inside, the CPU itself can therefore provide additional performance whilst reducing these overheads. Recent CPU offerings from Intel, Apple, and IBM each include their own version of a matrix engine, implemented slightly differently but achieving the same goal of improved matrix multiplication performance. Although no hardware is yet available, Arm have also specified their own CPU matrix ISA extension, the Scalable Matrix Extension (SME). Building on their Scalable Vector Extension (SVE), SME introduces new outer-product instructions and a 2-D matrix register to accelerate level 3 BLAS operations. A more recent version of the extension, SME2, adds inner-product and multi-vector instructions to support the acceleration of level 2 BLAS operations.
Due to the lack of available hardware, we utilise The Simulation Engine (SimEng) from the University of Bristol's High Performance Computing Group, along with the Structural Simulation Toolkit (SST) from Sandia National Laboratories, to simulate a hypothetical core design with an integrated SME matrix engine. This enables us to build on previous work by Wilkinson et al., which compared the performance of SVE and SME SGEMM implementations. By widening the scope to both SGEMM and DGEMM, we can more comprehensively evaluate the advantages that SME has over like-for-like NEON (Arm's 128-bit SIMD extension) and SVE implementations, and how these may translate into performance gains in real workloads.
Original language: English
Publication status: Published - 10 Aug 2023
Event: Workshop on Modeling & Simulation of Systems and Applications - Seattle, United States
Duration: 9 Aug 2023 - 11 Aug 2023
Conference number: 2023
https://www.bnl.gov/modsim/index.php

Workshop

Workshop: Workshop on Modeling & Simulation of Systems and Applications
Abbreviated title: ModSim
Country/Territory: United States
City: Seattle
Period: 9/08/23 - 11/08/23
Internet address: https://www.bnl.gov/modsim/index.php

Keywords

  • SimEng
  • Matrix
  • Simulation
  • Micro-Architecture
  • Arm SME
  • GEMM
  • CPU
  • High-performance computing

