Performance portability is becoming increasingly important as next-generation high performance computing systems grow more diverse and heterogeneous. Several new approaches to parallel programming have been developed in recent years to tackle this challenge, such as SYCL and Kokkos. While several studies have been published evaluating these new programming models, they have tended to focus on memory-bandwidth bound applications. In this paper we analyse the performance of the most promising modern parallel programming models on a diverse range of contemporary high-performance hardware, using a compute-bound molecular docking mini-app. We present a mini-app for BUDE, the Bristol University Docking Engine, an application routinely used for drug discovery. We implement the mini-app in different programming models targeting both CPUs and GPUs, including SYCL and Kokkos. We then present an analysis of the performance of each implementation and compare them to highly-optimised baselines set using established programming models such as OpenMP, OpenCL, and CUDA. Our study includes a wide variety of modern hardware platforms covering CPUs based on x86 and Arm architectures, as well as GPUs. We found that, with emerging higher-level parallel programming frameworks such as SYCL, we could achieve performance comparable to that of the established models without hurting either portability or productivity. We identify a set of key challenges and pitfalls to take into account when adopting these emerging programming models, some of which are implementation-specific effects rather than fundamental design errors that would prevent further adoption. Finally, we discuss our findings in the wider context of performance-portable compute-bound workloads.
|Title of host publication||High Performance Computing - 36th International Conference, ISC High Performance 2021, Proceedings|
|Subtitle of host publication||36th International Conference, ISC High Performance 2021, Virtual Event, June 24 – July 2, 2021, Proceedings|
|Editors||Bradford L. Chamberlain, Ana-Lucia Varbanescu, Hatem Ltaief, Piotr Luszczek|
|Number of pages||19|
|Publication status||Published - 17 Jun 2021|
|Event||ISC High Performance 2021 - Frankfurt, Germany|
Duration: 24 Jun 2021 → 2 Jul 2021
|Name||Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)|
|Conference||ISC High Performance 2021|
|Abbreviated title||ISC HPC|
|Period||24/06/21 → 2/07/21|
Bibliographical note
Funding Information:
1 https://github.com/UoB-HPC/miniBUDE
2 https://github.com/UoB-HPC/performance-portability/tree/2021-benchmarking/benchmarking/2021/bude
The authors would like to thank Si Hammond at Sandia National Laboratories for providing short-notice results for the A64FX platform. Thank you to James Price and Matt Martineau for their original contributions towards optimised OpenMP, OpenCL, and CUDA implementations of the BUDE kernel. This study would not have been possible without previous work by the developers of the Bristol University Docking Engine: Richard Sessions, Deborah Shoemark, and Amaurys Avila Ibarra. This work used the Isambard UK National Tier-2 HPC Service (https://gw4.ac.uk/isambard/) operated by GW4 and the UK Met Office, and funded by EPSRC (EP/T022078/1). Access to the Cray XC50 supercomputer Swan was kindly provided through the Cray Marketing Partner Network. Work in this study was carried out using the HPC Zoo, a research cluster run by the University of Bristol HPC Group (https://uob-hpc.github.io/zoo/).
© 2021, Springer Nature Switzerland AG.
- programming models
- performance portability
- performance analysis
- compute-bound benchmark