Abstract
Although the OpenMP 4.0 standard has been available since 2013, support for GPUs was absent until very recently, with only a handful of experimental compilers available. In this work we evaluate the performance of Cray's new NVIDIA GPU-targeting implementation of OpenMP 4.0, using the mini-apps TeaLeaf, CloverLeaf and BUDE. We successfully port each of the applications, using a simple and consistent design throughout, and achieve performance on an NVIDIA K20X that is comparable to Cray's OpenACC in all cases. BUDE, a compute-bound code, required 2.2x the runtime of an equivalently optimised CUDA code, which we believe is caused by an inflated frequency of control flow operations and less efficient arithmetic optimisation. Impressively, both TeaLeaf and CloverLeaf, which are memory-bandwidth-bound codes, required only 1.3x the runtime of hand-optimised CUDA implementations. Overall, we find that OpenMP 4.0 is a highly usable open standard capable of performant heterogeneous execution, making it a promising option for scientific application developers.
Original language | English |
---|---|
Title of host publication | Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016 |
Publisher | Institute of Electrical and Electronics Engineers (IEEE) |
Pages | 338-347 |
Number of pages | 10 |
ISBN (Electronic) | 9781509021406 |
Publication status | Published - 4 Aug 2016 |
Event | 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States; Duration: 23 May 2016 → 27 May 2016 |
Conference
Conference | 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 |
---|---|
Country/Territory | United States |
City | Chicago |
Period | 23/05/16 → 27/05/16 |
Keywords
- Application programming interfaces
- High performance computing
- OpenMP
- Parallel computing
- Performance portability