TY - GEN
T1 - Pragmatic Performance Portability with OpenMP 4.x
AU - Martineau, Matt
AU - Price, James
AU - McIntosh-Smith, Simon
AU - Gaudin, Wayne
PY - 2016/9/21
Y1 - 2016/9/21
N2 - In this paper we investigate the current compiler technologies supporting OpenMP 4.x offloading, and consider their ability to achieve a pragmatic level of performance on each of the intended target architectures. We consider the mechanisms with which several of the existing compiler implementations map the OpenMP model onto target architectures, discussing their divergence and considering the impact on performance portability. Following this, we conduct performance testing with a number of representative data parallel kernels using Cray Compiling Environment (CCE) 8.5.0, IBM’s OpenMP 4.5 Clang branch, and ICC 16 targeting KNC. Our general observation is that maturity is leading to greatly improved implementations that adhere more strictly to the specification, which is improving the success rate of acceleration. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of required directives between compilers and architectures is feasible. Our quantitative results provide further evidence that OpenMP 4.x is already capable of achieving some level of performance portability.
KW - OpenMP 4.x
KW - Performance portability
KW - Parallel programming
UR - http://www.scopus.com/inward/record.url?scp=84992562841&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-45550-1_18
DO - 10.1007/978-3-319-45550-1_18
M3 - Conference Contribution (Conference Proceeding)
AN - SCOPUS:84992562841
SN - 9783319455495
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 253
EP - 267
BT - OpenMP
PB - Springer-Verlag Berlin
T2 - 12th International Workshop on OpenMP, IWOMP 2016
Y2 - 5 October 2016 through 7 October 2016
ER -