In this paper we investigate the current compiler technologies supporting OpenMP 4.x offloading, and consider their ability to achieve a pragmatic level of performance on each of the intended target architectures. We consider the mechanisms with which several of the existing compiler implementations map the OpenMP model onto target architectures, discussing their divergence and considering the impact on performance portability. Following this, we conduct performance testing with a number of representative data parallel kernels using Cray Compiling Environment (CCE) 8.5.0, IBM’s OpenMP 4.5 Clang branch, and ICC 16 targeting KNC. Our general observation is that maturity is leading to greatly improved implementations that adhere more strictly to the specification, which is improving the success rate of acceleration. At the time of writing, developers will likely have to rely on the pre-processor for certain kernels to achieve functional portability, but we expect that future homogenisation of required directives between compilers and architectures is feasible. Our quantitative results provide further evidence that OpenMP 4.x is already capable of achieving some level of performance portability.
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
| Conference | 12th International Workshop on OpenMP, IWOMP 2016 |
| Period | 5 Oct 2016 → 7 Oct 2016 |
- OpenMP 4.x
- Performance portability
- Parallel programming