Performance Analysis and Optimization of Clang's OpenMP 4.5 GPU Support

Matt Martineau, Simon McIntosh-Smith, Carlo Bertolli, Arpith C. Jacob, Samuel F. Antao, Alexandre Eichenberger, Gheorghe Teodor Bercea, Tong Chen, Tian Jin, Kevin O'Brien, Georgios Rokos, Hyojin Sung, Zehra Sura

Research output: Chapter in Book/Report/Conference proceeding > Conference Contribution (Conference Proceeding)

16 Citations (Scopus)

Abstract

The Clang implementation of OpenMP 4.5 now provides full support for the specification, offering the only open source option for targeting NVIDIA GPUs. While using OpenMP allows portability across different architectures, matching native CUDA performance without major code restructuring is an open research issue.

To analyze the current performance, we port a suite of representative benchmarks and the mature mini-apps TeaLeaf, CloverLeaf, and SNAP to the Clang OpenMP 4.5 compiler. We then collect performance results for those ports, and their equivalent CUDA ports, on an NVIDIA Kepler GPU. Through manual analysis of the generated code, we are able to discover the root cause of the performance differences between OpenMP and CUDA.

A number of improvements can be made to the existing compiler implementation to enable performance that approaches that of hand-optimized CUDA. Our first observation was that the generated code did not use fused-multiply-add instructions, which was resolved using an existing flag. Next we saw that the compiler was not passing any loads through non-coherent cache, and added a new flag to the compiler to assist with this problem.

We then observed that the compiler partitioning of threads and teams could be improved upon for the majority of kernels, which will guide future work to ensure that the compiler can pick more optimal defaults. We uncovered a register allocation issue with the existing implementation that, when fixed alongside the other issues, enables performance that is close to CUDA.
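
As a rough illustration of what thread and team partitioning means in OpenMP 4.5, the sketch below offloads a simple vector update and sets the partitioning explicitly with the `num_teams` and `thread_limit` clauses instead of relying on the compiler's defaults. The kernel name `saxpy` and the values 128 and 256 are hypothetical and are not taken from the paper. If the code is built without offloading support, the pragma is ignored or falls back to the host, and the loop still produces the same result.

```c
#include <stddef.h>

/* Hypothetical kernel (not from the paper): y = a * x + y.
 * The target construct offloads the loop to a device when one is available. */
void saxpy(float a, const float *x, float *y, int n)
{
    /* num_teams and thread_limit override the compiler's default
     * partitioning; 128 and 256 are illustrative values only.
     * The map clauses copy x to the device and y in both directions. */
    #pragma omp target teams distribute parallel for \
        num_teams(128) thread_limit(256) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(3.0f, x, y, n)` with every element of `x` equal to 1 and every element of `y` equal to 2 leaves every element of `y` equal to 5. Varying the two clause values per kernel is one way to explore how far the defaults sit from a well-tuned partitioning.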

Finally, we use a different set of kernels to emphasize that support for managing memory hierarchies needs to be introduced into the specification, and we propose a simple option for programming shared caches.

Original language: English
Title of host publication: Proceedings of PMBS 2016
Subtitle of host publication: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems - Held in conjunction with SC 2016: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 54-64
Number of pages: 11
ISBN (Electronic): 9781509052189
ISBN (Print): 9781509052196
DOIs
Publication status: Published - Mar 2017
Event: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, PMBS 2016 - Salt Lake City, United States
Duration: 14 Nov 2016 → …

Conference

Conference: 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computing Systems, PMBS 2016
Country: United States
City: Salt Lake City
Period: 14/11/16 → …
