A Compiler Optimization Framework for Directive-Based GPU Computing





In the past decade, accelerators, most commonly Graphics Processing Units (GPUs), have played a key role in achieving Petascale performance and in driving efforts to reach Exascale. However, significant advances in programming models for accelerator-based systems are required to close the gap between achievable and theoretical peak performance, and these advances should not come at the cost of programmability. Directive-based programming models for accelerators, such as OpenACC and OpenMP, help non-expert programmers parallelize applications productively and enable incremental parallelization of existing codes, but they typically deliver lower performance than CUDA or OpenCL. The goal of this dissertation is to shrink this performance gap by supporting fine-grained parallelism and locality-awareness within a chip.

We propose a comprehensive loop scheduling transformation that helps users exploit the multi-dimensional thread topology within the accelerator more effectively. We develop an innovative redundant execution mode to reduce unnecessary synchronization overhead. Our data locality optimizations make use of the different types of memory in the accelerator. Where compiler analysis is insufficient, we propose an explicit directive-based approach that guides the compiler to perform specific optimizations.
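As a rough illustration of what mapping loops onto a multi-dimensional thread topology looks like at the source level, the following minimal sketch uses standard OpenACC `gang` and `vector` clauses on a nested loop. The function name `scale_matrix`, its parameters, and the particular gang/vector assignment are assumptions made for this example only; they are not taken from the dissertation and do not represent the schedule its framework generates.

```c
#include <stddef.h>

/* Hypothetical kernel for illustration: scale every element of an
 * n-by-m row-major matrix. The schedule below is one plausible mapping,
 * not the one produced by the proposed compiler framework. */
void scale_matrix(size_t n, size_t m, const float *in, float *out, float factor)
{
    /* Outer loop: iterations are distributed across gangs
     * (typically thread blocks on NVIDIA GPUs). */
    #pragma acc parallel loop gang copyin(in[0:n*m]) copyout(out[0:n*m])
    for (size_t i = 0; i < n; ++i) {
        /* Inner loop: iterations are distributed across vector lanes
         * (typically threads within a block), so consecutive j values
         * touch consecutive memory locations. */
        #pragma acc loop vector
        for (size_t j = 0; j < m; ++j) {
            out[i * m + j] = factor * in[i * m + j];
        }
    }
}
```

A compiler without OpenACC support ignores the directives and runs the loops sequentially, which is what makes this style of incremental parallelization possible.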

We have chosen to implement and evaluate our work using the OpenACC programming model, as it is the most mature and well-supported of the directive-based accelerator programming models and has multiple commercial implementations. However, the proposed methods can also be applied to OpenMP. For the hardware platform, we choose GPUs from Advanced Micro Devices (AMD) and NVIDIA, as well as Accelerated Processing Units (APUs) from AMD. We evaluate our proposed compiler framework and optimization algorithms with the SPEC and NAS OpenACC benchmarks; the results suggest that these approaches are effective in improving the overall performance of code executing on GPUs. With the data locality optimizations, we observed speedups of up to 3.22x on the NAS benchmarks and up to 2.41x on the SPEC benchmarks. In terms of overall performance, the proposed compiler framework generates code that is competitive with the state-of-the-art commercial compiler from NVIDIA PGI.



OpenACC, Compiler Construction, Compiler Optimization, Data Locality Optimization, Loop Scheduling, Register Optimization