The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this programmability challenge, this article presents a technique to fully automatically generate efficient and readable code for parallel processors (with a focus on GPUs). This is made possible by combining algorithmic skeletons, traditional compilation, and "algorithmic species," a classification of program code. Compilation starts by automatically annotating C code with class information (the algorithmic species). This code is then fed into the skeleton-based source-to-source compiler bones to generate CUDA code. To generate efficient code, bones also performs optimizations including host-accelerator transfer optimization and kernel fusion. This results in a unique approach, integrating a skeleton-based compiler for the first time into an automated flow. The benefits are demonstrated experimentally for PolyBench GPU kernels, showing geometric mean speed-ups of 1.4× and 2.4× compared to ppcg and Par4All, and for five Rodinia GPU benchmarks, showing a gap of only 1.2× compared to hand-optimized code.
|Journal||ACM Transactions on Architecture and Code Optimization|
|Publication status||Published - 2014|