Abstract
The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this programmability challenge, this article presents a technique to fully automatically generate efficient and readable code for parallel processors (with a focus on GPUs). This is made possible by combining algorithmic skeletons, traditional compilation, and "algorithmic species," a classification of program code. Compilation starts by automatically annotating C code with class information (the algorithmic species). This code is then fed into the skeleton-based source-to-source compiler bones to generate CUDA code. To generate efficient code, bones also performs optimizations including host-accelerator transfer optimization and kernel fusion. This results in a unique approach, integrating a skeleton-based compiler for the first time into an automated flow. The benefits are demonstrated experimentally for PolyBench GPU kernels, showing geometric mean speed-ups of 1.4× and 2.4× compared to ppcg and Par4All, and for five Rodinia GPU benchmarks, showing a gap of only 1.2× compared to hand-optimized code.
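As a rough illustration of the kernel fusion optimization named in the abstract, the C sketch below merges two element-wise loops into one, so the intermediate array is never written to or read from memory. It is not output of bones and does not reproduce its species annotations. The function names (`scale_then_offset_unfused`, `scale_then_offset_fused`, `check_fusion`) and the scale-then-offset computation are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused form: two separate element-wise loops, each of which would map to
   its own GPU kernel. The intermediate array tmp is written by the first
   loop and read back by the second. (Hypothetical example, not bones output.) */
void scale_then_offset_unfused(const float *in, float *tmp, float *out,
                               size_t n, float a, float b) {
    for (size_t i = 0; i < n; i++) {
        tmp[i] = a * in[i];
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = tmp[i] + b;
    }
}

/* Fused form: a single loop (one kernel). The intermediate value lives in a
   local variable, so the tmp array and its memory traffic disappear. */
void scale_then_offset_fused(const float *in, float *out,
                             size_t n, float a, float b) {
    for (size_t i = 0; i < n; i++) {
        float t = a * in[i];
        out[i] = t + b;
    }
}

/* Self-check: both forms must produce identical results. The inputs are
   small integers, so every intermediate value is exactly representable. */
void check_fusion(void) {
    enum { N = 8 };
    float in[N], tmp[N], out_unfused[N], out_fused[N];

    for (size_t i = 0; i < N; i++) {
        in[i] = (float)i;
    }
    scale_then_offset_unfused(in, tmp, out_unfused, N, 2.0f, 1.0f);
    scale_then_offset_fused(in, out_fused, N, 2.0f, 1.0f);

    for (size_t i = 0; i < N; i++) {
        assert(out_unfused[i] == out_fused[i]);
    }
}
```

Fusing like this is only legal when each iteration of the second loop depends solely on the matching iteration of the first, which is the kind of property a classification of the loops' access patterns can expose to a compiler.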
Original language | English |
---|---|
Article number | 35 |
Pages (from-to) | 35:1-35:25 |
Journal | ACM Transactions on Architecture and Code Optimization |
Volume | 11 |
Issue number | 4 |
DOIs | |
Publication status | Published - 2014 |