TY - GEN
T1 - A study of the potential of locality-aware thread scheduling for GPUs
AU - Nugteren, C.
AU - van den Braak, G.J.W.
AU - Corporaal, H.
PY - 2014
Y1 - 2014
N2 - Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or loop tiling. This work makes a case for locality-aware thread scheduling: re-ordering threads automatically for better locality to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering, among others, cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies the potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. For example, a case study of a naive matrix multiplication shows an 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
AB - Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or loop tiling. This work makes a case for locality-aware thread scheduling: re-ordering threads automatically for better locality to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering, among others, cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies the potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. For example, a case study of a naive matrix multiplication shows an 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
U2 - 10.1007/978-3-319-14313-2_13
DO - 10.1007/978-3-319-14313-2_13
M3 - Conference contribution
SN - 978-3-319-14312-5
T3 - Lecture Notes in Computer Science
SP - 146
EP - 157
BT - Euro-Par 2014: Parallel Processing Workshops: Euro-Par 2014 International Workshops, Porto, Portugal, August 25-26, 2014, Revised Selected Papers, Part II
A2 - Lopes, L.
A2 - Zilinskas, J.
PB - Springer
CY - Berlin
T2 - 7th International Workshop on Multi-/Many-Core Computing Systems
Y2 - 26 August 2014 through 26 August 2014
ER -