Locality-aware CTA Clustering for modern GPUs

A. Li, S.L. Song, W. Liu, X. Liu, A. Kumar, H. Corporaal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer review

28 citations (Scopus)

Abstract

Caches are designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 with its long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we prove the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques to reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.

Original language: English
Title: ASPLOS 2017 - 22nd International Conference on Architectural Support for Programming Languages and Operating Systems
Place of publication: New York
Publisher: Association for Computing Machinery, Inc
Pages: 297-311
Number of pages: 15
ISBN (electronic): 978-1-4503-4465-4
DOIs
Status: Published - Jun 2017
Event: 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017) - Xi'an, China
Duration: 8 Apr 2017 - 12 Apr 2017
Conference number: 22

