Locality-aware CTA Clustering for modern GPUs

A. Li, S.L. Song, W. Liu, X. Liu, A. Kumar, H. Corporaal

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

32 Citations (Scopus)


Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2 with long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is rather challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through a further quantification process, we establish the significance and commonality of such locality among GPU applications and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA Clustering and its associated software-based techniques, which reshape the default CTA scheduling so that CTAs with potential reuse are grouped on the same SM. Our techniques require no hardware modification and can be directly deployed on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures.
The results show that our proposed techniques significantly improve cache performance, reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.
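To make the idea of software-based CTA clustering concrete, the sketch below shows one illustrative way to reshape CTA dispatch order in host code. This is an assumption-laden toy, not the paper's actual algorithm: the function `cluster_remap`, the tile-column layout, and the parameter names are all hypothetical. It remaps a linear CTA id from row-major order into narrow "tile columns", so CTAs that touch neighbouring data rows are issued back-to-back; since the hardware's greedy CTA scheduler tends to make consecutive CTAs co-resident, this raises the chance that reused cache lines are still live in an SM's L1/Tex cache when the next CTA needs them.

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical remapping (illustrative only): reorder a linear CTA id in a
// gx-by-gy grid from row-major dispatch order into "tile columns" of width
// `tile`, so CTAs sharing neighbouring data rows are dispatched consecutively.
// Assumes gx is divisible by tile.
struct CtaCoord { int x; int y; };

inline CtaCoord cluster_remap(int linear_id, int gx, int gy, int tile) {
    (void)gx;                          // gx only constrains valid inputs here
    int per_col = tile * gy;           // number of CTAs in one tile-wide column
    int col     = linear_id / per_col; // which tile column this CTA falls in
    int rem     = linear_id % per_col; // position inside that column
    return CtaCoord{ col * tile + rem % tile, rem / tile };
}

// Sanity check: a CTA remapping must be a bijection onto the original grid,
// otherwise some CTAs would be skipped or executed twice.
inline bool is_bijection(int gx, int gy, int tile) {
    std::set<std::pair<int, int>> seen;
    for (int id = 0; id < gx * gy; ++id) {
        CtaCoord c = cluster_remap(id, gx, gy, tile);
        if (c.x < 0 || c.x >= gx || c.y < 0 || c.y >= gy) return false;
        seen.insert({c.x, c.y});
    }
    return static_cast<int>(seen.size()) == gx * gy;
}
```

In a kernel, the remapped coordinate would replace the raw `blockIdx` when computing data offsets; because only the index arithmetic changes, no hardware support is needed, matching the paper's claim that such scheduling reshaping can be deployed on existing GPUs.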

Original language: English
Title of host publication: ASPLOS 2017 - 22nd International Conference on Architectural Support for Programming Languages and Operating Systems
Place of publication: New York
Publisher: Association for Computing Machinery, Inc.
Number of pages: 15
ISBN (Electronic): 978-1-4503-4465-4
Publication status: Published - Jun 2017
Event: 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017) - Xi'an, China
Duration: 8 Apr 2017 - 12 Apr 2017
Conference number: 22


Conference: 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017)
Abbreviated title: ASPLOS 2017


  • Cache locality
  • CTA
  • GPU
  • Performance optimization
  • Runtime tool

