Programming tensor cores from an image processing DSL

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Tensor Core Units (TCUs) are specialized units, first introduced by NVIDIA in the Volta microarchitecture, that accelerate matrix multiplications for deep-learning and linear-algebra workloads. While these units have proved capable of delivering significant speedups for specific applications, they remain difficult to program for the average user. In this paper, we extend the Halide DSL and compiler with the ability to utilize these units when generating code for a CUDA-based NVIDIA GPGPU. To this end, we introduce a new scheduling directive along with custom lowering passes that automatically transform a Halide AST so that code can be generated for the TCUs. We evaluate the generated code and show that it can achieve over 5x speedup compared to manual Halide schedules without TCU support, while remaining within 20% of the NVIDIA cuBLAS implementations for mixed-precision GEMM and within 10% of manual CUDA implementations using WMMA intrinsics.
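The WMMA intrinsics the abstract compares against are exposed in CUDA through the `nvcuda::wmma` namespace. As a rough illustration of what such a hand-written baseline looks like (a minimal sketch of a warp-level 16x16x16 mixed-precision tile multiply, not the paper's actual generated code; the kernel name and launch layout are illustrative assumptions):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile of C = A * B.
// A: MxK half, row-major.  B: KxN half, column-major.  C: MxN float, row-major.
// Illustrative kernel: launch with dim3 grid(N/16, M/16) and 32 threads/block,
// so each block holds exactly one warp.
__global__ void wmma_gemm_sketch(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    int tile_m = blockIdx.y * 16;  // row offset of this warp's C tile
    int tile_n = blockIdx.x * 16;  // column offset of this warp's C tile

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // Accumulate over the shared K dimension in 16-wide steps.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_m * K + k, K);  // lda = K
        wmma::load_matrix_sync(b_frag, B + tile_n * K + k, K);  // ldb = K
        wmma::mma_sync(acc, a_frag, b_frag, acc);               // tensor-core MAC
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, acc, N, wmma::mem_row_major);
}
```

The point of the paper's scheduling directive is to let Halide emit code of this shape automatically, so the user never writes fragment management or `mma_sync` calls by hand.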

Original language: English
Title of host publication: Proceedings of the 23rd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2020
Editors: Sander Stuijk
Publisher: Association for Computing Machinery, Inc
Pages: 36-41
Number of pages: 6
ISBN (Electronic): 9781450371315
DOIs
Publication status: Published - 25 May 2020
Event: 23rd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2020 - St. Goar, Germany
Duration: 25 May 2020 - 26 May 2020

Conference

Conference: 23rd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2020
Country: Germany
City: St. Goar
Period: 25/05/20 - 26/05/20

Keywords

  • GPGPUs
  • Halide
  • matrix multiplication
  • tensor cores

