Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design

Chao Fang, Wei Sun, Aojun Zhou, Zhongfeng Wang (Corresponding author)

Research output: Contribution to journal › Article › Academic › peer-review

5 Citations (Scopus)
6 Downloads (Pure)

Abstract

Sparse training is one of the promising techniques to reduce the computational cost of deep neural networks (DNNs) while retaining high accuracy. In particular, N:M fine-grained structured sparsity, where only N out of every M consecutive elements can be nonzero, has attracted attention due to its hardware-friendly pattern and its capability of achieving a high sparse ratio. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this article presents a computation-efficient training scheme for N:M sparse DNNs using algorithm, architecture, and dataflow co-design. At the algorithm level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both the forward and backward passes of DNN training, which can significantly reduce the computational cost while maintaining model accuracy. At the architecture level, a sparse accelerator for DNN training, namely SAT, is developed to neatly support both regular dense operations and computation-efficient N:M sparse operations. At the dataflow level, multiple optimization methods, including interleave mapping, pregeneration of N:M sparse weights, and offline scheduling, are proposed to boost the computational efficiency of SAT. Finally, the effectiveness of our training scheme is evaluated on a Xilinx VCU1525 FPGA card using various DNN models (ResNet9, ViT, VGG19, ResNet18, and ResNet50) and datasets (CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet). Experimental results show that the SAT accelerator with the BDWP sparse training method under a 2:8 sparse ratio achieves an average speedup of 1.75× over dense training, accompanied by a negligible accuracy loss of 0.56% on average. Furthermore, our proposed training scheme significantly improves the training throughput by 2.97×–25.22× and the energy efficiency by 1.36×–3.58× over prior FPGA-based accelerators.
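To make the N:M sparsity pattern concrete: below is a minimal NumPy sketch of magnitude-based N:M pruning, which keeps the N largest-magnitude weights in each consecutive group of M and zeroes the rest. This is an illustration of the general N:M pattern only, not the paper's BDWP method (which applies bidirectional pruning across both forward and backward passes); the function name `nm_prune` is our own.

```python
import numpy as np

def nm_prune(weights, n=2, m=8):
    """Keep the n largest-magnitude elements in each consecutive
    group of m weights and zero the rest (N:M structured sparsity).
    Assumes weights.size is a multiple of m."""
    w = weights.reshape(-1, m)                        # one row per group of m
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)      # zero out the dropped slots
    return (w * mask).reshape(weights.shape)

# 16 weights form two 2:8 groups; only the two largest per group survive.
w = np.arange(1.0, 17.0)
pruned = nm_prune(w, n=2, m=8)
print(np.count_nonzero(pruned))  # → 4
```

With a 2:8 ratio, each group of eight weights retains exactly two nonzeros, so the multiply-accumulate work per group drops to a quarter while the pattern stays regular enough for hardware to index efficiently.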

Original language: English
Article number: 10256041
Pages (from-to): 506-519
Number of pages: 14
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Volume: 43
Issue number: 2
DOIs
Publication status: Published - Feb 2024

Keywords

  • Algorithm-hardware codesign
  • deep neural networks (DNNs)
  • DNN training
  • neural network compression
  • pruning
  • sparse training
