MacSim: a MAC-enabled high-performance low-power SIMD architecture

T. Geng, L. Waeijen, M.C.J. Peemen, H. Corporaal, Y. He

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

8 Citations (Scopus)

Abstract

Single-Instruction-Multiple-Data (SIMD) architectures, which exploit data-level parallelism (DLP), are widely used to achieve high-performance and low-power computing. In most streaming applications, such as CNN-based detection and recognition, color space conversion, and various kinds of filters, multiply-accumulate is one of the most important and expensive operations. In this paper, we propose a high-performance low-power SIMD architecture with advanced multiply accumulator (MAC) support (MacSim) to improve computational efficiency. In addition, a smart loop tiling scheme is proposed. To support this tiling further, the MAC unit is equipped with multiple accumulator registers. Based on a Design Space Exploration (DSE) of the proposed MAC unit, a MAC instance with four accumulator registers (MAC4reg) is selected as a good choice for the target kernels. A 64-PE (processing element) 16-bit SIMD instance without MAC support is taken as the baseline. For a head-to-head comparison, a 64-PE 16-bit SIMD with MAC4reg (MacSim4) and the baseline SIMD are both implemented in HDL and synthesized with a TSMC 40nm low-power library. Five streaming application kernels are mapped to both architectures. Our experimental results show that with MAC4reg the runtime and energy consumption are reduced by up to 38% and 42%, respectively. In addition, a 4-layer CNN-based detection application is fully mapped onto the proposed MacSim4. Working at 950MHz, MacSim4 reaches a throughput of 62.4 GOPS, which meets the requirement of real-time (720p HD, 30fps) detection. The energy consumption per PE per operation is very low: 4.7pJ/Op excluding SRAM (Static Random Access Memory) and 4.8pJ/Op including a 2k-entry SRAM bank. As a prototype, the proposed SIMD is mapped onto an FPGA and can run all the kernels.
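
The abstract does not spell out the tiling scheme, so the following is only an illustrative sketch of why several accumulator registers can help a tiled multiply-accumulate loop. It assumes a simple 1-D FIR filter as the kernel. The function name `fir_mac4`, the parameter `num_acc`, and the loop order are hypothetical and are not taken from the paper. The idea shown is that each coefficient is fetched once per tile and reused by every accumulator in that tile.

```python
def fir_mac4(samples, coeffs, num_acc=4):
    """1-D FIR filter computed in tiles of `num_acc` outputs.

    Hypothetical model, not the paper's implementation: `acc` stands in
    for the accumulator registers of one MAC unit, and each coefficient
    is read once per tile and reused by every accumulator in the tile.
    """
    n_out = len(samples) - len(coeffs) + 1
    out = []
    for base in range(0, n_out, num_acc):      # tile over output positions
        tile = min(num_acc, n_out - base)      # last tile may be partial
        acc = [0] * tile                       # one entry per accumulator register
        for k, c in enumerate(coeffs):         # coefficient fetched once per tile
            for r in range(tile):
                acc[r] += c * samples[base + r + k]  # one multiply-accumulate
        out.extend(acc)                        # write back finished accumulators
    return out


if __name__ == "__main__":
    print(fir_mac4([1, 2, 3, 4, 5, 6], [1, 1, 1]))  # [6, 9, 12, 15]
```

With `num_acc=1` the same function degenerates to a loop in which every output re-reads all coefficients, which is the reuse that a single-accumulator MAC cannot exploit in this model.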

Original language: English
Title of host publication: Proceedings - 19th Euromicro Conference on Digital System Design, DSD 2016
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 160-167
Number of pages: 8
ISBN (Electronic): 978-1-5090-2817-7
Publication status: Published - 26 Oct 2016
Event: 19th Euromicro Conference on Digital System Design (DSD 2016) - Limassol, Cyprus
Duration: 31 Aug 2016 to 2 Sept 2016
Conference number: 19
http://dsd-seaa2016.cs.ucy.ac.cy/index.php?p=DSD2016

Conference

Conference: 19th Euromicro Conference on Digital System Design (DSD 2016)
Abbreviated title: DSD 2016
Country/Territory: Cyprus
City: Limassol
Period: 31/08/16 to 2/09/16

Keywords

  • High Performance
  • Loop Tiling
  • Low Power
  • MAC
  • SIMD
