A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

S. Tabik, M. Peemen, L. F. Romero

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. This requires exploring the large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the tuning space and programming effort for iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with high halo update cost (> 25%; this percentage can be measured or estimated experimentally for each individual stencil) and high register and shared memory accesses must not be considered in the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the fully decomposed stencil without tiling and 5.3× faster than the fully fused version on NVIDIA GPUs.
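
The fission/fusion trade-off described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two stages (`smooth`, `laplace`) are hypothetical 1D stencils rather than the Anisotropic Nonlinear Diffusion kernels, the code runs on the CPU rather than a GPU, and `halo_fraction` is a purely geometric ratio, not the measured halo update cost that the 25% threshold refers to. It only shows that fusing two radius-1 stages removes the intermediate array but widens the halo to radius 2 and recomputes the first stage.

```python
# Toy 1D illustration of stencil fission vs. fusion (hypothetical stages,
# not the Anisotropic Nonlinear Diffusion kernels from the paper).

def smooth(u, i):
    """Stage A: 3-point average, halo radius 1."""
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0

def laplace(left, mid, right):
    """Stage B: 3-point second difference, halo radius 1."""
    return left - 2.0 * mid + right

def run_split(u):
    """Fission: two passes with an intermediate array between them."""
    n = len(u)
    v = [0.0] * n
    for i in range(1, n - 1):
        v[i] = smooth(u, i)
    w = [0.0] * n
    for i in range(2, n - 2):
        w[i] = laplace(v[i - 1], v[i], v[i + 1])
    return w

def run_fused(u):
    """Fusion: one pass, no intermediate array, halo radius 1 + 1 = 2."""
    n = len(u)
    w = [0.0] * n
    for i in range(2, n - 2):
        # Stage A is recomputed for each neighbour that stage B needs.
        w[i] = laplace(smooth(u, i - 1), smooth(u, i), smooth(u, i + 1))
    return w

def halo_fraction(tile, radius):
    """Assumed geometric metric: share of a padded cubic tile that is halo."""
    padded = (tile + 2 * radius) ** 3
    return (padded - tile ** 3) / padded

if __name__ == "__main__":
    u = [float(i * i % 7) for i in range(16)]
    print("same result:", run_split(u) == run_fused(u))
    print("halo share, radius 1 vs 2:", halo_fraction(8, 1), halo_fraction(8, 2))
```

Both variants produce the same output, so the choice between them is purely a performance question: the split version pays for an extra pass over memory, while the fused version pays for a larger halo and redundant work, which is the kind of cost the tuning approach weighs per stencil.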

Original language: English
Pages (from-to): 1580-1608
Number of pages: 29
Journal: Journal of Supercomputing
Volume: 74
Issue number: 4
DOI: 10.1007/s11227-017-2184-6
Publication status: Published - 1 Apr 2018

Keywords

  • 3d images
  • 3d stencils
  • Anisotropic Nonlinear Diffusion
  • Fission
  • Fusion
  • GPUs
  • Tiling

Cite this

@article{1f00889fafd34e8284db21870cb13d0a,
title = "A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study",
abstract = "This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue to these applications is to determine the optimal fission/fusion level of the involved stages and whether that combination benefits from data tiling. This implies exploring a large space of all the possible fission/fusion combinations with and without tiling, thus making the process non-trivial. This study provides insights to reduce the optimization tuning space and programming effort of iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with high halos update cost (> 25 {\%} , this percentage can be measured or estimated experimentally for each single stencil) and high registers and shared memory accesses must not be considered in the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the case in which we fully decompose our stencil without tiling and 5.3× faster with respect to the fully fused version on the NVIDIA GPUs.",
keywords = "3d images, 3d stencils, Anisotropic Nonlinear Diffusion, Fission, Fusion, GPUs, Tiling",
author = "S. Tabik and M. Peemen and Romero, {L. F.}",
year = "2018",
month = "4",
day = "1",
doi = "10.1007/s11227-017-2184-6",
language = "English",
volume = "74",
pages = "1580--1608",
journal = "Journal of Supercomputing",
issn = "0920-8542",
publisher = "Springer",
number = "4",

}

A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study. / Tabik, S.; Peemen, M.; Romero, L. F.

In: Journal of Supercomputing, Vol. 74, No. 4, 01.04.2018, p. 1580-1608.

TY - JOUR
T1 - A tuning approach for iterative multiple 3d stencil pipeline on GPUs
T2 - Anisotropic Nonlinear Diffusion algorithm as case study
AU - Tabik, S.
AU - Peemen, M.
AU - Romero, L. F.
PY - 2018/4/1
Y1 - 2018/4/1
AB - This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue to these applications is to determine the optimal fission/fusion level of the involved stages and whether that combination benefits from data tiling. This implies exploring a large space of all the possible fission/fusion combinations with and without tiling, thus making the process non-trivial. This study provides insights to reduce the optimization tuning space and programming effort of iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with high halos update cost (> 25 % , this percentage can be measured or estimated experimentally for each single stencil) and high registers and shared memory accesses must not be considered in the exploration process. The optimal fission/fusion combination is up to 1.65× faster than the case in which we fully decompose our stencil without tiling and 5.3× faster with respect to the fully fused version on the NVIDIA GPUs.
KW - 3d images
KW - 3d stencils
KW - Anisotropic Nonlinear Diffusion
KW - Fission
KW - Fusion
KW - GPUs
KW - Tiling
UR - http://www.scopus.com/inward/record.url?scp=85033433676&partnerID=8YFLogxK
U2 - 10.1007/s11227-017-2184-6
DO - 10.1007/s11227-017-2184-6
M3 - Article
AN - SCOPUS:85033433676
VL - 74
SP - 1580
EP - 1608
JO - Journal of Supercomputing
JF - Journal of Supercomputing
SN - 0920-8542
IS - 4
ER -