Abstract
Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. In this work, we present the techniques, performance results and considerations to accelerate small 2D convolutions using CUDA, and compare performance to a multi-threaded CPU implementation. To improve programmability and performance of applications that make heavy use of small convolutions, we argue that two improvements to software and hardware are needed: FFT libraries must be extended with a single convolution function and communication bandwidth between CPU and GPU needs to be drastically improved.
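
For readers unfamiliar with the technique named in the abstract, the following is a minimal sketch of a 2D convolution computed in the frequency domain: transform both inputs, multiply them pointwise, and transform back. It is an illustration only and not the paper's CUDA implementation. It uses plain Python with a naive DFT in place of a real FFT library, it computes a circular (wrap-around) convolution, and it assumes the kernel is no larger than the image. All function names are hypothetical.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D discrete Fourier transform of a list of lists.

    Runs in O(N^4) time; a real application would call an FFT library.
    """
    rows, cols = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            acc = 0j
            for r in range(rows):
                for c in range(cols):
                    angle = sign * 2 * cmath.pi * (u * r / rows + v * c / cols)
                    acc += x[r][c] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(rows*cols) normalization.
            out[u][v] = acc / (rows * cols) if inverse else acc
    return out


def convolve2d_via_dft(image, kernel):
    """Circular 2D convolution: forward transforms, pointwise product, inverse."""
    rows, cols = len(image), len(image[0])
    # Zero-pad the kernel to the image size so both spectra have the same shape.
    padded = [[0.0] * cols for _ in range(rows)]
    for r, kernel_row in enumerate(kernel):
        for c, value in enumerate(kernel_row):
            padded[r][c] = value
    image_spectrum = dft2(image)
    kernel_spectrum = dft2(padded)
    product = [
        [image_spectrum[u][v] * kernel_spectrum[u][v] for v in range(cols)]
        for u in range(rows)
    ]
    result = dft2(product, inverse=True)
    # Inputs are real, so the imaginary parts are only rounding noise.
    return [[value.real for value in row] for row in result]


def convolve2d_direct(image, kernel):
    """Reference circular 2D convolution computed directly in the spatial domain."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for kr, kernel_row in enumerate(kernel):
                for kc, value in enumerate(kernel_row):
                    out[r][c] += value * image[(r - kr) % rows][(c - kc) % cols]
    return out
```

Calling `convolve2d_via_dft(image, kernel)` and `convolve2d_direct(image, kernel)` on the same small inputs should agree up to floating-point rounding. The three-step structure (two forward transforms, one pointwise product, one inverse transform) is presumably what a single library-level convolution function, as the abstract argues for, would encapsulate.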
Original language | English |
---|---|
Title of host publication | Proceedings of the First Workshop on Applications for Multi and Many Core Processors, A4MMC 2010, held in conjunction with ISCA 2010, 19 June 2010, St. Malo, France |
Place of Publication | Z.pl. |
Publisher | s.n. |
Pages | 52-64 |
Publication status | Published - 2010 |