## Abstract

Several techniques used to analyze and parameterize computational models composed of ordinary differential equations involve performing simulations for many combinations of parameters and initial conditions. Graphics processing units (GPUs) contain a large number of processing cores, providing a high degree of concurrency at low cost. Using NVIDIA’s Compute Unified Device Architecture (CUDA), we aim to develop tools to perform such time-consuming analyses on GPUs. To begin with, we have implemented a fourth-order Dormand-Prince Runge-Kutta scheme [1]. Preliminary results show a significant speedup compared to a CPU implementation and reveal important design considerations.
## 1. Introduction
Computational models composed of ordinary differential equations are an important tool for developing a more comprehensive understanding of biological systems. Many of the techniques used to analyze and parameterize such models involve performing simulations for thousands of parameter and initial condition values. Graphics processing units (GPUs) contain a large number of processing cores, providing a high degree of concurrency at low cost. Using NVIDIA’s Compute Unified Device Architecture (CUDA), a set of tools and a compiler that enables users to develop general-purpose GPU applications, we aim to develop tools to enable such analyses on GPUs.
## 2. Methods and considerations
We have implemented a fourth-order Dormand-Prince Runge-Kutta scheme [1] to solve a set of ordinary differential equations. To use the processing power available on modern GPUs efficiently, specific aspects of the implementation have to be considered.
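As a sketch of a single integration step, the following C function applies the standard Dormand-Prince tableau coefficients (the function and variable names, the fixed step size, and the scalar right-hand side are illustrative assumptions, not the implementation described here; the paper's solver would additionally use the embedded fourth-order solution for error control):

```c
#include <assert.h>
#include <math.h>

/* Example right-hand side: exponential decay dy/dt = -y
   (purely illustrative; not a model from the paper). */
static double decay(double t, double y) {
    (void)t;
    return -y;
}

/* One fixed-size Dormand-Prince step for a scalar ODE y' = f(t, y).
 * The embedded pair yields a 5th-order solution and a 4th-order error
 * estimate; for brevity only the 5th-order update is returned here. */
static double dopri_step(double (*f)(double, double),
                         double t, double y, double h) {
    double k1 = f(t, y);
    double k2 = f(t + h / 5.0,        y + h * (k1 / 5.0));
    double k3 = f(t + 3.0 * h / 10.0, y + h * (3.0 * k1 / 40.0 + 9.0 * k2 / 40.0));
    double k4 = f(t + 4.0 * h / 5.0,  y + h * (44.0 * k1 / 45.0 - 56.0 * k2 / 15.0
                                               + 32.0 * k3 / 9.0));
    double k5 = f(t + 8.0 * h / 9.0,  y + h * (19372.0 * k1 / 6561.0 - 25360.0 * k2 / 2187.0
                                               + 64448.0 * k3 / 6561.0 - 212.0 * k4 / 729.0));
    double k6 = f(t + h,              y + h * (9017.0 * k1 / 3168.0 - 355.0 * k2 / 33.0
                                               + 46732.0 * k3 / 5247.0 + 49.0 * k4 / 176.0
                                               - 5103.0 * k5 / 18656.0));
    /* 5th-order weights of the Dormand-Prince pair (k2 has weight zero). */
    return y + h * (35.0 * k1 / 384.0 + 500.0 * k3 / 1113.0 + 125.0 * k4 / 192.0
                    - 2187.0 * k5 / 6784.0 + 11.0 * k6 / 84.0);
}
```

Because each step depends only on the previous state, this routine maps naturally onto one GPU thread per parameter set.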
CUDA applications revolve around kernel functions, which are routines that can be executed concurrently across multiple threads. Invoking a kernel function introduces overhead. To use the GPU efficiently, we solve the system using a fixed-length inner loop (which facilitates unrolling) as part of the computation kernel, thereby increasing the amount of work performed in each kernel call.
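A minimal kernel skeleton along these lines might look as follows (all names and the trip count are illustrative assumptions, and the actual Dormand-Prince update is replaced by a one-line placeholder to keep the sketch short):

```cuda
// Each thread integrates one parameter set. A compile-time constant trip
// count lets the compiler unroll the inner loop, and many steps are taken
// per kernel launch to amortize the launch overhead.
#define INNER_STEPS 64

__global__ void integrate_kernel(const float *params, float *state,
                                 float t0, float h, int n_systems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_systems) return;

    // Copy state from global memory into registers once, before the loop.
    float y = state[i];
    float p = params[i];
    float t = t0;

    #pragma unroll
    for (int s = 0; s < INNER_STEPS; ++s) {
        // The Dormand-Prince update would go here; shown as a forward
        // Euler step for dy/dt = -p*y purely as a placeholder.
        y += h * (-p * y);
        t += h;
    }

    state[i] = y;  // single write back to global memory
}
```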
Memory allocation on the GPU is costly; therefore, all memory is allocated in a single initialization step. Since global memory access from a thread is also costly, the state variables are copied to local registers before the inner loop commences.
By retrieving from the simulation only the time points required for the meta-analysis, the time spent copying memory from the GPU is also reduced.
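On the host side, these two memory considerations can be sketched as follows (a hypothetical `integrate_kernel`, `sample_stride`, and the surrounding buffer names are assumptions for illustration; error checking is omitted):

```cuda
// Host-side sketch: device buffers are allocated once during initialization
// and reused across kernel launches; only every sample_stride-th state is
// copied back for the meta-analysis.
float *d_state = NULL, *d_params = NULL;
cudaMalloc(&d_state,  n_systems * sizeof(float));
cudaMalloc(&d_params, n_systems * sizeof(float));
cudaMemcpy(d_params, h_params, n_systems * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_state,  h_state0, n_systems * sizeof(float), cudaMemcpyHostToDevice);

float t = 0.0f;
for (int launch = 0; launch < n_launches; ++launch) {
    integrate_kernel<<<blocks, threads>>>(d_params, d_state, t, h, n_systems);
    t += h * INNER_STEPS;
    if (launch % sample_stride == 0) {  // copy back only the sampled points
        cudaMemcpy(h_samples + (launch / sample_stride) * n_systems, d_state,
                   n_systems * sizeof(float), cudaMemcpyDeviceToHost);
    }
}

cudaFree(d_state);
cudaFree(d_params);
```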
## 3. Results
Results depend strongly on the number of differential equations in the model. For small systems, we obtained speedups of 160 times on our test system with a GeForce GTX 260. For more complex models, we obtained speedups between 12 and 20 times. Simulation time courses were in agreement with time courses obtained using MATLAB R2008b.
## 4. Discussion
The work presented here is in progress. Future work includes writing an SBML interface, as well as implementing an implicit method to handle stiff problems.

| Original language | English |
|---|---|
| Publication status | Published - 2009 |