Technical note: Parallelization of sparse matrix solver for OpenFoam using openCL

Authored by:  Qingfeng XIA

Date: March 09, 2011

Code License: free of charge without any warranty

This is a short report on my current coding progress, following up on my announcement on New Year's Day 2011.



Although GPU computation plugins for OpenFOAM (OF) exist, such as the Classic SpeedIT™ toolbox 1.1 based on CUDA™ technology, the double precision version is not free. Meanwhile, OpenCL™ is an open industry standard supported by various HPC platforms, and it can be integrated into the open source OF free of charge.


Combining MPI and OpenCL™ in OpenFOAM™ (OF) makes fast simulation affordable on a personal cluster. Although large scale problems cannot be run directly within the limited memory capacity of a GPU, domain decomposition using MPI can work around this drawback. Currently, I have parallelized the PCG (Preconditioned Conjugate Gradient) and PBiCG solvers using separate OpenCL utility code. These solvers are also compatible with the MPI environment.
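For readers unfamiliar with PCG, the following is a minimal serial sketch of the algorithm, not the code of this project. It assumes a CSR matrix layout and a simple diagonal (Jacobi) preconditioner, whereas OpenFOAM uses its own LDU storage and preconditioners; the names CsrMatrix, spmv, dot and pcgSolve are hypothetical. The sparse matrix-vector product, the dot products and the vector updates are the loops that an OpenCL port would typically turn into kernels.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CSR storage, used only for illustration.
struct CsrMatrix {
    std::size_t n;                      // number of rows
    std::vector<std::size_t> rowStart;  // size n + 1
    std::vector<std::size_t> col;       // column index of each nonzero
    std::vector<double> val;            // value of each nonzero
};

// y = A * x. Rows are independent, so one GPU work-item per row is a natural mapping.
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    for (std::size_t i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
    }
}

// Dot product; on a GPU this becomes a parallel reduction.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi-preconditioned conjugate gradient for a symmetric positive definite A.
// x holds the initial guess on entry and the solution on exit.
// Returns the number of iterations performed.
int pcgSolve(const CsrMatrix& A, const std::vector<double>& b,
             std::vector<double>& x, double tol, int maxIter) {
    const std::size_t n = A.n;
    std::vector<double> invDiag(n, 1.0), r(n), z(n), p(n), Ap(n);

    // Extract the inverse diagonal for the preconditioner.
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = A.rowStart[i]; k < A.rowStart[i + 1]; ++k)
            if (A.col[k] == i) invDiag[i] = 1.0 / A.val[k];

    spmv(A, x, Ap);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    for (std::size_t i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
    p = z;
    double rz = dot(r, z);

    int iter = 0;
    while (iter < maxIter && std::sqrt(dot(r, r)) > tol) {
        spmv(A, p, Ap);
        const double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        for (std::size_t i = 0; i < n; ++i) z[i] = invDiag[i] * r[i];
        const double rzNew = dot(r, z);
        const double beta = rzNew / rz;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        rz = rzNew;
        ++iter;
    }
    return iter;
}
```

Every iteration needs two dot products whose results are read back on the host, which is one place where a GPU version can lose time on small meshes.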


OS: Linux 32bit and 64bit


OpenFoam 1.7.1 Single precision build

Currently, the double precision openCL library does not work properly on my GPU.

IDE: Codelite

(1) clPCG and clPBiCG solver classes: parallelized versions of the serial solver classes PCG and PBiCG in OF 1.7.1.

(2) vclPCG and vclPBiCG: wrapper classes based on the CG and BiCG functions of ViennaCL 1.1. However, this part of the work is not complete, due to compilation errors when building the ViennaCL library with wmake of OpenFOAM.

The correctness of the code is checked by the comparison result from the


1 Description of profiling platform

Platform:  Sony Notebook ;  OS:  Ubuntu 10.04 32bit

Platform          CPU: i5 M450    GPU: Radeon HD4650
Frequency (GHz)   2.4             0.53
Cores (PE, GPU)   4               320
Mem (GBytes)      4 DDR3          1.0 DDR3
Mem freq (GHz)    0.79

local_work_size(0) is set to 64, which is the optimal value for ATI cards in most cases.
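When a local work size is passed explicitly to clEnqueueNDRangeKernel, OpenCL 1.x requires the global work size to be a multiple of it. A minimal sketch of the usual rounding step follows; the helper name paddedGlobalSize is hypothetical, and the kernel is assumed to guard against indices beyond the real vector length.

```cpp
#include <cstddef>

// Round n up to the next multiple of localSize (localSize must be > 0).
// The extra work-items are expected to exit early in the kernel,
// e.g. with a check such as: if (get_global_id(0) >= n) return;
std::size_t paddedGlobalSize(std::size_t n, std::size_t localSize) {
    return ((n + localSize - 1) / localSize) * localSize;
}
```

For example, the 10000 cells of a 100X100 mesh with a local size of 64 give a global size of 10048.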

The ViennaCL benchmark test shows an 80% performance improvement on the GPU over the CPU on my laptop, i.e. the speedup ratio is 1.8.

2 The profiling methods

(1) Source code insertion:

C++ profiling code (using the clock() function of the standard C library) has been inserted at the beginning and end of the main() function of icoFoam.C in order to measure the total run time of the icoFoam program. The same profiling code is inserted at the beginning and end of the solve() function in the following solvers: (a) the original serial solver classes PCG and PBiCG; (b) the clPCG and clPBiCG solver classes; (c) the vclPCG and vclPBiCG solver classes. The accumulated computation time and the total number of calls are recorded.

(2) Other third-party tools, such as TotalView. The AMD Stream SDK 2.2 contains a profiler for Windows platforms, but it does not seem to be available in the Stream SDK for Linux.

(3)OpenFoam text output

The OpenFOAM text output is used as an approximate profiling tool that requires no coding effort. Code insertion will be done only if necessary. The profiling is limited to computation time; memory profiling is not necessary.
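The source code insertion of method (1) can be sketched as below. This is an illustration rather than the code actually inserted into OpenFOAM: ClockProfile, ScopedClock and dummySolve are hypothetical names, and dummySolve only stands in for a solver's solve() function.

```cpp
#include <ctime>

// Accumulated clock() ticks and number of calls for one profiled function.
struct ClockProfile {
    std::clock_t total = 0;
    long calls = 0;
    double seconds() const {
        return static_cast<double>(total) / CLOCKS_PER_SEC;
    }
};

// Starts timing on construction and adds the elapsed ticks on destruction,
// so a single declaration at the top of solve() covers every return path.
class ScopedClock {
public:
    explicit ScopedClock(ClockProfile& profile)
        : profile_(profile), start_(std::clock()) {}
    ~ScopedClock() {
        profile_.total += std::clock() - start_;
        ++profile_.calls;
    }
    ScopedClock(const ScopedClock&) = delete;
    ScopedClock& operator=(const ScopedClock&) = delete;

private:
    ClockProfile& profile_;
    std::clock_t start_;
};

// Stand-in for solve(): does some arithmetic while being timed.
double dummySolve(ClockProfile& profile) {
    ScopedClock guard(profile);
    double sum = 0.0;
    for (int i = 1; i <= 1000; ++i) sum += 1.0 / i;
    return sum;
}
```

Note that clock() reports processor time used by the process rather than wall-clock time, so time spent blocked while waiting for a device may not be counted.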

Test cases

Case:   Lid-driven cavity

The geometry scale ratio is changed from the default 0.1 to 1, while the time step is kept at 0.005 seconds. The simulation runs from t=0.0 to 0.02s. Mesh: structured mesh generated by the blockMesh tool. However, a finer mesh needs more iterations in every time step.

The OpenFOAM output is used for profiling here, although it does not seem to agree precisely with the timing measured on the computer.

t=0.0->0.02s    icoFoam    clIcoFoam
100X100 mesh    1.0        6
200X200 mesh    9          51


Unfortunately, the OpenCL solver is about 6 times slower than the CPU code (6/1.0 = 6 on the 100X100 mesh and 51/9 ≈ 5.7 on the 200X200 mesh).


(1) Use the float4 data format to improve the performance.

(2) Function profiling should be conducted to find the bottleneck of the parallel code.

(3) Change to another hardware platform to test the code performance. Single precision has been tested on ATI; double precision scalar on an NV GPU has not been tested. Dynamically link the OpenCL library; the solver should be compiled and released as a shared library.
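The float4 idea in item (1) can be illustrated on the host side with plain C++. This is only a sketch of the data layout, assuming the tail is zero-padded to a multiple of four; Float4 is a stand-in for the OpenCL float4 type, and packFloat4 and dot4 are hypothetical helper names.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ stand-in for OpenCL's float4 vector type.
struct Float4 {
    float x, y, z, w;
};

// Pack a scalar array into Float4 elements, zero-padding the last element.
std::vector<Float4> packFloat4(const std::vector<float>& v) {
    std::vector<Float4> out((v.size() + 3) / 4, Float4{0.0f, 0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < v.size(); ++i) {
        Float4& q = out[i / 4];
        switch (i % 4) {
            case 0: q.x = v[i]; break;
            case 1: q.y = v[i]; break;
            case 2: q.z = v[i]; break;
            default: q.w = v[i]; break;
        }
    }
    return out;
}

// Dot product over packed data. Each loop step handles four scalars,
// as one work-item would in a float4 kernel; the zero padding
// contributes nothing to the sum.
float dot4(const std::vector<Float4>& a, const std::vector<Float4>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        sum += a[i].x * b[i].x + a[i].y * b[i].y
             + a[i].z * b[i].z + a[i].w * b[i].w;
    return sum;
}
```

Whether this layout actually pays off depends on the hardware and on the memory access pattern of each kernel, so it would need to be measured.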

Future work

(1) Develop a general OpenCL library in C, including a scheduler for multiple GPUs and common data structures: clVector, etc.

(2) Develop a complete set of solvers for OF, after testing and restructuring for portability, flexibility and improved performance. Other field and matrix operations have not yet been parallelized; e.g. declare a class clField {N, Alignment, MemOnGPU}.
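One possible reading of the proposed clField class is sketched below. Only the three names N, Alignment and MemOnGPU come from the note; their interpretation here (logical length, padding granularity, and a flag telling whether the device copy is current) and all member functions are assumptions, and no real device buffer is involved.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of clField {N, Alignment, MemOnGPU}.
class clField {
public:
    // alignment must be > 0; the host storage is padded up to a multiple of it.
    clField(std::size_t n, std::size_t alignment)
        : N_(n),
          alignment_(alignment),
          memOnGPU_(false),
          host_(((n + alignment - 1) / alignment) * alignment, 0.0f) {}

    std::size_t size() const { return N_; }                  // logical length
    std::size_t paddedSize() const { return host_.size(); }  // allocated length
    std::size_t alignment() const { return alignment_; }
    bool memOnGPU() const { return memOnGPU_; }

    // A real implementation would copy host_ into a device buffer here;
    // this sketch only records that the device copy is up to date.
    void markUploaded() { memOnGPU_ = true; }

    // Writing through the host side invalidates the device copy.
    float& operator[](std::size_t i) {
        memOnGPU_ = false;
        return host_[i];
    }

private:
    std::size_t N_;
    std::size_t alignment_;
    bool memOnGPU_;
    std::vector<float> host_;
};
```

Tracking the flag in the class would let a solver skip host-to-device transfers for fields that have not changed between kernel launches.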


