Parallel Computing Toolbox 

This demo looks at how we can benchmark the solving of a linear system on the GPU. The MATLAB® code to solve for x in A*x = b is very simple. Most frequently, we use matrix left division, also known as mldivide or the backslash operator (\), to calculate x (that is, x = A\b).
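As a minimal sketch of the idea (separate from the benchmark below, and assuming a supported GPU is present), the same backslash syntax works on the GPU once the data has been moved there with gpuArray:

```matlab
% Minimal sketch: solve A*x = b on the CPU, then on the GPU.
% Assumes a supported GPU is available (gpuDevice succeeds).
n = 1000;
A = rand(n) + n*eye(n);   % diagonally dominant, hence well-conditioned
b = rand(n, 1);

xCPU = A\b;               % mldivide on the CPU

Agpu = gpuArray(A);       % transfer the data to the GPU
bgpu = gpuArray(b);
xGPU = Agpu\bgpu;         % backslash is overloaded for gpuArray inputs

res = norm(A*gather(xGPU) - b);   % gather brings the result back to the CPU
```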
The code shown in this demo can be found in this function:
function results = paralleldemo_gpu_backslash(maxMemory)
It is important to choose the appropriate matrix size for the computations. We can do this by specifying the amount of system memory in GB available to the CPU and the GPU. The default value is based only on the amount of memory available on the GPU, and you can specify a value that is appropriate for your system.
if nargin == 0
    g = gpuDevice;
    maxMemory = 0.4*g.FreeMemory/1024^3;
end
We want to benchmark matrix left division (\) itself, and not the cost of transferring data between the CPU and GPU, the time it takes to create a matrix, or other overhead. We therefore separate the data generation from the solving of the linear system, and measure only the time it takes to do the latter.
function [A, b] = getData(n, clz)
    fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
    A = rand(n, n, clz) + 100*eye(n, n, clz);
    b = rand(n, 1, clz);
end

function time = timeSolve(A, b)
    tic;
    x = A\b; %#ok<NASGU> We don't need the value of x.
    time = toc;
end
As with a great number of other parallel algorithms, the performance of solving a linear system in parallel depends greatly on the matrix size. As seen in other demos, such as Benchmarking A\b, we compare the performance of the algorithm for different matrix sizes.
% Declare the matrix sizes to be a multiple of 1024.
maxSizeSingle = floor(sqrt(maxMemory*1024^3/4));
maxSizeDouble = floor(sqrt(maxMemory*1024^3/8));
step = 1024;
if maxSizeDouble/step >= 10
    step = step*floor(maxSizeDouble/(5*step));
end
sizeSingle = 1024:step:maxSizeSingle;
sizeDouble = 1024:step:maxSizeDouble;
We use the number of floating-point operations per second as our measure of performance because that allows us to compare the performance of the algorithm for different matrix sizes.
Given a matrix size, the benchmarking function creates the matrix A and the right-hand side b once, and then solves A\b a few times to get an accurate measure of the time it takes. We use the floating-point operations count of the HPC Challenge, so that for an n-by-n matrix, we count the floating-point operations as 2/3*n^3 + 3/2*n^2.
function gflops = benchFcn(A, b)
    numReps = 3;
    time = inf;
    % We solve the linear system a few times and calculate the
    % gigaflops based on the best time.
    for itr = 1:numReps
        tcurr = timeSolve(A, b);
        time = min(tcurr, time);
    end
    n = size(A, 1);
    flop = 2/3*n^3 + 3/2*n^2;
    gflops = flop/time/1e9;
end
Having done all the setup, it is straightforward to execute the benchmarks. However, the computations can take a long time to complete, so we print some intermediate status information as we complete the benchmarking for each matrix size. We also encapsulate the loop over all the matrix sizes in a function, to benchmark both single- and double-precision computations.
function [gflopsCPU, gflopsGPU] = executeBenchmarks(clz, sizes)
    fprintf(['Starting benchmarks with %d different %s-precision ' ...
             'matrices of sizes\nranging from %d-by-%d to %d-by-%d.\n'], ...
            length(sizes), clz, sizes(1), sizes(1), sizes(end), sizes(end));
    gflopsGPU = zeros(size(sizes));
    gflopsCPU = zeros(size(sizes));
    for i = 1:length(sizes)
        n = sizes(i);
        [A, b] = getData(n, clz);
        gflopsCPU(i) = benchFcn(A, b);
        fprintf('Gigaflops on CPU: %f\n', gflopsCPU(i));
        A = gpuArray(A);
        b = gpuArray(b);
        gflopsGPU(i) = benchFcn(A, b);
        fprintf('Gigaflops on GPU: %f\n', gflopsGPU(i));
    end
end
We then execute the benchmarks in single and double precision.
[cpu, gpu] = executeBenchmarks('single', sizeSingle);
results.sizeSingle = sizeSingle;
results.gflopsSingleCPU = cpu;
results.gflopsSingleGPU = gpu;
[cpu, gpu] = executeBenchmarks('double', sizeDouble);
results.sizeDouble = sizeDouble;
results.gflopsDoubleCPU = cpu;
results.gflopsDoubleGPU = gpu;
Starting benchmarks with 8 different single-precision matrices of sizes
ranging from 1024-by-1024 to 22528-by-22528.
Creating a matrix of size 1024-by-1024.
Gigaflops on CPU: 38.524366
Gigaflops on GPU: 49.643675
Creating a matrix of size 4096-by-4096.
Gigaflops on CPU: 76.529952
Gigaflops on GPU: 248.271671
Creating a matrix of size 7168-by-7168.
Gigaflops on CPU: 92.038400
Gigaflops on GPU: 386.173617
Creating a matrix of size 10240-by-10240.
Gigaflops on CPU: 98.728781
Gigaflops on GPU: 449.104701
Creating a matrix of size 13312-by-13312.
Gigaflops on CPU: 102.648603
Gigaflops on GPU: 485.947657
Creating a matrix of size 16384-by-16384.
Gigaflops on CPU: 104.272479
Gigaflops on GPU: 509.032897
Creating a matrix of size 19456-by-19456.
Gigaflops on CPU: 107.291224
Gigaflops on GPU: 521.987594
Creating a matrix of size 22528-by-22528.
Gigaflops on CPU: 108.283695
Gigaflops on GPU: 533.918845
Starting benchmarks with 6 different double-precision matrices of sizes
ranging from 1024-by-1024 to 16384-by-16384.
Creating a matrix of size 1024-by-1024.
Gigaflops on CPU: 23.569247
Gigaflops on GPU: 40.775307
Creating a matrix of size 4096-by-4096.
Gigaflops on CPU: 41.149421
Gigaflops on GPU: 126.938916
Creating a matrix of size 7168-by-7168.
Gigaflops on CPU: 47.662219
Gigaflops on GPU: 149.164211
Creating a matrix of size 10240-by-10240.
Gigaflops on CPU: 50.786232
Gigaflops on GPU: 157.104799
Creating a matrix of size 13312-by-13312.
Gigaflops on CPU: 52.737269
Gigaflops on GPU: 161.476416
Creating a matrix of size 16384-by-16384.
Gigaflops on CPU: 53.273913
Gigaflops on GPU: 163.386311
We can now plot the results, and compare the performance on the CPU and the GPU, both for single and double precision.
First, we look at the performance of the backslash operator in single precision.
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, results.gflopsSingleGPU, 'x', ...
     results.sizeSingle, results.gflopsSingleCPU, 'o')
legend('GPU', 'CPU', 'Location', 'NorthWest');
title(ax, 'Single-precision performance')
ylabel(ax, 'Gigaflops');
xlabel(ax, 'Matrix size');
drawnow;
Now, we look at the performance of the backslash operator in double precision.
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeDouble, results.gflopsDoubleGPU, 'x', ...
     results.sizeDouble, results.gflopsDoubleCPU, 'o')
legend('GPU', 'CPU', 'Location', 'NorthWest');
title(ax, 'Double-precision performance')
ylabel(ax, 'Gigaflops');
xlabel(ax, 'Matrix size');
drawnow;
Finally, we look at the speedup of the backslash operator when comparing the GPU to the CPU.
speedupDouble = results.gflopsDoubleGPU./results.gflopsDoubleCPU;
speedupSingle = results.gflopsSingleGPU./results.gflopsSingleCPU;
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, speedupSingle, 'v', ...
     results.sizeDouble, speedupDouble, '*')
legend('Single-precision', 'Double-precision', 'Location', 'NorthWest');
title(ax, 'Speedup of computations on GPU compared to CPU');
ylabel(ax, 'Speedup');
xlabel(ax, 'Matrix size');
drawnow;
end
ans =

            sizeSingle: [1024 4096 7168 10240 13312 16384 19456 22528]
       gflopsSingleCPU: [1x8 double]
       gflopsSingleGPU: [1x8 double]
            sizeDouble: [1024 4096 7168 10240 13312 16384]
       gflopsDoubleCPU: [23.5692 41.1494 47.6622 50.7862 52.7373 53.2739]
       gflopsDoubleGPU: [40.7753 126.9389 149.1642 157.1048 161.4764 163.3863]
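To run the benchmark with an explicit memory budget rather than the GPU-derived default, pass the budget in GB as the argument (the value 1 below is a hypothetical budget chosen for illustration):

```matlab
% Cap matrix sizes to roughly 1 GB of memory (hypothetical budget).
results = paralleldemo_gpu_backslash(1);

% Or let the function size the problem from the GPU's free memory.
results = paralleldemo_gpu_backslash();
```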