Matrix Multiplication Performance
David J.Meppelink ***************.edu
ulo
James Canning **************.edu
December17,2002
University of Massachutts Lowell
Department of Computer Science
1Introduction
This paper prents matrix multiplication performance results for quential and parallel implemen-tations written in C using Message Passing Interface(MPI)[12]and the parallel array language ZPL[22].Although matrix multiplication has been addresd by numerious papers in the past,this treatment is unique in that(1)it prents a detailed and reproducable configuration,(2)it pro-vides independent results of existing implementations and(3)it demonstrates a major performance limitation of the ZPL programming model.
2Test Cas
We measured the performance of six matrix multiplication implementations:
1.Sequential Classic is a triply nested loop written in C bad on the mathematical definition
of matrix multiplication(figure1).
2.Sequential BLAS calls the dgemm function of the Basic Linear Algebra Subprograms(BLAS)
library[3].BLAS provides FORTRAN and C functions for basic vector and matrix operations.
3.Sequential Strasn calls a portable implementation of Strasn’s algorithm[8,p.500]de-
veloped in C by the PRISM project of the Mathematics and Computer Science Division at Argonne National Laboratory[15].
4.MPI SUMMA is the Scalable Universal Matrix Multiplication Algorithm(SUMMA)as pre-
nted,with complete C sources,in[21].东亚杯中国vs澳大利亚
5.ZPL SUMMA is the ZPL implementation of SUMMA as prented in[16,p.86]and repro-
duced infigure2.
6.ZPL PSP is the ZPL Problem Space Promotion Matrix Product described in[5,§3.4].The
ZPL source code prented there us obsolete syntax,so we updated the code to that shown infigure3.
double*A,*B,*C;
for(i=0;i<size;i++){
for(j=0;j<size;j++){
double tmp=0.0;
for(k=0;k<size;k++){
tmp+=A[i*size+k]*B[k*size+j];
C[i*size+j]=tmp;
}
}
}
Figure1:Classic matrix multiplication written in C region M=[1..size, 1..size];
Fc=[1..size,*];
mali
Fr=[*, 1..size];
var A,B,C:[M]double;
Af:[Fc]double;
Bf:[Fr]double;
k:integer;
for k:=1to size do
[Fc]Af:=>>[1..size,k]A;
[Fr]Bf:=>>[k,1..size]B;
[M]C:=C+Af*Bf;
end;
Figure2:SUMMA written in ZPL
region IJ=[1..,1];
IK=[1..n,*, 1..n];
JK=[*, 1..n, 1..n];
IJK=[1..n, 1..n, 1..n];
var A,B,C:[IJ]double;
Af:[IK]double;
Bf:[JK]double;
[IK]Af:=<##[,Index3,1]A;
[JK]Bf:=<##[Index3,,1]B;
[IJ]C:=+<<[IJK](Af*Bf);
Figure3:PSP Matrix Product written in modern ZPL
SUMMA-ZPL PSP-ZPL
Strasn BLAS SUMMA-MPI ZPL runtime
Classic BLAS library MPI library
捐赠支出Operating System生活大爆炸第三季
Figure4:Software dependencies
3Hardware and Software Platform
We ran each of the matrix multiplication programs on two dissimilar platforms:
波士顿法律
1.A Beowulf cluster[2]of12nodes who configurations varied from a350MHz Pentium II
with64MB to a550MHz Pentium III with128MB,each running:
(a)Debian Linux2.2.20[6]
(b)LAM-MPI6.5.6[9]over a local area TCP/IP network
(c)ATLAS3.2.1ln-7[1],an open source BLAS library
(d)ZPL1.16.59with LAM-MPI runtime
mess是什么意思2.A Sun Enterpri10000Server[17]with64400MHz UltraSPARC II processors and64GB of
total memory,running:
(a)Solaris9[20]
(b)MPICH1.2.4[13]over shared memory(using the ch shmem device)
(c)Sun Performance Library,a BLAS library bundled with the Sun ONE Studio7[19]
compiler
(d)ZPL1.16.59with MPICH runtime
Since our goal was to create a reproducible and stable testing environment,we lected binary distributions of the platform libraries whenever possible.On Linux,we downloaded LAM-MPI and ATLAS binaries from the Debian web site.On Solaris,we ud Sun’s binary distribution of BLAS and opted to build MPICH rather than u an“early access”relea of Sun MPI[18]for Solaris9. Figure4shows the dependencies between the test cas and the platform libraries.
4Performance Results
We ran all six implementations on both platforms using1024×1024matrices of double-precision real numbers(C data type double)with varying numbers of processors,processor grid configurations and tuning parameters.We report here the fastest time out offive iterations for each configuration.
2
4
6
8
10
12
14
32
64
96
128
160
192
224
256
288
320
352
384
416
448
480
512
544
576
608
640
672
704
736
768
800
832
864
896
928
toilet是什么意思960
992
1024
T i m e (s e c o n d s )
maxtrix size cutoff
Beowulf cluster Sun Enterpri 10000
Figure 5:Sequential Strasn cutofftuning for 1024×1024matrices
Although most authors calculate speedup relative the fastest rial program to calculate speedup [14,p.250],we u Sequential Classic so that we can make direct comparisons between the quential and parallel implementations 1.
4.1Tuning Parameters
Before prenting comparative results,we examine opportunies to maximize the performance of eac
h implementation.
The critical tuning parameter for Sequential Strasn is “cutoff”,which is defined as the matrix size where the Strasn recursion is replaced by a call to BLAS [7].We ran the program with cutoffvalues from 4to 1024in increments of 4to produce the results shown in figure 5.The stepwi improvement indicates that the implementation us data buffers that are sized in powers of two increments.
MPI SUMMA defines a similar “blocking size”that controlls the size of the submatrices handled by BLAS and the size of each communication transfer.Our experimentation on one system with four processors showed that a block size of 1024/p (where p is the total number of processors)to be asymptotically optimal.郑重其事是什么意思
The binary distributions of the Sun Performance Library and ATLAS did not permit tuning.We ud default optimizations for the ZPL compiler.
1
We certainly anticipate that the other implementations will perform better than it does!
Beowulf cluster Enterpri10000
Implementation time(c)speedup time(c)speedup
Sequential Classic85.930—176.740—
Sequential BLAS 5.41015.9 3.17055.8
Sequential Strasn 5.00017.2 3.21055.1
Table1:Sequential performance for1024×1024matrices
4.2Processor Grid Dimensions
To test the effect of the processor grid dimensions on performance,we ran each of the parallel implementations using from1to12processors and all valid grid dimensions(1×p,2×(p/2)for2|p and3×(p/3)for3|p,where p is the number of processors).Figures6,7and8show the results for the Beowulf cluster andfigures9,10and11show the results for the Enterpri10000.墨尔本大学官网
The results show that performance for a given number of processors is directly proportional to the squareness of the processor grid.Although not to the same degree,this is demonstrated for all implementations and platforms.
It is also apparent that squareness impacts communication more than local processing.For example, MPI SUMMA,which us the relatively fast BLAS,shows dramatic improvement using square processor grids.But ZPL SUMMA,which us the relatively slow classic algorithm,shows less improvement.Notice that a square grid minimizes the data transfered between adjoining nodes becau it minimizes the number of rows and columns that are split across processors.For example, a1×9processor grid has8n splits while a3×3processor grid has6n splits.Furthermore,each processor will handle the same number of values regardless of the grid dimensions becau the total number of elements is divided across the same number of processors regardless of the grid shape.
4.3Sequential Performance
Table1shows the speedup of the quential implementations over Sequential Classic.On both platforms,the largest improvement comes from using the BLAS.The Strasn algorithm gave only a minor incremental improvement2of8%for the Beowulf cluster and no improvement for the Enterpri10000.
4.4Parallel Performance
Figures12and13show the speedup for each of the implementations as the number of processors incr
eas to the platform’s maximum.
MPI SUMMA has,by far,the best performance of the three parallel implementations3.On the Beowulf cluster,it outperforms the fastest quential implementation starting at four processors and on the Enterpri10000,it outperforms starting at two processors.This is not unexpected, 2Recall that Sequential Strasn us BLAS for matrices below the cutoffsize.
3Section4.2explains the periodic low points in the curve.