
Matrix Multiplication Performance
David J. Meppelink ***************.edu
James Canning **************.edu
December 17, 2002
University of Massachusetts Lowell
Department of Computer Science
1 Introduction
This paper presents matrix multiplication performance results for sequential and parallel implementations written in C using the Message Passing Interface (MPI) [12] and the parallel array language ZPL [22]. Although matrix multiplication has been addressed by numerous papers in the past, this treatment is unique in that (1) it presents a detailed and reproducible configuration, (2) it provides independent results of existing implementations and (3) it demonstrates a major performance limitation of the ZPL programming model.
2 Test Cases
We measured the performance of six matrix multiplication implementations:

1. Sequential Classic is a triply nested loop written in C based on the mathematical definition of matrix multiplication (figure 1).

2. Sequential BLAS calls the dgemm function of the Basic Linear Algebra Subprograms (BLAS) library [3]. BLAS provides FORTRAN and C functions for basic vector and matrix operations.

3. Sequential Strassen calls a portable implementation of Strassen's algorithm [8, p. 500] developed in C by the PRISM project of the Mathematics and Computer Science Division at Argonne National Laboratory [15].

4. MPI SUMMA is the Scalable Universal Matrix Multiplication Algorithm (SUMMA) as presented, with complete C sources, in [21].

5. ZPL SUMMA is the ZPL implementation of SUMMA as presented in [16, p. 86] and reproduced in figure 2.

6. ZPL PSP is the ZPL Problem Space Promotion Matrix Product described in [5, §3.4]. The ZPL source code presented there uses obsolete syntax, so we updated the code to that shown in figure 3.
double *A, *B, *C;
int i, j, k;

for (i = 0; i < size; i++) {
    for (j = 0; j < size; j++) {
        double tmp = 0.0;
        for (k = 0; k < size; k++) {
            tmp += A[i*size + k] * B[k*size + j];
        }
        C[i*size + j] = tmp;
    }
}
Figure 1: Classic matrix multiplication written in C

region M  = [1..size, 1..size];
       Fc = [1..size, *];
       Fr = [*, 1..size];
var A, B, C : [M] double;
    Af : [Fc] double;
    Bf : [Fr] double;
    k : integer;

for k := 1 to size do
  [Fc] Af := >>[1..size, k] A;
  [Fr] Bf := >>[k, 1..size] B;
  [M] C := C + Af*Bf;
end;

Figure 2: SUMMA written in ZPL
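In figure 2, the flood operator >> broadcasts column k of A and row k of B across the processor grid, so each loop iteration adds one outer product to C. A single-process C sketch of that per-iteration update (illustrating the algorithm's structure only; the distributed flooding is elided):

```c
#include <stddef.h>

/* One SUMMA step: C += (column k of A) outer (row k of B), for square
 * row-major n x n matrices. Running this for k = 0..n-1 reproduces
 * C = A*B; in the parallel version the column and row are first
 * flooded (broadcast) to every processor. */
static void summa_step(size_t n, size_t k, const double *A,
                       const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

Driving this with a loop over k accumulates the full product, which is exactly the shape of the for loop in figure 2.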
region IJ  = [1..n, 1..n, 1];
       IK  = [1..n, *, 1..n];
       JK  = [*, 1..n, 1..n];
       IJK = [1..n, 1..n, 1..n];
var A, B, C : [IJ] double;
    Af : [IK] double;
    Bf : [JK] double;

[IK] Af := <##[Index1, Index3, 1] A;
[JK] Bf := <##[Index3, Index2, 1] B;
[IJ] C  := +<<[IJK] (Af*Bf);

Figure 3: PSP Matrix Product written in modern ZPL
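Problem space promotion recasts the product in a three-dimensional index space: the operands are remapped into planes of that space, multiplied elementwise, and C is recovered by the +<< reduction over k. A plain C sketch of that formulation (our illustration, not ZPL-generated code; if the n×n×n intermediate is actually materialized, as below, it costs O(n³) memory):

```c
#include <stddef.h>
#include <stdlib.h>

/* PSP-style product: materialize the promoted 3-D problem space
 * T[i][j][k] = A[i][k] * B[k][j], then reduce over k (the +<< step
 * in figure 3). Note the O(n^3) intermediate storage, which a
 * compiler would need to optimize away for this to be competitive. */
static void psp_product(size_t n, const double *A, const double *B, double *C)
{
    double *T = malloc(n * n * n * sizeof *T);
    if (T == NULL)
        return;
    /* promotion: form all n^3 elementwise products */
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                T[(i * n + j) * n + k] = A[i * n + k] * B[k * n + j];
    /* reduction over the third axis collapses T back to C */
    for (size_t ij = 0; ij < n * n; ij++) {
        double sum = 0.0;
        for (size_t k = 0; k < n; k++)
            sum += T[ij * n + k];
        C[ij] = sum;
    }
    free(T);
}
```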
Figure 4: Software dependencies (diagram: the test cases sit atop the ZPL runtime, BLAS library, and MPI library, which in turn run on the operating system)
3 Hardware and Software Platform
We ran each of the matrix multiplication programs on two dissimilar platforms:

1. A Beowulf cluster [2] of 12 nodes whose configurations varied from a 350 MHz Pentium II with 64 MB to a 550 MHz Pentium III with 128 MB, each running:
   (a) Debian Linux 2.2.20 [6]
   (b) LAM-MPI 6.5.6 [9] over a local area TCP/IP network
   (c) ATLAS 3.2.1ln-7 [1], an open source BLAS library
   (d) ZPL 1.16.59 with LAM-MPI runtime

2. A Sun Enterprise 10000 Server [17] with 64 400-MHz UltraSPARC II processors and 64 GB of total memory, running:
   (a) Solaris 9 [20]
   (b) MPICH 1.2.4 [13] over shared memory (using the ch_shmem device)
   (c) Sun Performance Library, a BLAS library bundled with the Sun ONE Studio 7 [19] compiler
   (d) ZPL 1.16.59 with MPICH runtime
Since our goal was to create a reproducible and stable testing environment, we selected binary distributions of the platform libraries whenever possible. On Linux, we downloaded LAM-MPI and ATLAS binaries from the Debian web site. On Solaris, we used Sun's binary distribution of BLAS and opted to build MPICH rather than use an "early access" release of Sun MPI [18] for Solaris 9. Figure 4 shows the dependencies between the test cases and the platform libraries.
4 Performance Results
We ran all six implementations on both platforms using 1024×1024 matrices of double-precision real numbers (C data type double) with varying numbers of processors, processor grid configurations and tuning parameters. We report here the fastest time out of five iterations for each configuration.
Figure 5: Sequential Strassen cutoff tuning for 1024×1024 matrices (time in seconds versus cutoff matrix size, for the Beowulf cluster and the Sun Enterprise 10000)
Although most authors calculate speedup relative to the fastest serial program [14, p. 250], we use Sequential Classic so that we can make direct comparisons between the sequential and parallel implementations.¹
4.1 Tuning Parameters
Before presenting comparative results, we examine opportunities to maximize the performance of each implementation.
The critical tuning parameter for Sequential Strassen is "cutoff", which is defined as the matrix size where the Strassen recursion is replaced by a call to BLAS [7]. We ran the program with cutoff values from 4 to 1024 in increments of 4 to produce the results shown in figure 5. The stepwise improvement indicates that the implementation uses data buffers that are sized in power-of-two increments.
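The cutoff's effect can be made concrete by counting scalar multiplications: each Strassen level trades 8 half-size products for 7, and a size-c block below the cutoff costs c³ multiplications in BLAS. A small sketch of that count (an illustration of the tradeoff, not the PRISM implementation):

```c
#include <stdint.h>

/* Scalar multiplications used by Strassen on an n x n product (n a
 * power of two) when recursion stops at `cutoff`: 7 recursive calls
 * per level, and a classic n^3 product below the cutoff. This ignores
 * the extra O(n^2) additions per level, which is why a cutoff that is
 * too small still loses in practice. */
static uint64_t strassen_mults(uint64_t n, uint64_t cutoff)
{
    if (n <= cutoff)
        return n * n * n;   /* block handled by a single BLAS call */
    return 7 * strassen_mults(n / 2, cutoff);
}
```

For n = 1024, a cutoff of 1024 (no recursion) costs 1024³ multiplications, while one level of recursion (cutoff 512) already drops that by a factor of 8/7.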
MPI SUMMA defines a similar "blocking size" that controls the size of the submatrices handled by BLAS and the size of each communication transfer. Our experimentation on one system with four processors showed a block size of 1024/p (where p is the total number of processors) to be asymptotically optimal.
The binary distributions of the Sun Performance Library and ATLAS did not permit tuning. We used default optimizations for the ZPL compiler.
¹We certainly anticipate that the other implementations will perform better than it does!
                        Beowulf cluster        Enterprise 10000
Implementation          time (sec)  speedup    time (sec)  speedup
Sequential Classic          85.930       —        176.740       —
Sequential BLAS              5.410    15.9          3.170    55.8
Sequential Strassen          5.000    17.2          3.210    55.1

Table 1: Sequential performance for 1024×1024 matrices
4.2 Processor Grid Dimensions
To test the effect of the processor grid dimensions on performance, we ran each of the parallel implementations using from 1 to 12 processors and all valid grid dimensions (1×p, 2×(p/2) for 2|p and 3×(p/3) for 3|p, where p is the number of processors). Figures 6, 7 and 8 show the results for the Beowulf cluster and figures 9, 10 and 11 show the results for the Enterprise 10000.
The results show that performance for a given number of processors is directly proportional to the squareness of the processor grid. Although not to the same degree, this is demonstrated for all implementations and platforms.
It is also apparent that squareness impacts communication more than local processing. For example, MPI SUMMA, which uses the relatively fast BLAS, shows dramatic improvement using square processor grids. But ZPL SUMMA, which uses the relatively slow classic algorithm, shows less improvement. Notice that a square grid minimizes the data transferred between adjoining nodes because it minimizes the number of rows and columns that are split across processors. For example, a 1×9 processor grid has 8n splits while a 3×3 processor grid has 6n splits. Furthermore, each processor will handle the same number of values regardless of the grid dimensions because the total number of elements is divided across the same number of processors regardless of the grid shape.
4.3 Sequential Performance
Table 1 shows the speedup of the sequential implementations over Sequential Classic. On both platforms, the largest improvement comes from using the BLAS. The Strassen algorithm gave only a minor incremental improvement² of 8% for the Beowulf cluster and no improvement for the Enterprise 10000.
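Speedup here is simply the Sequential Classic time divided by the implementation's time; for example, 85.930/5.410 ≈ 15.9 on the Beowulf cluster. A trivial check of the table's arithmetic:

```c
/* Speedup relative to Sequential Classic: T_classic / T_impl.
 * The times below are the measured values from table 1. */
static double speedup(double t_classic, double t_impl)
{
    return t_classic / t_impl;
}
```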
4.4 Parallel Performance
Figures 12 and 13 show the speedup for each of the implementations as the number of processors increases to the platform's maximum.
MPI SUMMA has, by far, the best performance of the three parallel implementations³. On the Beowulf cluster, it outperforms the fastest sequential implementation starting at four processors and on the Enterprise 10000, it outperforms starting at two processors. This is not unexpected,

²Recall that Sequential Strassen uses BLAS for matrices below the cutoff size.
³Section 4.2 explains the periodic low points in the curve.
