Can High Performance Software DSM Systems Designed With InfiniBand
Features Benefit from PCI-Express?
Ranjit Noronha and Dhabaleswar K. Panda
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{noronha,panda}@cse.ohio-state.edu
Abstract
The performance of software distributed shared memory systems has traditionally been lower than that of other programming models, primarily because of overhead in the coherency protocol, communication bottlenecks and slow networks. Software DSMs have benefited from interconnection technologies like Myrinet, InfiniBand and Quadrics, which offer low latency and high bandwidth communication. Additionally, some of their features, like RDMA and atomic operations, have been used to implement some portions of the software DSM protocols directly, further reducing overhead. Such network-aware protocols are dependent on the characteristics of the networks. The performance of the networks is in turn dependent on the system architecture, especially the I/O bus. PCI-Express, the successor to the PCI-X architecture, offers improved latency and bandwidth characteristics. In this paper, we evaluate the impact of using an improved bus technology like PCI-Express on the performance of software DSM protocols which use the network features of InfiniBand. We see a reduction in application execution time of up to 13% at four nodes when PCI-Express is used instead of PCI-X.
Keywords: DSM Systems, Cache coherency protocol, InfiniBand, System Area Networks
1 Introduction

Software Distributed Shared Memory (SDSM) systems allow developers to easily program parallel applications without having to worry about explicit communication of shared data items. The penalty for this is the higher overhead from false sharing, which leads to an increase in communication. The cost of communication was significant on older generations of networks like Gigabit Ethernet. Advances in interconnection technologies like Myrinet, InfiniBand and Quadrics have reduced this cost, but the performance of these networks is in turn limited by the system architecture, especially the I/O bus. Current generation systems employ the PCI-X architecture, which can provide bandwidth up to 1 GigaByte/sec. InfiniBand 4X, which can deliver an aggregate bandwidth of up to 2 GigaBytes/sec, is thus limited by the PCI-X architecture. Additionally, the shared architecture of PCI-X further impinges on the performance of InfiniBand 4X.

Figure 1. DSM client-server communication model.

PCI-Express, the successor to PCI-X, employs serial point-to-point links. Each link is made of x1, x2, x4, x8, x12, x16 or x32 lanes. Each lane can transmit signals in both directions. The links can sustain up to 2.5 GigaBits/sec/lane/direction (0.5 GigaBytes/sec aggregate bandwidth per lane). InfiniBand 4X can be fully utilized using PCI-Express x4 and higher interfaces.
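As a rough consistency check using the per-lane figure quoted above, the aggregate bandwidth of a link scales linearly with the number of lanes:

    aggregate bandwidth = 0.5 GigaBytes/sec/lane x (number of lanes)
    x4: 0.5 x 4 = 2 GigaBytes/sec (matching the 2 GigaBytes/sec aggregate of InfiniBand 4X)
    x8: 0.5 x 8 = 4 GigaBytes/sec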
In this paper, we evaluate the impact of improved bus technology on the performance of the network-based protocol NEWGENDSM. This protocol, NEWGENDSM, is potentially sensitive to communication latency and bandwidth. Bus technologies like PCI-Express allow InfiniBand 4X to achieve higher bandwidth and lower latency. In Section 2, we discuss software distributed shared memory, networking and bus technologies. In Section 3, we provide an overview of the protocol NEWGENDSM. In Section 4, we discuss the impact of PCI-Express on NEWGENDSM, both in terms of micro-benchmarks and applications. Finally, in Section 6, we present our conclusions and discuss future work.
2 Background

In this section, we discuss some of the issues relating to SDSMs, networks and bus architectures. In particular, in Section 2.1, we discuss a popular SDSM package, HLRC. In Section 2.2, modern networks are discussed. In Section 2.3, the bus architectures PCI-X and PCI-Express are discussed.
2.1 Overview of Software DSM
Software Distributed Shared Memory systems provide an easy programming environment to the application developer. However, the penalty for this ease of programmability is lower performance from false sharing and increased communication. SDSMs can mainly be categorized into those which implement the non-home based lazy release consistency model, like TreadMarks [9], and those that implement the home based model.

HLRC [20] is a popular software DSM which runs without any modification to the operating system kernel. HLRC implements the home based lazy release consistency protocol and relies on user-level memory management techniques to provide coherency among participating processes. The communication protocol used is the Virtual Interface Architecture, or VIA [7]. The VIA implementation of HLRC has been ported to work with InfiniBand VAPI. The basic design of HLRC, as well as its successors with InfiniBand network features, is described in Section 3. In the next section, we discuss the networking technology InfiniBand.
2.2 Overview of Network Technologies
In the high performance computing arena, there are three popular technologies, namely Myrinet, InfiniBand and Quadrics. These networks allow low latency and high bandwidth communication. They also allow the application to use features like Remote Direct Memory Access (RDMA), atomic operations and hardware multicast.

InfiniBand [1] uses a switched, channel-based interconnection fabric, which allows for higher bandwidth, more reliability and better QoS support. The Mellanox implementation of the InfiniBand Verbs API (VAPI) supports the basic send-receive model and Remote Direct Memory Access (RDMA) operations. The communication architecture used is the work-queue, completion-queue model. In this model, the sender posts work request descriptors to a queue. The HCA, in turn, processes the requests and inserts result descriptors in the completion queue. The sender may check the status of a request by polling the completion queue. RDMA Write allows the user to write data elements directly into the memory space of remote nodes. RDMA Read allows an application to directly read data from the memory space of remote nodes. There is also support for atomic operations and multicast. The current generation of adapters from Mellanox is 4X. In InfiniBand 4X, small message latency is of the order of a few microseconds, while unidirectional bandwidth for large messages is up to 1 GigaByte/sec.
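To make the work-queue, completion-queue model concrete, the sketch below posts a single RDMA Write and polls for its completion using the current OpenFabrics libibverbs interface (the paper itself uses the older Mellanox VAPI, whose calls differ in name but follow the same model). The queue pair, completion queue and memory region are assumed to have been created and connected beforehand.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Write 'len' bytes from a registered local buffer into a remote buffer
 * identified by (remote_addr, rkey), then wait for the completion.
 * qp, cq and mr are assumed to come from the usual setup sequence
 * (ibv_open_device, ibv_alloc_pd, ibv_reg_mr, ibv_create_cq, ...). */
static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *local_buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,     /* RDMA Write work request */
        .send_flags = IBV_SEND_SIGNALED,     /* generate a completion   */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))     /* post to the send queue  */
        return -1;

    struct ibv_wc wc;                        /* poll the completion queue */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}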
2.3 Overview of Bus Technologies
PCI [2] has been the dominant standard interconnecting the I/O and memory modules of a computer system for almost a decade. All other computer system components besides PCI, such as the CPU, memory, etc., have been scaling in some form or the other with time. The low bandwidth (132 MegaBytes/sec) of PCI posed a serious bottleneck. PCI-X, a 64-bit/133 MHz enhancement of PCI, provides aggregate bandwidth up to 1 GigaByte/sec. The shared nature of the PCI architecture, where several I/O devices could share a single bus, however, still has limited scalability. As shown in Table 1, the PCI architecture could scale in terms of bandwidth, but at the expense of the number of devices it could serve. The introduction of PCI-Express alleviated some of these limitations. PCI-Express [3] employs serial point-to-point links between memory and I/O devices, as shown in Figure 3.
Figure 2. Performance comparison between PCI-X and PCI-Express using InfiniBand in terms of latency and bandwidth micro-benchmarks (RDMA Write and RDMA Read latency in microseconds and bandwidth in MegaBytes/sec versus message size).
Table 1. Capabilities of 64-bit PCI bus [16]

Interface          Cards/Bus    Bandwidth (MegaBytes/sec)
PCI (66 MHz)       1-2          532
PCI-X (133 MHz)    1-2          2132
3.1.1 Page Fetching in ASYNC

Initially, all pages are assigned default home nodes. Let us assume that the default home for page two is node two. Now node one first requests page two from node two by sending it a message (RDMA Write with immediate data), which is processed by the asynchronous protocol handler. The handler on node two appoints node one as the home node and updates its data structures. It then finishes servicing the request by sending a reply to node one (RDMA Write). Node one, on receiving this reply, updates its data structures and continues computing. Now when node zero requests this page for the first time by sending a request to the default home, node two, node two forwards this request to node one, which in turn processes the request and replies to node zero with the correct version of the page. We define page fetch time, or page time, as the elapsed time between sending a page request and actually receiving the page.
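The decision logic of the asynchronous handler described above can be summarized by the following sketch. The data structures and helper functions (page_table, send_home_grant, forward_page_request, send_page) are hypothetical stand-ins for HLRC's internal routines, which the paper does not spell out; only the branching follows the description in the text, and the replies are sent via RDMA Write as stated above.

#define NO_HOME (-1)

/* Hypothetical per-page state; the real HLRC data structures differ. */
struct page_entry {
    int home;        /* current home node, or NO_HOME at the default home */
    int version;     /* version/timestamp of the local copy               */
};

extern struct page_entry page_table[];   /* one entry per shared page */
extern int my_node_id;

/* Hypothetical reply helpers; in the protocol the replies are RDMA Writes. */
extern void send_home_grant(int requester, int page);
extern void forward_page_request(int new_home, int requester, int page);
extern void send_page(int requester, int page, int version);

/* Asynchronous protocol handler run at the default home node when a page
 * request (delivered as an RDMA Write with immediate data) arrives. */
void handle_page_request(int requester, int page)
{
    struct page_entry *pe = &page_table[page];

    if (pe->home == NO_HOME) {
        /* First request ever: appoint the requester as home and reply. */
        pe->home = requester;
        send_home_grant(requester, page);
    } else if (pe->home != my_node_id) {
        /* The home has already been assigned elsewhere: forward. */
        forward_page_request(pe->home, requester, page);
    } else {
        /* We are the home: reply with the current version of the page. */
        send_page(requester, page, pe->version);
    }
}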
3.1.2 Diffing in ASYNC

Figure 5. The original ASYNC protocol on a diff (for an example scenario)

Now we go through the protocol steps when computing and applying diffs. As shown in Figure 5, node zero arrives at a synchronization point such as a barrier or a lock. At this point, node zero must propagate all updates it has made to all pages to the home node. Assume node zero has modified pages X and Y. It computes the diff, which is a run-length encoded string of the differences between the original page and the modified page. Following that, it sends the differences to the home node one through RDMA Write, followed by a message containing the time-stamp. The home node then applies the diffs. It then sends an ACK back to node zero, which acts as a signal to node zero that the buffers on node one which were used to receive the diffs to page X have been freed up and may be reused. Node zero now computes the diffs for page Y and sends them to node one, and so on. Multiple buffers at the receiver may be used to speed up the process. We will now see how parts of this protocol can be enhanced with the features available in InfiniBand.
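The diff computation itself compares the modified page against its clean twin and run-length encodes the differing regions. The sketch below assumes byte-granularity comparison and a simple offset/length encoding; the actual HLRC encoding may differ in detail.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* One run of modified bytes: offset and length within the page, followed
 * (in the real protocol) by the modified bytes themselves. */
struct diff_run {
    uint32_t offset;
    uint32_t length;
};

/* Compare a modified page against its twin (clean copy) and emit a
 * run-length encoded list of differing regions.  Returns the number of
 * runs written into 'runs' (capacity 'max_runs'). */
size_t compute_diff(const uint8_t *modified, const uint8_t *twin,
                    struct diff_run *runs, size_t max_runs)
{
    size_t n = 0, i = 0;
    while (i < PAGE_SIZE && n < max_runs) {
        if (modified[i] != twin[i]) {
            size_t start = i;
            while (i < PAGE_SIZE && modified[i] != twin[i])
                i++;                             /* extend the dirty run */
            runs[n].offset = (uint32_t)start;
            runs[n].length = (uint32_t)(i - start);
            n++;
        } else {
            i++;
        }
    }
    return n;
}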
3.2 NEWGENDSM protocol
In this section we discuss the protocol NEWGENDSM proposed in [13]. NEWGENDSM consists of two components: ARDMAR (Atomic and RDMA Read) and DRAW (Diff with RDMA Write). ARDMAR uses the InfiniBand atomic operation Compare-and-Swap and RDMA Read for page fetching and synchronization. DRAW uses RDMA Write for diffing. First we describe ARDMAR, followed by DRAW.
3.2.1 ARDMAR

In this section we describe the design of ARDMAR, which is shown in Figure 6 for an example scenario. The atomic Compare-and-Swap operation has been combined with the RDMA Read operation to completely eliminate the asynchronous protocol processing. Let us assume the same pattern of requests for a page as shown in Figure 4. Here assume that node one wants to access page two for the first time. Let home(x) denote the last known value for the home of page x at node n. Initial values for home(x) are -1 at the default home node (indicating that a home has not been assigned) and x mod (number of nodes) at a non-home node. Initially home(2) = -1. Let us denote an issued atomic compare and swap operation as CMPSWAP(node, address, compare_with_value, swap_with_value), where address points to some location in node. Node one issues an atomic compare and swap CMPSWAP(2, home(2), -1, 1). The compare succeeds and now home(2) = 1. On completion of the atomic operation, node one knows that it is the home node (home(2) = 1) and can continue computation after setting the appropriate memory protections on the page.

Figure 6. The proposed protocol ARDMAR (for the example scenario)

Now assume that node zero wants to access page two for the first time. It also issues an atomic compare and swap operation CMPSWAP(2, home(2), -1, 0), which fails since home(2) = 1. Simultaneously with the atomic compare and swap, node zero also issues an RDMA Read to read in home(2). The atomic compare and swap having failed, node zero looks at the location read in by the RDMA Read. This location tells node zero that the actual home is now node one. Node zero now issues two simultaneous RDMA Reads. The first RDMA Read brings in the version of the page, while the second RDMA Read brings in the actual page. If the version does not match, both RDMAs are reissued until the correct version is obtained.
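The ARDMAR page fetch can be summarized as follows. rdma_cas() and rdma_read() are hypothetical blocking wrappers around the InfiniBand atomic Compare-and-Swap and RDMA Read verbs; for brevity the sketch issues the operations sequentially and reuses the old value returned by the CAS, whereas the protocol described above overlaps the CAS with a separate RDMA Read of home(x).

#include <stddef.h>
#include <stdint.h>

#define NO_HOME (-1)

/* Hypothetical blocking wrappers: issue the verb and wait for completion.
 * rdma_cas() returns the value found at the remote address before the swap. */
extern int64_t rdma_cas(int node, uint64_t remote_addr,
                        int64_t compare, int64_t swap);
extern void    rdma_read(int node, uint64_t remote_addr,
                         void *local_buf, size_t len);

extern int my_node_id;

/* Fetch a page under ARDMAR.  default_home is x mod (number of nodes);
 * the remote_* addresses locate home(x), the page version and the page
 * itself at the remote side (all assumed known to the requester). */
void ardmar_fetch_page(int default_home,
                       uint64_t remote_home_addr,
                       uint64_t remote_ver_addr,
                       uint64_t remote_page_addr,
                       int64_t expected_version,
                       void *local_page, size_t page_size)
{
    /* Try to appoint ourselves as home: CAS home(x) from -1 to our id. */
    int64_t old = rdma_cas(default_home, remote_home_addr,
                           NO_HOME, my_node_id);
    if (old == NO_HOME)
        return;                     /* CAS succeeded: we are now the home */

    /* CAS failed: 'old' holds the id of the actual home node.  Read the
     * version and the page, retrying until the expected version is seen. */
    int64_t version;
    do {
        rdma_read((int)old, remote_ver_addr, &version, sizeof(version));
        rdma_read((int)old, remote_page_addr, local_page, page_size);
    } while (version != expected_version);
}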
3.2.2 DRAW

Let us look at the design of DRAW. Figure 7 shows the protocol activity. DRAW uses RDMA Write to directly write the diffs to the page on the destination node. Again consider the diff creation and application activity shown in Figure 5.

Figure 7. The proposed protocol DRAW (for the example scenario)

DRAW moves most of the diffing activity into node zero, as shown in Figure 7. Assume that node zero has now initiated computing the diff at the synchronization point. Let modified(X, n) refer to the n'th position in page X. Let clean(X, n) refer to the n'th position in the twin of X, where twin refers to a clean copy of page X. Let buffer(t, n) refer to the n'th position in communication buffer t. Let RWRITE(source, dest, t, s, len) denote an RDMA Write descriptor initiated at node source, bound for node dest, using buffer t, starting at location s and of length len. Let us assume further that modified(X, i..j) differs from clean(X, i..j). In this case, DRAW copies modified(X, i..j) into buffer(t, i..j) and creates RWRITE(0, 1, t, i, j-i). Similarly, assume that modified(Y, b..f) differs from clean(Y, b..f). DRAW copies modified(Y, b..f) into buffer(t+1, b..f) and creates RWRITE(0, 1, t+1, b, f-b). Now, if pages X and Y are the only modified pages, at node zero DRAW issues RWRITE(0, 1, t, i, j-i) and RWRITE(0, 1, t+1, b, f-b). In addition, a message containing the timestamps of pages X and Y is also sent. At node one, on receiving the message containing the timestamps, DRAW updates the timestamps for pages X and Y.
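A sketch of the sender-side packing in DRAW for the two-page example above follows; post_rwrite() and send_timestamps() are hypothetical wrappers corresponding to the RWRITE descriptors and the timestamp message in the text.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrappers: post an RDMA Write of 'len' bytes starting at
 * offset 's' of communication buffer 't' into the home node's copy of the
 * page (i.e. RWRITE(src, dest, t, s, len)), and send the timestamp message. */
extern void post_rwrite(int dest, int t, size_t s, size_t len);
extern void send_timestamps(int dest, const int *pages,
                            const int64_t *timestamps, int npages);

/* Example from the text: at node zero, page X differs from its twin in
 * [i, j) and page Y differs in [b, f); the home of both pages is node one. */
void draw_flush_example(int home,
                        const uint8_t *modX, uint8_t *buf_t,  size_t i, size_t j,
                        const uint8_t *modY, uint8_t *buf_t1, size_t b, size_t f,
                        int64_t ts_X, int64_t ts_Y)
{
    /* Copy modified(X, i..j) into buffer(t, i..j), post RWRITE(0,1,t,i,j-i). */
    memcpy(&buf_t[i], &modX[i], j - i);
    post_rwrite(home, /*t=*/0, i, j - i);

    /* Copy modified(Y, b..f) into buffer(t+1, b..f), post RWRITE(0,1,t+1,b,f-b). */
    memcpy(&buf_t1[b], &modY[b], f - b);
    post_rwrite(home, /*t=*/1, b, f - b);

    /* One message carries the new timestamps of X and Y; node one only
     * updates its timestamps on receipt (no diff application at the home). */
    int     pages[2]      = { 0, 1 };          /* indices of X and Y (illustrative) */
    int64_t timestamps[2] = { ts_X, ts_Y };
    send_timestamps(home, pages, timestamps, 2);
}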
3.3 Potential Benefits With PCI-Express

As discussed in Section 3.2, NEWGENDSM is a more synchronous protocol. Its two components are ARDMAR and DRAW. ARDMAR reads pages using RDMA Read. DRAW propagates diffs through RDMA Write. As a result, NEWGENDSM is potentially sensitive to network characteristics. These characteristics include the latency and bandwidth of primitives like RDMA Read and Write. Additionally, since every node in the application might be executing the protocol, it is important that the network be able to support multiple probes from NEWGENDSM. This might result in a greater load on the network. As discussed in Section 3.2.1, ARDMAR might need to access the network multiple times to obtain a page. As a result, quick access to the network is an important requirement.

The requirements of NEWGENDSM match well with the architecture of PCI-Express, shown in Figure 3 and discussed in Section 2.3. PCI-Express is a serial point-to-point link interface, unlike the shared bus medium of PCI-X. As a result, there is less contention to access the network. This also translates into improved latency. Additionally, PCI-Express x8 allows an aggregate bandwidth of up to 4 GigaBytes/sec, allowing NEWGENDSM to fully exploit the aggregate bandwidth of InfiniBand 4X. Thus, multiple nodes executing NEWGENDSM might be able to tap into the network with lower latency and higher bandwidth, for improved performance.
4 Performance Evaluation

This section evaluates the performance of NEWGENDSM and ASYNC on PCI-X and PCI-Express architectures. The evaluation is in terms of micro-benchmarks and applications. First we describe the hardware setup in Section 4.1. Following that, NEWGENDSM is evaluated using the page micro-benchmark in Section 4.2. The application level evaluation is presented in Section 4.3.
4.1 Experimental Test Bed

The experiments were run on a four node cluster. Each node has dual 3.4 GHz Xeon processors (64-bit) and 512 MB of main memory. The nodes have both 64-bit/133 MHz PCI-X and x8 PCI-Express interfaces. InfiniBand MT23108 (PCI-X) and MT25208 (PCI-Express) adapters were installed in each node. The adapters were connected through an InfiniScale MT43132 switch. Each node runs Linux with kernel version 2.4.21.
4.2 Micro-benchmark level evaluation

The performance of ARDMAR and ASYNC on PCI-X and PCI-Express was evaluated using the page fetch micro-benchmark. This micro-benchmark is modified from the original version implemented for the TreadMarks SDSM package [9]. Page fetch time is the elapsed interval between sending a request for a page and actually getting the page. It is measured as follows. The first node (master node) initially touches each of 1024 pages so that the home node is assigned to it. Following that, each of the remaining nodes reads one word from each of the 1024 pages. This results in all 1024 pages being read from the first node. As the number of nodes increases, the contention for a page at the master node increases. The time of the second phase is measured.
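A sketch of the measurement loop follows. Here shared, my_node_id and barrier() stand in for the SDSM runtime's shared-memory region, node identifier and barrier primitive (the exact calls in the modified TreadMarks benchmark are not shown in the paper), and the 4096-byte page size is an assumption.

#include <stdio.h>
#include <sys/time.h>

#define NPAGES    1024
#define PAGE_SIZE 4096            /* assumed page size */

extern volatile char *shared;     /* NPAGES * PAGE_SIZE bytes of SDSM shared memory */
extern int  my_node_id;           /* 0 is the master node */
extern void barrier(void);        /* SDSM barrier primitive */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void page_fetch_benchmark(void)
{
    /* Phase 1: the master touches every page so that it becomes the home. */
    if (my_node_id == 0)
        for (int p = 0; p < NPAGES; p++)
            shared[p * PAGE_SIZE] = 1;
    barrier();

    /* Phase 2 (timed): every other node reads the first byte of each page
     * (a one-word read in the original), forcing all 1024 pages to be
     * fetched from the master node. */
    double t0 = now_sec();
    if (my_node_id != 0) {
        volatile char sink = 0;
        for (int p = 0; p < NPAGES; p++)
            sink += shared[p * PAGE_SIZE];
        (void)sink;
        printf("node %d: page fetch phase took %.3f sec\n",
               my_node_id, now_sec() - t0);
    }
    barrier();
}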
Table 2 shows the results of running the page micro-benchmark on PCI-X and PCI-Express. We see that ASYNC performs slightly worse than NEWGENDSM at two and four nodes, when comparing on the same interface (PCI-X or PCI-Express). This matches the observations in [13]. NEWGENDSM is expected to show improved performance at larger system sizes. But we can see that PCI-Express performs better than PCI-X for both ASYNC and NEWGENDSM. There is approximately a 6-15% improvement when replacing PCI-X with PCI-Express at two nodes. This increases to approximately 18-21% at four nodes. We would expect the improvement to increase with increasing system size. ASYNC and NEWGENDSM were not evaluated using the diff micro-benchmark discussed in [13], since the diff micro-benchmark does not measure improvement due to communication, the metric we are interested in.

4.3 Application level evaluation
In this section we evaluate our implementation using four different applications: Barnes-Hut (Barnes), three dimensional FFT (3Dfft), LU Decomposition (LU) and Integer Sort (IS). Of these applications, Barnes and LU have been taken from the SPLASH-2 benchmark suite [19], while 3Dfft and IS have been taken from the TreadMarks [9] SDSM package. The application sizes used are shown in Table 3. All other parameters were kept the same as originally described in [19, 9]. We will now discuss the performance numbers for the different applications.
4.3.1 Application Characteristics

In this section, we discuss some of the application characteristics relevant to our design. The first application, Barnes, is an N-Body simulation using the hierarchical Barnes-Hut method. It contains two main arrays, one containing the bodies in the simulation and the other the cells. Sharing patterns are irregular and true. A fairly large amount of diff traffic is exchanged at barriers, which are the synchronization points. 3Dfft performs a three dimensional Fast Fourier Transform. It exchanges a large volume of messages per unit time. It also uses a large number of locks and barriers for synchronization. The LU program factors a dense matrix into the product of a lower triangular and an upper triangular matrix. The factorization uses blocking to exploit temporal locality on individual submatrix elements. LU sends a large number of diffs. It also exchanges the maximum amount of data traffic. IS implements Bucket Sort. There is a global array containing the buckets, and a local array which the local node uses to sort its data. After each iteration, each node places its data in the global array and copies the data relevant to it into its local array. A large number of diffs are exchanged at intervals.
Table 3. Application sizes

Application    Size
Barnes         4096
3Dfft          Grid size
LU             16
IS             num of keys