Can High Performance Software DSM Systems Designed With InfiniBand
Features Benefit from PCI-Express?
Ranjit Noronha and Dhabaleswar K. Panda
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, OH 43210
{noronha,panda}@cse.ohio-state.edu
Abstract
The performance of software distributed shared memory systems has traditionally been lower than that of other programming models, primarily because of overhead in the coherency protocol, communication bottlenecks and slow networks. Software DSMs have benefited from interconnection technologies like Myrinet, InfiniBand and Quadrics, which offer low latency and high bandwidth communication. Additionally, some of their features, like RDMA and atomic operations, have been used to implement some portions of the software DSM protocols directly, further reducing overhead. Such network-aware protocols are dependent on the characteristics of the networks. The performance of the networks is in turn dependent on the system architecture, especially the I/O bus. PCI-Express, the successor to the PCI-X architecture, offers improved latency and bandwidth characteristics. In this paper, we evaluate the impact of using an improved bus technology like PCI-Express on the performance of software DSM protocols which use the network features of InfiniBand. We see a reduction in application execution time of up to 13% at four nodes when PCI-Express is used instead of PCI-X.
Keywords: DSM Systems, Cache coherency protocol, InfiniBand, System Area Networks
1 Introduction

Software Distributed Shared Memory (SDSM) systems allow developers to easily program parallel applications without having to worry about explicit communication of shared data items. The penalty for this is the higher overhead from false sharing, which leads to an increase in communication. The cost of communication was significant on older generations of networks like Gigabit Ethernet. Advances in interconnection technologies like Myrinet, InfiniBand and Quadrics have reduced this cost, but the performance of these networks is in turn limited by the system architecture, especially the I/O bus. Current generation systems employ the PCI-X architecture, which can provide bandwidth up to 1 GigaByte/sec. InfiniBand 4X, which can deliver an aggregate bandwidth of up to 2 GigaBytes/sec, is thus limited by the PCI-X architecture. Additionally, the shared architecture of PCI-X further impinges on the performance of InfiniBand 4X.

Figure 1. DSM client-server communication model.

PCI-Express, the successor to PCI-X, employs serial point-to-point links. Each link is made of x1, x2, x4, x8, x12, x16 or x32 lanes. Each lane can transmit signals in both directions. The links can sustain up to 2.5 GigaBits/sec/lane/direction (0.5 GigaBytes/sec aggregate bandwidth per lane). InfiniBand 4X can be fully utilized using PCI-Express x4 and higher interfaces.
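As a rough consistency check using the per-lane figure quoted above, the aggregate bandwidth of a link scales linearly with the number of lanes:

    aggregate bandwidth = 0.5 GigaBytes/sec/lane x (number of lanes)
    x4: 0.5 x 4 = 2 GigaBytes/sec (matching the 2 GigaBytes/sec aggregate of InfiniBand 4X)
    x8: 0.5 x 8 = 4 GigaBytes/sec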
In this paper, we evaluate the impact of improved bus technology on the performance of the network-based protocol NEWGENDSM. This protocol, NEWGENDSM, is potentially sensitive to communication latency and bandwidth. Bus technologies like PCI-Express allow InfiniBand 4X to achieve higher bandwidth and lower latency. In Section 2, we discuss software distributed shared memory, networking and bus technologies. In Section 3, we provide an overview of the protocol NEWGENDSM. In Section 4, we discuss the impact of PCI-Express on NEWGENDSM, both in terms of micro-benchmarks and applications. Finally, in Section 6, we present our conclusions and discuss future work.
2 Background

In this section, we discuss some of the issues relating to SDSMs, networks and bus architectures. In particular, in Section 2.1, we discuss a popular SDSM package, HLRC. In Section 2.2, modern networks are discussed. In Section 2.3, the bus architectures PCI-X and PCI-Express are discussed.
2.1 Overview of Software DSM
Software Distributed Shared Memory systems provide an easy programming environment to the application developer. However, the penalty for this ease of programmability is lower performance from false sharing and increased communication. SDSMs can mainly be categorized into those which implement the non-home based lazy release consistency model, like TreadMarks [9], and those that implement the home based model.

HLRC [20] is a popular software DSM which runs without any modification to the operating system kernel. HLRC implements the home based lazy release consistency protocol and relies on user-level memory management techniques to provide coherency among participating processes. The communication protocol used is the Virtual Interface Architecture, or VIA [7]. The VIA implementation of HLRC has been ported to work with InfiniBand VAPI. The basic design of HLRC, as well as its successors with InfiniBand network features, is described in Section 3. In the next section, we discuss the networking technology InfiniBand.
2.2 Overview of Network Technologies
In the high performance computing arena, there are three popular technologies, namely Myrinet, InfiniBand and Quadrics. These networks allow low latency and high bandwidth communication. They also allow the application to use features like Remote Direct Memory Access (RDMA), atomic operations and hardware multicast.

InfiniBand [1] uses a switched, channel-based interconnection fabric, which allows for higher bandwidth, more reliability and better QoS support. The Mellanox implementation of the InfiniBand Verbs API (VAPI) supports the basic send-receive model and Remote Direct Memory Access (RDMA) operations. The communication architecture used is the work-queue, completion-queue model. In this model, the sender posts work request descriptors to a queue. The HCA, in turn, processes the requests and inserts result descriptors in the completion queue. The sender may check the status of a request by polling the completion queue. RDMA Write allows the user to write data elements directly into the memory space of remote nodes. RDMA Read allows an application to directly read data from the memory space of remote nodes. There is also support for atomic operations and multicast. The current generation of adapters from Mellanox is 4X. In InfiniBand 4X, small message latency is of the order of a few microseconds, while unidirectional bandwidth for large messages is up to 1 GigaByte/sec.
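To make the work-queue, completion-queue model concrete, the sketch below posts a single RDMA Write and polls for its completion using the current OpenFabrics libibverbs interface (the paper itself uses the older Mellanox VAPI, whose calls differ in name but follow the same model). The queue pair, completion queue and memory region are assumed to have been created and connected beforehand.

#include <infiniband/verbs.h>
#include <stdint.h>

/* Write 'len' bytes from a registered local buffer into a remote buffer
 * identified by (remote_addr, rkey), then wait for the completion.
 * qp, cq and mr are assumed to come from the usual setup sequence
 * (ibv_open_device, ibv_alloc_pd, ibv_reg_mr, ibv_create_cq, ...). */
static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *local_buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,     /* RDMA Write work request */
        .send_flags = IBV_SEND_SIGNALED,     /* generate a completion   */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr = NULL;
    if (ibv_post_send(qp, &wr, &bad_wr))     /* post to the send queue  */
        return -1;

    struct ibv_wc wc;                        /* poll the completion queue */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}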
2.3 Overview of Bus Technologies
PCI [2] has been the dominant standard interconnecting the I/O and memory modules of a computer system for almost a decade. All other computer system components besides PCI, such as the CPU, memory, etc., have been scaling in some form or the other with time. The low bandwidth (132 MegaBytes/sec) of PCI posed a serious bottleneck. PCI-X, a 64-bit/133 MHz enhancement of PCI, provides aggregate bandwidth up to 1 GigaByte/sec. The shared nature of the PCI architecture, where several I/O devices could share a single bus, however, still has limited scalability. As shown in Table 1, the PCI architecture could scale in terms of bandwidth, but at the expense of the number of devices it could serve. The introduction of PCI-Express alleviated some of these limitations. PCI-Express [3] employs serial point-to-point links between memory and I/O devices, as shown in Figure 3.
Figure 2. Performance comparison between PCI-X and PCI-Express using InfiniBand in terms of latency and bandwidth micro-benchmarks (RDMA Write and RDMA Read latency in microseconds and bandwidth in MegaBytes/sec versus message size).
Table 1. Capabilities of 64-bit PCI bus [16]

Interface          Cards/Bus    Bandwidth (MegaBytes/sec)
PCI (66 MHz)       1-2          532
PCI-X (133 MHz)    1-2          2132
3.1.1 Page Fetching in ASYNC

Initially, all pages are assigned default home nodes. Let us assume that the default home for page two is node two. Now node one first requests page two from node two by sending it a message (RDMA Write with immediate data), which is processed by the asynchronous protocol handler. The handler on node two appoints node one as the home node and updates its data structures. It then finishes servicing the request by sending a reply to node one (RDMA Write). Node one, on receiving this reply, updates its data structures and continues computing. Now when node zero requests this page for the first time by sending a request to the default home, node two, node two forwards this request to node one, which in turn processes the request and replies to node zero with the correct version of the page. We define page fetch time, or page time, as the elapsed time between sending a page request and actually receiving the page.
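The decision logic of the asynchronous handler described above can be summarized by the following sketch. The data structures and helper functions (page_table, send_home_grant, forward_page_request, send_page) are hypothetical stand-ins for HLRC's internal routines, which the paper does not spell out; only the branching follows the description in the text, and the replies are sent via RDMA Write as stated above.

#define NO_HOME (-1)

/* Hypothetical per-page state; the real HLRC data structures differ. */
struct page_entry {
    int home;        /* current home node, or NO_HOME at the default home */
    int version;     /* version/timestamp of the local copy               */
};

extern struct page_entry page_table[];   /* one entry per shared page */
extern int my_node_id;

/* Hypothetical reply helpers; in the protocol the replies are RDMA Writes. */
extern void send_home_grant(int requester, int page);
extern void forward_page_request(int new_home, int requester, int page);
extern void send_page(int requester, int page, int version);

/* Asynchronous protocol handler run at the default home node when a page
 * request (delivered as an RDMA Write with immediate data) arrives. */
void handle_page_request(int requester, int page)
{
    struct page_entry *pe = &page_table[page];

    if (pe->home == NO_HOME) {
        /* First request ever: appoint the requester as home and reply. */
        pe->home = requester;
        send_home_grant(requester, page);
    } else if (pe->home != my_node_id) {
        /* The home has already been assigned elsewhere: forward. */
        forward_page_request(pe->home, requester, page);
    } else {
        /* We are the home: reply with the current version of the page. */
        send_page(requester, page, pe->version);
    }
}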
3.1.2 Diffing in ASYNC

Figure 5. The original ASYNC protocol on a diff (for an example scenario)

Now we go through the protocol steps when computing and applying diffs. As shown in Figure 5, node zero arrives at a synchronization point such as a barrier or a lock. At this point, node zero must propagate all updates it has made to all pages to the home node. Assume node zero has modified pages X and Y. It computes the diff, which is a run-length encoded string of the differences between the original page and the modified page. Following that, it sends the differences to the home node one through RDMA Write, followed by a message containing the time-stamp. The home node then applies the diffs. It then sends an ACK back to node zero, which acts as a signal to node zero that the buffers on node one which were used to receive the diffs to page X have been freed up and may be reused. Node zero now computes the diffs for page Y and sends them to node one, and so on. Multiple buffers at the receiver may be used to speed up the process. We will now see how parts of this protocol can be enhanced with the features available in InfiniBand.
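The diff computation itself compares the modified page against its clean twin and run-length encodes the differing regions. The sketch below assumes byte-granularity comparison and a simple offset/length encoding; the actual HLRC encoding may differ in detail.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* One run of modified bytes: offset and length within the page, followed
 * (in the real protocol) by the modified bytes themselves. */
struct diff_run {
    uint32_t offset;
    uint32_t length;
};

/* Compare a modified page against its twin (clean copy) and emit a
 * run-length encoded list of differing regions.  Returns the number of
 * runs written into 'runs' (capacity 'max_runs'). */
size_t compute_diff(const uint8_t *modified, const uint8_t *twin,
                    struct diff_run *runs, size_t max_runs)
{
    size_t n = 0, i = 0;
    while (i < PAGE_SIZE && n < max_runs) {
        if (modified[i] != twin[i]) {
            size_t start = i;
            while (i < PAGE_SIZE && modified[i] != twin[i])
                i++;                             /* extend the dirty run */
            runs[n].offset = (uint32_t)start;
            runs[n].length = (uint32_t)(i - start);
            n++;
        } else {
            i++;
        }
    }
    return n;
}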
3.2 NEWGENDSM protocol
In this section we discuss the protocol NEWGENDSM proposed in [13]. NEWGENDSM consists of two components: ARDMAR (Atomic and RDMA Read) and DRAW (Diff with RDMA Write). ARDMAR uses the InfiniBand atomic operation Compare-and-Swap and RDMA Read for page fetching and synchronization. DRAW uses RDMA Write for diffing. First we describe ARDMAR, followed by DRAW.
3.2.1 ARDMAR

In this section we describe the design of ARDMAR, which is shown in Figure 6 for an example scenario. The atomic Compare-and-Swap operation has been combined with the RDMA Read operation to completely eliminate the asynchronous protocol processing. Let us assume the same pattern of requests for a page as shown in Figure 4. Here assume that node one wants to access page two for the first time. Let home(x) denote the last known value for the home of page x at node n. Initial values for home(x) are -1 at the default home node (indicating that a home has not been assigned) and x mod (number of nodes) at a non-home node. Initially home(2) = -1. Let us denote an issued atomic compare and swap operation as CMPSWAP(node, address, compare_with_value, swap_with_value), where address points to some location in node. Node one issues an atomic compare and swap CMPSWAP(2, home(2), -1, 1). The compare succeeds and now home(2) = 1. On completion of the atomic operation, node one knows that it is the home node (home(2) = 1) and can continue computation after setting the appropriate memory protections on the page.

Figure 6. The proposed protocol ARDMAR (for the example scenario)

Now assume that node zero wants to access page two for the first time. It also issues an atomic compare and swap operation CMPSWAP(2, home(2), -1, 0), which fails since home(2) = 1. Simultaneously with the atomic compare and swap, node zero also issues an RDMA Read to read in home(2). The atomic compare and swap having failed, node zero looks at the location read in by the RDMA Read. This location tells node zero that the actual home is now node one. Node zero now issues two simultaneous RDMA Reads. The first RDMA Read brings in the version of the page, while the second RDMA Read brings in the actual page. If the version does not match, both RDMAs are reissued until the correct version is obtained.
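The ARDMAR page fetch can be summarized as follows. rdma_cas() and rdma_read() are hypothetical blocking wrappers around the InfiniBand atomic Compare-and-Swap and RDMA Read verbs; for brevity the sketch issues the operations sequentially and reuses the old value returned by the CAS, whereas the protocol described above overlaps the CAS with a separate RDMA Read of home(x).

#include <stddef.h>
#include <stdint.h>

#define NO_HOME (-1)

/* Hypothetical blocking wrappers: issue the verb and wait for completion.
 * rdma_cas() returns the value found at the remote address before the swap. */
extern int64_t rdma_cas(int node, uint64_t remote_addr,
                        int64_t compare, int64_t swap);
extern void    rdma_read(int node, uint64_t remote_addr,
                         void *local_buf, size_t len);

extern int my_node_id;

/* Fetch a page under ARDMAR.  default_home is x mod (number of nodes);
 * the remote_* addresses locate home(x), the page version and the page
 * itself at the remote side (all assumed known to the requester). */
void ardmar_fetch_page(int default_home,
                       uint64_t remote_home_addr,
                       uint64_t remote_ver_addr,
                       uint64_t remote_page_addr,
                       int64_t expected_version,
                       void *local_page, size_t page_size)
{
    /* Try to appoint ourselves as home: CAS home(x) from -1 to our id. */
    int64_t old = rdma_cas(default_home, remote_home_addr,
                           NO_HOME, my_node_id);
    if (old == NO_HOME)
        return;                     /* CAS succeeded: we are now the home */

    /* CAS failed: 'old' holds the id of the actual home node.  Read the
     * version and the page, retrying until the expected version is seen. */
    int64_t version;
    do {
        rdma_read((int)old, remote_ver_addr, &version, sizeof(version));
        rdma_read((int)old, remote_page_addr, local_page, page_size);
    } while (version != expected_version);
}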
3.2.2 DRAW

Let us look at the design of DRAW. Figure 7 shows the protocol activity. DRAW uses RDMA Write to directly write the diffs to the page on the destination node. Again consider the diff creation and application activity shown in Figure 5.

Figure 7. The proposed protocol DRAW (for the example scenario)

DRAW moves most of the diffing activity into node zero, as shown in Figure 7. Assume that node zero has now initiated computing the diff at the synchronization point. Let modified(X, n) refer to the n'th position in page X. Let clean(X, n) refer to the n'th position in the twin of X, where twin refers to a clean copy of page X. Let buffer(t, n) refer to the n'th position in communication buffer t. Let RWRITE(source, dest, t, s, len) denote an RDMA Write descriptor initiated at node source, bound for node dest, using buffer t, starting at location s and of length len. Let us assume further that modified(X, i..j) differs from clean(X, i..j). In this case, DRAW copies modified(X, i..j) into buffer(t, i..j) and creates RWRITE(0, 1, t, i, j-i). Similarly, assume that modified(Y, b..f) differs from clean(Y, b..f). DRAW copies modified(Y, b..f) into buffer(t+1, b..f) and creates RWRITE(0, 1, t+1, b, f-b). Now, if pages X and Y are the only modified pages, at node zero DRAW issues RWRITE(0, 1, t, i, j-i) and RWRITE(0, 1, t+1, b, f-b). In addition, a message containing the timestamps of pages X and Y is also sent. At node one, on receiving the message containing the timestamps, DRAW updates the timestamps for pages X and Y.
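A sketch of the sender-side packing in DRAW for the two-page example above follows; post_rwrite() and send_timestamps() are hypothetical wrappers corresponding to the RWRITE descriptors and the timestamp message in the text.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrappers: post an RDMA Write of 'len' bytes starting at
 * offset 's' of communication buffer 't' into the home node's copy of the
 * page (i.e. RWRITE(src, dest, t, s, len)), and send the timestamp message. */
extern void post_rwrite(int dest, int t, size_t s, size_t len);
extern void send_timestamps(int dest, const int *pages,
                            const int64_t *timestamps, int npages);

/* Example from the text: at node zero, page X differs from its twin in
 * [i, j) and page Y differs in [b, f); the home of both pages is node one. */
void draw_flush_example(int home,
                        const uint8_t *modX, uint8_t *buf_t,  size_t i, size_t j,
                        const uint8_t *modY, uint8_t *buf_t1, size_t b, size_t f,
                        int64_t ts_X, int64_t ts_Y)
{
    /* Copy modified(X, i..j) into buffer(t, i..j), post RWRITE(0,1,t,i,j-i). */
    memcpy(&buf_t[i], &modX[i], j - i);
    post_rwrite(home, /*t=*/0, i, j - i);

    /* Copy modified(Y, b..f) into buffer(t+1, b..f), post RWRITE(0,1,t+1,b,f-b). */
    memcpy(&buf_t1[b], &modY[b], f - b);
    post_rwrite(home, /*t=*/1, b, f - b);

    /* One message carries the new timestamps of X and Y; node one only
     * updates its timestamps on receipt (no diff application at the home). */
    int     pages[2]      = { 0, 1 };          /* indices of X and Y (illustrative) */
    int64_t timestamps[2] = { ts_X, ts_Y };
    send_timestamps(home, pages, timestamps, 2);
}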
3.3 Potential Benefits With PCI-Express

As discussed in Section 3.2, NEWGENDSM is a more synchronous protocol. Its two components are ARDMAR and DRAW. ARDMAR reads pages using RDMA Read. DRAW propagates diffs through RDMA Write. As a result, NEWGENDSM is potentially sensitive to network characteristics. These characteristics include the latency and bandwidth of primitives like RDMA Read and Write. Additionally, since every node in the application might be executing the protocol, it is important that the network be able to support multiple probes from NEWGENDSM. This might result in a greater load on the network. As discussed in Section 3.2.1, ARDMAR might need to access the network multiple times to obtain a page. As a result, quick access to the network is an important requirement.

The requirements of NEWGENDSM match well with the architecture of PCI-Express, shown in Figure 3 and discussed in Section 2.3. PCI-Express is a serial point-to-point link interface, unlike the shared bus medium of PCI-X. As a result, there is less contention to access the network. This also translates into improved latency. Additionally, PCI-Express x8 allows an aggregate bandwidth of up to 4 GigaBytes/sec, allowing NEWGENDSM to fully exploit the aggregate bandwidth of InfiniBand 4X. Thus, multiple nodes executing NEWGENDSM might be able to tap into the network with lower latency and higher bandwidth, for improved performance.
4 Performance Evaluation

This section evaluates the performance of NEWGENDSM and ASYNC on PCI-X and PCI-Express architectures. The evaluation is in terms of micro-benchmarks and applications. First we describe the hardware setup in Section 4.1. Following that, NEWGENDSM is evaluated using the page micro-benchmark in Section 4.2. The application level evaluation is presented in Section 4.3.
4.1 Experimental Test Bed

The experiments were run on a four node cluster. Each node has dual 3.4 GHz Xeon processors (64-bit) and 512 MB of main memory. The nodes have both 64-bit/133 MHz PCI-X and x8 PCI-Express interfaces. InfiniBand MT23108 (PCI-X) and MT25208 (PCI-Express) adapters were installed in each node. The adapters were connected through an InfiniScale MT43132 switch. Each node runs Linux with kernel version 2.4.21.
4.2 Micro-benchmark level evaluation

The performance of ARDMAR and ASYNC on PCI-X and PCI-Express was evaluated using the page fetch micro-benchmark. This micro-benchmark is modified from the original version implemented for the TreadMarks SDSM package [9]. Page fetch time is the elapsed interval between sending a request for a page and actually getting the page. It is measured as follows. The first node (master node) initially touches each of 1024 pages so that the home node is assigned to it. Following that, each of the remaining nodes reads one word from each of the 1024 pages. This results in all 1024 pages being read from the first node. As the number of nodes increases, the contention for a page at the master node increases. The time of the second phase is measured.
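A sketch of the measurement loop follows. Here shared, my_node_id and barrier() stand in for the SDSM runtime's shared-memory region, node identifier and barrier primitive (the exact calls in the modified TreadMarks benchmark are not shown in the paper), and the 4096-byte page size is an assumption.

#include <stdio.h>
#include <sys/time.h>

#define NPAGES    1024
#define PAGE_SIZE 4096            /* assumed page size */

extern volatile char *shared;     /* NPAGES * PAGE_SIZE bytes of SDSM shared memory */
extern int  my_node_id;           /* 0 is the master node */
extern void barrier(void);        /* SDSM barrier primitive */

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

void page_fetch_benchmark(void)
{
    /* Phase 1: the master touches every page so that it becomes the home. */
    if (my_node_id == 0)
        for (int p = 0; p < NPAGES; p++)
            shared[p * PAGE_SIZE] = 1;
    barrier();

    /* Phase 2 (timed): every other node reads the first byte of each page
     * (a one-word read in the original), forcing all 1024 pages to be
     * fetched from the master node. */
    double t0 = now_sec();
    if (my_node_id != 0) {
        volatile char sink = 0;
        for (int p = 0; p < NPAGES; p++)
            sink += shared[p * PAGE_SIZE];
        (void)sink;
        printf("node %d: page fetch phase took %.3f sec\n",
               my_node_id, now_sec() - t0);
    }
    barrier();
}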
Table 2 shows the results of running the page micro-benchmark on PCI-X and PCI-Express. We see that ASYNC performs slightly worse than NEWGENDSM at two and four nodes, when comparing on the same interface (PCI-X or PCI-Express). This matches the observations in [13]. NEWGENDSM is expected to show improved performance at larger system sizes. But we can see that PCI-Express performs better than PCI-X for both ASYNC and NEWGENDSM. There is approximately a 6-15% improvement when replacing PCI-X with PCI-Express at two nodes. This increases to approximately 18-21% at four nodes. We would expect the improvement to increase with increasing system size. ASYNC and NEWGENDSM were not evaluated using the diff micro-benchmark discussed in [13], since the diff micro-benchmark does not measure improvement due to communication, the metric we are interested in.

4.3 Application level evaluation
In this section we evaluate our implementation using four different applications: Barnes-Hut (Barnes), three dimensional FFT (3Dfft), LU Decomposition (LU) and Integer Sort (IS). Of these applications, Barnes and LU have been taken from the SPLASH-2 benchmark suite [19], while 3Dfft and IS have been taken from the TreadMarks [9] SDSM package. The application sizes used are shown in Table 3. All other parameters were kept the same as originally described in [19, 9]. We will now discuss the performance numbers for the different applications.
4.3.1 Application Characteristics

In this section, we discuss some of the application characteristics relevant to our design. The first application, Barnes, is an N-Body simulation using the hierarchical Barnes-Hut method. It contains two main arrays, one containing the bodies in the simulation and the other the cells. Sharing patterns are irregular and true. A fairly large amount of diff traffic is exchanged at barriers, which are the synchronization points. 3Dfft performs a three dimensional Fast Fourier Transform. It exchanges a large volume of messages per unit time. It also uses a large number of locks and barriers for synchronization. The LU program factors a dense matrix into the product of a lower triangular and an upper triangular matrix. The factorization uses blocking to exploit temporal locality on individual submatrix elements. LU sends a large number of diffs. It also exchanges the maximum amount of data traffic. IS implements Bucket Sort. There is a global array containing the buckets, and a local array which the local node uses to sort its data. After each iteration, each node places its data in the global array and copies the data relevant to it into its local array. A large number of diffs are exchanged at intervals.
Table 3. Application sizes

Application    Size
Barnes         4096
3Dfft          Grid size
LU             16
IS             num of keys