A Performance Study to Guide RDMA Programming Decisions
Patrick MacArthur, Robert D. Russell
Computer Science Department
University of New Hampshire
Durham, New Hampshire 03824-3591, USA
pio3@unh.edu, rdr@unh.edu
Abstract—This paper describes a performance study of Remote Direct Memory Access (RDMA) programming techniques. Its goal is to use the results as a guide for making “best practice” RDMA programming decisions.
Infiniband RDMA is widely used in scientific high performance computing (HPC) clusters as a low-latency, high-bandwidth, reliable interconnect accessed via MPI. Recently it is gaining adherents outside scientific HPC as high-speed clusters appear in other application areas for which MPI is not suitable. RDMA enables user applications to move data directly between virtual memory on different nodes without operating system intervention, so there is a need to know how to incorporate RDMA access into high-level programs. But RDMA offers more options to a programmer than traditional sockets programming, and it is not always obvious what the performance tradeoffs of these options might be. This study is intended to provide some answers.
Keywords-RDMA; Infiniband; OFA; OFED; HPC
I. INTRODUCTION
As networks grow faster, Remote Direct Memory Access (RDMA) is rapidly gaining popularity outside its traditional application area of scientific HPC. RDMA allows application programmers to directly transfer data between user-space virtual memories on different machines without kernel intervention, thereby bypassing extra copying and processing that reduce performance in conventional networking technologies. RDMA is completely message-oriented, so all application messages are sent and received as units, unlike TCP/IP, which treats network communication as a stream of bytes.
User-level RDMA programming offers many more options and is more complex than traditional socket programming, as it requires the programmer to directly manipulate functions and data structures defined by the network interface in order to directly control all aspects of RDMA message transmission. Therefore, the programmer must make many decisions which may drastically affect performance. The goal of this paper is to evaluate the performance of numerous methods of directly using the application-level RDMA features in practice.
Similar performance evaluations have been done for specific applications and tools that use RDMA, such as MPI [1], [2], FTP [3], GridFTP [4], NFS [5], AMQP [6], PVFS [7], etc. The importance of our work is that we evaluate RDMA directly, without any particular application or environment in mind, and provide guidance on general design options faced by anyone directly using RDMA.
A. Background
Three RDMA technologies are in use today: Infiniband (IB), Internet Wide-Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE). Infiniband [8], [9] defines a completely self-contained protocol stack, utilizing its own interface adapters, switches, and cables. iWARP defines three thin protocol layers [10]–[12] on top of the existing TCP/IP (i.e., standard Internet). RoCE [13] simply replaces the physical and data-link layers of the Infiniband protocol stack with Ethernet. All three technologies are packaged as self-contained interface adapters and drivers, and there are software-only versions for both iWARP [14] and RoCE [15].
The OpenFabrics Alliance (OFA) [16] publishes and maintains a common user-level Application Programming Interface (API) for all three RDMA technologies. It provides direct, efficient user-level access to all features supported by each RDMA technology. OFA also provides open access to a reference implementation of this API, along with useful utilities, called the OpenFabrics Enterprise Distribution (OFED) [17]. This API is used throughout this study.
In RDMA, actions are specified by verbs which convey requests to the network adapter. Each verb, such as post_send, is represented in the OFED API as a library function, ibv_post_send, with associated parameters and data structures. To initiate a transfer, ibv_post_send places a work request data structure describing the transfer onto a network adapter queue. Data transfers are all asynchronous: once a work request has been posted, control returns to the user-space application, which must later use the ibv_poll_cq function to remove a work completion data structure from a network adapter’s completion queue. This completion contains the status for the finished transfer and tells the application it can again safely access the virtual memory used in the transfer.
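To make this post/poll cycle concrete, the following minimal sketch (ours, not taken from the paper's test programs) posts a single signaled SEND and busy-polls for its completion. It assumes a connected queue pair qp, its completion queue cq, and a memory region mr already registered over buf with ibv_reg_mr; the helper name send_and_wait is hypothetical.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,      /* virtual address of the message */
        .length = len,
        .lkey   = mr->lkey,            /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;                 /* echoed back in the work completion */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED; /* ask for a work completion */

    if (ibv_post_send(qp, &wr, &bad_wr))   /* hand the work request to the adapter */
        return -1;

    do {                                   /* busy polling */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0)
        return -1;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}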
RDMA provides four sets of work request opcodes to describe a data transfer. The SEND/RECV set superficially resembles a normal socket transfer. A receiver first posts a RECV work request that describes a virtual memory
area into which the adapter should place a single message. The sender then posts a SEND work request describing a virtual memory area containing the message to be sent. The network adapters transfer data directly from the sender’s virtual memory area to the receiver’s virtual memory area without any intermediate copies. Since both sides of the transfer are required to post work requests, this is called a “two-sided” transfer.
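A matching sketch of the receiving side (again ours, under the same assumptions about qp and mr) shows how the RECV work request is posted; it must be posted before the peer's SEND arrives.

#include <infiniband/verbs.h>
#include <stdint.h>

static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* where the incoming message should land */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr = NULL;

    return ibv_post_recv(qp, &wr, &bad_wr);   /* a completion appears when a SEND arrives */
}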
The second set is a “one-sided” transfer in which a sender posts a RDMA WRITE request that “pushes” a message directly into a virtual memory area that the receiving side previously described to the sender. The receiving side’s CPU is completely “passive” during the transfer, which is why this is called “one-sided.”
The third set is also a “one-sided” transfer in which the receiver posts a RDMA READ request that “pulls” a message directly from the sending side’s virtual memory, and the sending side’s CPU is completely passive.
Because the passive side in a “one-sided” transfer does not know when that transfer completes, there is another “two-sided” opcode set in which the sender posts a RDMA WRITE WITH IMM request to “push” a message directly into the receiving side’s virtual memory, as for RDMA WRITE, but the send work request also includes 4 bytes of immediate (out-of-band) data that is delivered to the receiver on completion of the transfer. The receiving side posts a RECV work request to catch the 4 bytes, and the work completion for the RECV indicates the status and amount of data transferred in the message.
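As an illustration of this “push with notification” case, the hypothetical helper below fills in the fields specific to RDMA WRITE WITH IMM: the remote virtual address and remote key, which must have been exchanged out of band beforehand, and the 4 bytes of immediate data.

#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <string.h>

static int post_write_with_imm(struct ibv_qp *qp, struct ibv_sge *sge,
                               uint64_t remote_addr, uint32_t rkey,
                               uint32_t imm)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.imm_data            = htonl(imm);   /* immediate data travels in network byte order */
    wr.wr.rdma.remote_addr = remote_addr;  /* where to "push" the data on the passive side */
    wr.wr.rdma.rkey        = rkey;         /* remote key covering that area */
    wr.send_flags          = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}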
B. Features Evaluated
1) Work Request Opcode Set: Several RDMA features were evaluated for this study. The most obvious feature evaluated was the work request opcode set being used for the transfer, although in practice this choice is often limited by the requirements of the application regardless of performance.
2) Message Size: The second item considered is the message size, which was arbitrarily categorized into small messages containing 512 bytes or less and large messages containing more. This size was chosen since 512 bytes is a standard disk sector; it is not part of any RDMA standard.
3) Inline Data: The API provides an optional “inline” feature that allows an interface adapter to copy the data from small messages into its own memory as part of a posted work request. This immediately frees the buffer for application reuse, and makes the transfer more efficient since the adapter has the data ready to send and does not need to retrieve it over the memory bus during the transfer.
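The sketch below (ours, not the paper's code) shows the inline variant of a small SEND. With IBV_SEND_INLINE set, the adapter copies the bytes at post time, so buf may be reused immediately; in our understanding of the verbs semantics the buffer then does not even need to be covered by a registered memory region, and the post fails if len exceeds the queue pair's max_inline_data.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_send_inline(struct ibv_qp *qp, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len };  /* lkey not needed */
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;  /* adapter copies buf now */

    /* buf may be reused as soon as this call returns, even before the
     * work completion is reaped */
    return ibv_post_send(qp, &wr, &bad_wr);
}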
4) Completion Detection: An asynchronous RDMA transfer starts when an application posts a work request to the interface adapter, and completes when the interface adapter enqueues a work completion in its completion queue. There are two strategies which an application can employ to determine when to pick up a work completion.
The first completion detection strategy, called “busy polling”, is to repeatedly poll the completion queue until a completion becomes available. It allows immediate reaction to completions at the cost of very high CPU utilization, but requires no operating system intervention.
The second strategy, called “event notification”, is to set up a completion channel that allows an application to wait until the interface adapter signals a notification on this channel, at which time the application obtains the work completion by polling. It requires the application to wait for the notification by transferring to the operating system, but reduces CPU utilization significantly.
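The hypothetical helpers below sketch the event notification path, assuming the completion queue was created with a completion channel chan (ibv_create_comp_channel). The queue must be armed with ibv_req_notify_cq before the work request is posted; the waiting side then blocks in ibv_get_cq_event, acknowledges the event, re-arms, and polls off the actual completion.

#include <infiniband/verbs.h>

/* Arm the completion queue; this must happen before the work request is
 * posted, otherwise an early completion generates no event. */
static int arm_cq(struct ibv_cq *cq)
{
    return ibv_req_notify_cq(cq, 0);   /* 0 = notify on any completion, not just solicited */
}

/* Block until the adapter signals the completion channel, then poll. */
static int wait_for_completion(struct ibv_comp_channel *chan, struct ibv_wc *wc)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    if (ibv_get_cq_event(chan, &ev_cq, &ev_ctx))  /* sleeps in the kernel */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);                  /* every event must be acknowledged */
    if (arm_cq(ev_cq))                            /* re-arm for the next completion */
        return -1;
    return ibv_poll_cq(ev_cq, 1, wc);             /* now pick up the work completion */
}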
5) Simultaneous Operations (Multiple Buffers): We set up the use of simultaneous operations per connection by posting multiple buffers in a round-robin fashion so that the interface adapter queues them.
6) Work Request Submission Lists: The functions that post work requests take a linked list of work requests as an argument. We compare the performance of creating a list of work requests and submitting them in a single posting (“multiple work requests per post”) with that of posting individual work requests as single element lists (“single work request per post”).
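A minimal sketch of the “multiple work requests per post” case (ours, assuming the caller has already prepared the array of send work requests): the requests are chained through their next pointers and handed to ibv_post_send in a single call. Setting next to NULL and posting each element separately gives the “single work request per post” case.

#include <infiniband/verbs.h>
#include <stddef.h>

static int post_wr_list(struct ibv_qp *qp, struct ibv_send_wr wr[], int n)
{
    struct ibv_send_wr *bad_wr = NULL;
    int i;

    for (i = 0; i < n - 1; i++)
        wr[i].next = &wr[i + 1];   /* chain the prepared work requests together */
    wr[n - 1].next = NULL;         /* terminate the list */

    return ibv_post_send(qp, wr, &bad_wr);   /* one call posts the whole list */
}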
7) Completion Signaling: For all transfer opcodes except RECV, a work completion is generated only if a “signaled” flag is set in the work request. If this flag is not set, the “unsignaled” work request still consumes completion queue resources but does not generate a work completion data structure or notification event. To avoid depleting completion queue resources, applications must periodically post a signaled work request and process the generated completion. We compare sequences containing only signaled work requests (“full signaling”) against sequences containing both signaled and unsignaled work requests (“periodic signaling”). SEND or RDMA WRITE WITH IMM with inline are good examples of where unsignaled work requests could be used, because the data area is no longer needed by the adapter once the request is posted, allowing the application to reuse it without first receiving a work completion.
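The hypothetical helper below sketches periodic signaling for a stream of RDMA WRITE requests: only every period-th request carries IBV_SEND_SIGNALED, and the completion queue is polled only when such a request is posted. This assumes the queue pair was created with sq_sig_all set to 0, so requests without the flag are unsignaled, and that the send queue is deep enough to hold the unsignaled requests, whose slots are recycled when the next signaled request completes.

#include <infiniband/verbs.h>

static int post_periodic(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_send_wr *wr, unsigned seqno, unsigned period)
{
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    wr->opcode     = IBV_WR_RDMA_WRITE;
    wr->send_flags = (seqno % period == 0) ? IBV_SEND_SIGNALED : 0;  /* periodic signaling */

    if (ibv_post_send(qp, wr, &bad_wr))
        return -1;

    if (wr->send_flags & IBV_SEND_SIGNALED) {
        do {                              /* reap only the signaled completion */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;
    }
    return 0;
}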
8) Infiniband Wire Speed: Infiniband hardware supports several different wire transmission speeds, and we compare the effect of these speeds on various performance measures.
9) RoCE: We compare the performance of RoCE to Infiniband.
II. TEST PROCEDURES
Our tests are variations of two simple applications, ping and blast. In the ping tool, a client sends data to a server and the server sends it back. Ping has variations for SEND/RECV and RDMA WRITE WITH IMM/RECV, but not for RDMA READ or RDMA WRITE, because with these opcodes a server cannot determine when a transfer
Figure 1. Ping with each completion detection strategy for small messages. (a) CPU usage (event notification only). (b) Average one-way time. (c) Average throughput.
Figure 2. Blast with each opcode set and completion detection strategy for small messages using one buffer. (a) Average one-way time. (b) Average throughput. (c) Average one-way time with and without inline, busy polling only.
has completed. In the blast tool, which can run with each of the 4 opcode sets, a client sends data to a server as fast as possible, but the server does not acknowledge it.
Tests are run between two identical nodes, each consisting of twin 6-core Intel Westmere-EP 2.93 GHz processors with 6 GB of RAM and PCIe-2.0 x8, running OFED 1.5.4 on Scientific Linux 6.1. Each node has a dual-port Mellanox MT26428 adapter with 256 byte cache line, one port configured for Infiniband 4x QDR ConnectX VPI with 4096-byte MTUs, the other for 10 Gbps Ethernet with 9000-byte jumbo frames. With these configurations, each Infiniband or RoCE frame can carry up to 4096 bytes of user data. Nodes are connected back-to-back on both ports, and all transfers use Reliable Connection (RC) transport mode, which fragments and reassembles large user messages.
We measure 3 performance metrics, all based on elapsed time, which is measured from just before the first message transfer is posted to just after the last transfer completes. Average throughput is the number of messages times the size of each message divided by elapsed time. Average one-way time per message for blast is the elapsed time divided by the number of messages; for ping it is half this value. Average CPU utilization is the sum of user and system CPU time reported by the POSIX getrusage function divided by elapsed time.
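Restated as formulas (a transcription of the definitions above, with n messages of s bytes each and the measured times t_elapsed, t_user, t_system):

\[
\text{throughput} = \frac{n\,s}{t_{\text{elapsed}}}, \qquad
\text{one-way time} =
\begin{cases}
t_{\text{elapsed}}/n & \text{(blast)} \\
t_{\text{elapsed}}/(2n) & \text{(ping)}
\end{cases}, \qquad
\text{CPU utilization} = \frac{t_{\text{user}} + t_{\text{system}}}{t_{\text{elapsed}}}
\]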
III. PERFORMANCE RESULTS
A. Ping example, small messages
Ping is the application generally used to measure round trip time. It repeatedly sends a fixed-size message back and forth between client and server. In our tests, message size varies by powers of 2 from 1 to 1024 bytes. Figure 1 shows a total of 8 combinations of SEND/RECV and RDMA WRITE WITH IMM/RECV, with and without inline, and with busy polling and event notification.
Figure 1a shows slight differences between the CPU usage by opcode sets, with inline requiring more cycles for messages less than 32 bytes due to the extra data copy involved. Only event notification cases are graphed, as busy polling always has 100% CPU usage. Figure 1b shows clearly that busy polling produces lower one-way times than event notification, and transfers with inline perform better than those without it (6.2 microseconds for messages smaller than the 256 byte cache line). Figure 1c shows that for each
Figure 3. Blast with RDMA WRITE with busy polling for small messages with inline using multiple buffers. (a) Average one-way time with single work request per posting. (b) Average throughput with single work request per posting. (c) Average one-way time with multiple work requests per posting. (d) Average throughput with multiple work requests per posting.
Figure 4. Blast with RDMA WRITE for each completion detection strategy for large messages using multiple buffers and multiple work requests per posting. (a) Active side CPU usage (event notification only). (b) Average one-way time. (c) Average throughput.
opcode set, throughput increases proportionally with message size and is slightly better for busy polling at any given size. There is little difference in throughput between opcode sets, with low total throughput for them all, since small messages cannot maximize throughput. Using either SEND/RECV or RDMA WRITE WITH IMM/RECV with inline and busy polling gives the best one-way time and marginally better throughput for small messages, but suffers from 100% CPU utilization.
B. Blast example, small messages, single buffer
Next is a blast study using small messages. It compares all transfer opcode sets for both busy polling and event notification. As with ping, busy polling cases have lower one-way time and higher throughput, as shown in Figure 2a and Figure 2b respectively. CPU usage for event notification cases is not significantly different between the opcode sets. As expected, RDMA READ with event notification performs poorly. However, RDMA READ with busy polling gives slightly better performance than other opcodes, which is odd, as RDMA READ is expected to perform worse because data must flow from the responder back to the requester, which requires a full round-trip in order to deliver the first bit. Figure 2c examines the use of inline in WRITE operations for small message blast. This is only done for busy polling
as it performs much better than event notification for small messages. Figure 2c shows that one-way time is lowest for both opcodes when using inline and messages smaller than the 256 byte cache line, although our adapters accepted up to 912 bytes of inline data.
C. Blast example, small messages, multiple buffers
Next consider the use of multiple outstanding buffers. We initially post an RDMA WRITE for every buffer, then repost each buffer as soon as we get the completion of its previous transfer. This way, the interface adapter processes posted work queue entries in parallel with the application code processing completions.
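A minimal sketch of this blast loop (ours, not the paper's code), assuming nbuffers does not exceed the total number of transfers, that wr[i] is a fully prepared, signaled RDMA WRITE work request for buffer i with wr_id set to i, and using busy polling:

#include <infiniband/verbs.h>

static int blast_loop(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_send_wr wr[], int nbuffers, long transfers)
{
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    long posted = 0, done = 0;
    int i, n;

    for (i = 0; i < nbuffers; i++, posted++)       /* prime the adapter's work queue */
        if (ibv_post_send(qp, &wr[i], &bad_wr))
            return -1;

    while (done < transfers) {
        n = ibv_poll_cq(cq, 1, &wc);               /* busy polling */
        if (n < 0 || (n > 0 && wc.status != IBV_WC_SUCCESS))
            return -1;
        if (n == 0)
            continue;
        done++;
        if (posted < transfers) {                  /* repost the buffer that just finished */
            if (ibv_post_send(qp, &wr[wc.wr_id], &bad_wr))
                return -1;
            posted++;
        }
    }
    return 0;
}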
In Figure 3a, we vary the buffer count for several message sizes using RDMA WRITE with inline and busy polling. One-way time for messages less than or equal to 64 bytes is only about 300 nanoseconds when using 8 or 16 buffers, and is less than 1 microsecond when using 2 or 4 buffers. For larger buffer counts, one-way time for messages smaller than 64 bytes increases, but remains around 300 nanoseconds for 64 byte messages. Throughput, shown in Figure 3b, increases proportionally with message size, except that throughput of 64 byte messages slightly exceeds that of 256 byte messages for 8 or more buffers, while throughput for smaller messages drops noticeably for 32 or 64 buffers.
Figure 5. Blast with each opcode set and each completion detection strategy for large messages using four buffers with multiple work requests per posting. (a) Active side CPU usage (event notification only). (b) Passive side CPU usage (event notification only). (c) Average one-way time. (d) Average throughput.
Figure 6. Comparing completion signaling strategies. (a) Average one-way time for blast with RDMA WRITE and busy polling for 16 byte messages, with each work request submission strategy using multiple buffers. (b) Average throughput for blast with RDMA WRITE and busy polling for 16 byte messages, with each work request submission strategy using multiple buffers. (c) Average one-way time for ping with client issuing RDMA WRITE and RDMA READ, with each completion detection strategy for small messages.
D. Blast example, small messages, multiple buffers, multiple work requests per posting
The next study is identical to the previous, except instead of posting each work request as we process its previous completion, we place it into a list and post that list after processing all available completions. Comparing one-way times in Figure 3c with those in Figure 3a shows that times for 256 and 912 byte messages are unchanged, but for 64 byte and smaller messages they increase when using more than 2 buffers, and for more than 16 buffers they increase to that of 256-byte messages. For messages of 64 bytes or less with 4, 8 or 16 buffers, Figure 3d does not show the increase in throughput seen in Figure 3b. Perhaps the time needed to process a large number of completions before posting a single list of new work requests causes the adapter’s work queue to empty. In all cases, posting multiple work requests produces less dependence on the number of buffers than does single posting of work requests.
E. Blast example, large messages, multiple buffers
We next examine the effects of the buffer count and message size on large messages, using RDMA WRITE without inline (since inline can be used only with small messages). We vary the buffer count from 1 to 128 for 1, 8, 32, and 64 kibibyte messages, and post multiple work requests per list. Figure 4a shows that 64 kibibyte messages have the lowest CPU utilization when using event notification, and, for all message sizes examined, using more than 4 buffers has little or no effect on CPU utilization. The one-way time, shown in Figure 4b, and throughput, shown in Figure 4c, both increase as message size increases. Also, for both one-way time and throughput, busy polling and event notification results converge given enough buffers (ranging from 2 or more buffers for 64 KiB to 5 or more for 1 KiB).
Next we study the effect of each opcode set for large message transfers. We vary the message size from 1 kibibyte to 256 mebibytes and use only 4 buffers, as it was just shown that using more buffers produces no performance gains. The CPU usage for the active and passive side of each transfer is shown in Figure 5a and Figure 5b, respectively. Active side CPU utilization generally decreases with message size, although there is a bump around 8 KiB. Passive side CPU utilization is always 0 for RDMA WRITE and RDMA READ, but is similar to the active side for SEND/RECV and RDMA WRITE WITH IMM/RECV. One-way time, shown in Figure 5c, and throughput, shown in Figure 5d, both
Figure 7. Blast with each opcode set and each Infiniband speed with busy polling for 64 KiB messages using multiple buffers with multiple work requests per posting. (a) Average throughput. (b) Average one-way time.
Figure 8. Comparison of QDR Infiniband and RoCE for blast with busy polling. (a) Average one-way time with each opcode set for small messages using one buffer. (b) Average throughput with each opcode set for small messages using one buffer. (c) Average one-way time with RDMA WRITE for large messages using multiple buffers with multiple work requests per posting. (d) Average throughput with RDMA WRITE for large messages using multiple buffers with multiple work requests per posting.
perform best for busy polling up to about 16 kibibytes, at which point there is no difference between busy polling and event notification. At 32 kibibytes and above, the transfer operation also has no effect on performance.
F. Completion signaling
All tests so far used full signaling. In this test we use blast with RDMA WRITE and 16 byte messages to compare full signaling, where every request is signaled, against periodic signaling, where only one work request out of every nbuffers/2 is signaled. Effects are visible only when using more than 2 buffers, since otherwise we signal every buffer. Both one-way time in Figure 6a and throughput in Figure 6b show much better performance when using one work request per post than when using multiple requests per post. But there is no performance difference between full and periodic signaling with one work request per post, and with multiple work requests per post the only effect of periodic signaling is to decrease performance when there are 4, 8 or 16 buffers. We believe this is due to the fact that in the blast example all buffers need processing after they are transferred, so not signaling for a completion just delays that processing until a signaled completion occurs, at which point all buffers transferred up to that time are processed in a big batch, breaking the flow of new transfer postings.
An example without this batching effect is ping with an active client using RDMA WRITE to push messages and RDMA READ to pull them back from a passive server. Every RDMA WRITE can be unsignaled because its buffer does not need any processing after the transfer. Figure 6c shows that one-way time is always lower when the RDMA WRITE is unsignaled, more so with event notification, less so with busy polling. This figure also shows remarkably little variation with message size.
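A minimal sketch of one round trip of this ping variant (ours, not the paper's code), under the same sq_sig_all = 0 assumption as the earlier signaling sketch, and assuming write_wr and read_wr are prepared work requests whose sg_list, remote_addr and rkey already describe the client and server buffers:

#include <infiniband/verbs.h>

static int ping_once(struct ibv_qp *qp, struct ibv_cq *cq,
                     struct ibv_send_wr *write_wr, struct ibv_send_wr *read_wr)
{
    struct ibv_send_wr *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    write_wr->opcode     = IBV_WR_RDMA_WRITE;
    write_wr->send_flags = 0;                   /* unsignaled: its buffer needs no processing */
    read_wr->opcode      = IBV_WR_RDMA_READ;
    read_wr->send_flags  = IBV_SEND_SIGNALED;   /* one completion per round trip */

    if (ibv_post_send(qp, write_wr, &bad_wr) ||
        ibv_post_send(qp, read_wr, &bad_wr))
        return -1;

    do {                                        /* busy-polling variant */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0)
        return -1;
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}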
G. Infiniband speed comparison
All previous tests were done at QDR speed, the maximum supported by our adapter. However, Infiniband adapters can be configured to run at several speeds, as shown in Gigabits per second (Gbps) in Table I. Usable Gbps is 20% lower than