IEICE TRANS.INF.&SYST.,VOL.E88–D,NO.1JANUARY2005
109 PAPER
MMLRU Selection Function:A Simple and Efficient Output Selection Function in Adaptive Routing
Michihiro KOIBUCHI†a),Akiya JOURAKU†,Nonmembers,and Hideharu AMANO†,Member
SUMMARY Adaptive routing algorithms,which dynamically lect the route of a packet,have been widely studied for interconnection net-works in massively parallel computers.An output lection function(OSF), which decides the output channel when some legal channels are free,is es-ntial for an adaptive routing.In this paper,we propo a simple and effi-cient OSF called minimal multiplexed and least-recently-ud(MMLRU). The MMLRU lection function has the following simple strategies for dis-tributing the traffic:1)each router locally grasps the congestion informa-tion by the utilization ratio of its own physical channels;2)it is divided into the two lection steps,the choice from available physical channels and the choice from available virtual channels.The MMLRU lection function can be ud on any type of network topology and adaptive routing algorithm.Simulation results show that the MMLRU lection function improves throughput and latency especially when the number of dimension becomes larger or the number of nodes per dimension become larger.
key words:output lection function,adaptive routing,virtual channel, interconnection networks,massively parallel computers
1.Introduction
Adaptive routing algorithms which can lect the route of a packet dynamically have been widely studied to make the best u of bandwidth of interconnection networks in massively parallel computers[1]–[3].When a packet en-counters a faulty or congested router in an adaptive rout-ing,another bypassing route can be lected.Thus,adap-tive routings are preferred over deterministic routings espe-cially when throughput of interconnection networks is cru-cial to system performance.Furthermore,an adaptive rout-ing is usually required to be deadlock-free becau worm-hole[4]or virtual cut-through[5]is ud for low-latency direct-communication in parallel computers.
An adaptive routing can be decompod of routing function,in which an adaptive routing algorithm provides a t of suitable deadlock-free outgoing channels,and lec-tion function,in which an output lection function(OSF)∗decides one of outgoing provided channels[9].A simple ex-ample of the behavior of an adaptive routing algorithm and the OSF is shown in Fig.1.In Fig.1,an adaptive routing
algorithm provides the three legal paths,and then an OSF lects the path by choosing the down direction from two routers in which the path branches.Notice that,since an adaptive routing algorithm performs deadlock-free opera-tion,the OSF is designed to only distribute the traffic.
A large number of adaptive routing algorithms have
Manuscript received February20,2004.
Manuscript revid July7,2004.
†The authors are with the Department of Information and Com-puter Science,Keio University,Yokohama-shi,223–8522Japan.
a)E-mail:koibuchi@am.ics.keio.ac.jp been propod[1]–[3]for relaxing the routing restrictions (increasing routing adaptivity)under the condition that no deadlocks are generated.On the other hand,a few re-arches of the OSFs have been done for only analyzing their impact using simple algorithms[2],[10]–[12].
The OSF lects an output channel depending on the condition of channels.For example,when a channel is be-ing ud(that is,in busy condition),the other free channel has priority over the busy cha
nnel in any OSF.On the other hand,when some legal channels are free,an OSF decides a single channel.And then,if another OSF is ud,a differ-ent output channel may be lected.Thus,performance of such systems is influenced by the OSF,and the fundamen-tal technique of the OSF as well as that of adaptive routing algorithm is needed to improve the performance of an adap-tive routing.
Since modern routers are required to be operational with a high clock frequency,the OSF should be simple but efficient.However,most of the existing OSFs are too sim-ple to balance network channels under non-uniform traffic patterns.
In this paper,we propo an OSF called minimal multiplexed and least-recently-ud(MMLRU)[13].The MMLRU lection function has the following simple strate-gies for distributing the traffic:1)each router locally grasps the congestion information only by the utilization ratio of its own physical channels;2)it is divided into two lec-tion steps:the choice from available physical channels,and the choice from available virtual channels on the physical channel.
The rest of the paper is organized as follows.Section
2∗The OSF is sometimes called a routing policy[6]–[8]in a fully adaptive routing on store-and-forward networks.
Copyright c 2005The Institute of Electronics,Information and Communication Engineers
110
IEICE TRANS.INF.&SYST.,VOL.E88–D,NO.1JANUARY2005
shows traditional OSFs.In Sect.3,we propo the MMLRU lection function,and Sect.4shows its evaluation results through aflit level simulation.Section5shows the evalu-ation results of OSF’s hardware amount.In Sect.6,related rearches on path choice are described and compared with the OSF,and conclusion is shown in Sect.7.
2.Traditional Output Selection Functions
In an adaptive routing,when a packet arrives at an interme-diate router,an output channel is dynamically decided from alternative legal channels provided by an adaptive routing algorithm.Then,an OSF lects an output channel depend-ing on the condition of channels.When there are two le-gal channels and a channel is being ud(that is,in busy condition),the other free channel has priority over the busy channel in any OSF.On the other hand,when some legal channels are free,the OSF lects one bad on its own pol-icy for well-distributed paths.
The simplest OSF is called“random lection func-tion”[2].It choos an output channel from available output channels at random.By using it,traffic will tend to be dis-tributed to various directions randomly.“Dimension order lection function”has been propod for k-ary n-cube and meshes[2].It choos an output channel which belongs to the lowest dimension from available output channels.For example,if there exist free output channels on x,y dimen-sion,this lection function choos the one on x direction.
On the other hand,“zigzag lection function”choos an output channel who direction has the maximum hops to the destination[2].That is,by using this lection function on the mesh topology,pa
ckets tend to be transferred to a diagonal direction toward the center of the network.For example,on a two-dimensional mesh,when a packet is nt from source s(x s,y s)to destination d(x d,y d)and both the x direction channel and y direction channel are free,compare the value|x d−x s|and|y d−y s|,and the packet will be nt
to the larger direction.
The OSFs have a high possibility to nd a packet to a congested direction even if there exist free legal channels to the other directions.This comes from that the OSFs take no thought of the network congestion dynamically.To address this problem,“VC-LRU lection function”which lects the least-recently-ud available virtual channel[11] was propod for networks with virtual channel mechanism. However,this method may cau the traffic to concentrate on a specific physical channel becau it lects an output channel taking care of only the virtual channel congestion. For example,the least-recently-ud virtual channel which belongs to the congested physical channel will be lected, and it will increasingly lead the traffic congestion.“Mini-mal multiplexation(MM)lection function”was propod for Silla’s minimal adaptive routing in system area networks (SANs)[11],[14].SANs,which usually accept irregular topologies,have been ud to connect nodes in PC/WS clus-ters or high-performance storage systems.The MM lec-tion function tries to minimize the num
ber of packets being multiplexed onto a physical channel at the same time.The other method to dynamically avoid the congestion is bad on the counter[15].“Load-dependent lection function”propod by us improves throughput and latency on torus networks,however,it may require large-sized counter for memorizing a large number of packets at a high traffic load.
On the other hand,the most suitable OSF on mesh-bad store-and-forward networks has been theoretically discusd by Badr and Podar under the condition that a traf-fic has no spatial and temporal access locality on a fully adaptive routing[6],and Wu and Schwiebert also discuss on the best one under the similar condition theoretically[7], [8].However,parallel applications usually have strong ac-cess locality and OSFs are required to cope with the burst-local traffic.Thus,such OSFs are not always to be best in parallel systems.
3.The MMLRU Selection Function
3.1Motivation
For modern routers with virtual channels,OSFs should be designed considering with the following requirements.
1.Balance the traffic among physical channels,and re-
duce multiplexed packets(occupied virtual channels) in a physical channel.
2.Reduce blocked packets when some virtual channels
are free in the legal output physical channel.
The former requirement aims to make the best u of all physical channels,and reduce the link transmission-delay of packets especially at low traffic load.On the other hand,the latter one is for efficient u of virtual channels.Since mod-ern adaptive routing algorithms[1],[3]t different rout-ing restrictions to virtual channels in a physical channel, a packet with the most vere routing-restriction may be blocked even if some virtual channels are free.Thus,the lat-ter requirement is needed to efficiently u virtual channels. From the former requirement,the traffic should be analyzed at the viewpoint of its distribution among physical channels. On the other hand,considering the latter requirement,the traffic should be analyzed at the viewpoint of its distribution among virtual channels on a physical channel.Thus,we consider that the OSF should be logically divided into two stages:choosing a physical channel from an available t of physical channels,and choosing a virtual channel from an available t of virtual channels on the physical channel.
However,as mentioned in Sect.2,traditional OSFs have a problem that they consider the traffic distrib
ution only among either physical channels or virtual channels.For example,the traffic distribution among virtual channels is cared in VC-LRU lection function,however,it is regard-less of the traffic distribution among physical channels.In this ction,we propo a novel OSF called minimal mul-tiplexed and least-recently-ud(MMLRU)which is com-pod of the two lection steps.
中国十大将军排名KOIBUCHI et al.:MMLRU SELECTION FUNCTION:A SIMPLE AND EFFICIENT OUTPUT SELECTION FUNCTION IN ADAPTIVE ROUTING
111
3.2The MMLRU Selection Function Algorithm
The MMLRU lection function is compod of the follow-ing two steps.
1.Choo an output physical channel from available ones.
2.Choo an output virtual channel from available ones
on the lected physical channel.
3.2.1Choice of the Output Physical Channel
In the former step,the output physical channel that has the smallest number of multiplexed packets among available physical channels is lected as well as the MM lection function[11].When more than one physical channel has the same number of packets,it lects one with the largest interval time since the last transferred packet header.This is just least-recently-ud(LRU)policy[13]which aims to minimize the times of multiplexing in a physical channel.
Examples of the MMLRU operation on a two-dimensional mesh are shown in Figs.2and3.Figure2 shows the number of packets in available output physical channels is compared,and the down physical channel is -lected.
On the other hand,Fig.3shows that the MMLRU -lection function lects the output physical channel to
the Fig.2The MMLRU lection function on2D mesh.(When two physical channels have a different number of free
Vchs.)
Fig.3The MMLRU lection function on2D mesh.(When both have the same number of free Vchs.)right physical channel.In this ca,packet B has just arrived at the physical channel to the down direction,while packet C is going away from the physical channel to the right direc-tion.We can estimate that the physical channel to the right direction will be free soon.Since the physical channel t
o the right direction would make the lower transmission(vir-tual channel)delay of packet A,the right direction would be better choice,炸酱面的炸酱怎么做
3.2.2Choice of the Output Virtual Channel
In the latter step,the MMLRU lection function choos the output virtual channel that has the most vere restric-tion of packet transfer from available ones on the physical channel.“Restriction of packet transfer”in this paper repre-nts the condition of deadlock-free packet transfer required in the adaptive routing algorithm.The example of the prior-ity of virtual channels in Duato’s protocol[3]is described in Sect.4.
3.3Properties of the MMLRU Selection Function
诈骗信息内容范本
We consider that the traffic congestion should be dynam-ically recognized in the OSF for uniformly distributing a traffic.Although there are some methods to recognize the congested situation,it is hard to collect exact congestion in-formation at each router.The MMLRU lection function simply employs the method in which each router locally judges the congestion only with the state of its own phys-ical channels in the former lection step.
The link transmission delay is also influenced by the number of packets being multiplexed on a physical chan-nel[14].Thus,the MM and LRU policies try to not only distribute the traffic uniformly among physical channels,but also minimize the link transmission delay.
On the other hand,the latter lection step tries to dis-tribute the traffic uniformly among virtual channels.As-suming that a virtual channel that has a relaxed restriction of packet transfer is given a high priority,a large number of packets tend to u the most relaxed virtual channel.In such cas,packets,that accept only the most relaxed virtual channel,may have a high possibility to be blocked,becau packets,that accept any virtual channel,tend to u the most relaxed virtual channel.Converly,the MMLRU lection function gives a virtual channel,that has the most vere re-striction,the highest priority.Thus,packets that accept only the most relaxed virtual channels tend to be transferred with-out being blocked becau they tend to u different virtual channels.
Although the MMLRU lection function is logically divided into the two lection steps,it can statically deter-mines the priority of virtual channels on a physical channel according to an adaptive routing algorithm.Thus,it may be possible that an output virtual channel is decided as soon as an output physical channel is dynamically lected.More-over,the MMLRU lection function can be ud on any type of network topology and adaptive routing algorithm,
112
IEICE TRANS.INF.&SYST.,VOL.E88–D,NO.1JANUARY2005
becau each router locally judges the congestion informa-tion only by the utilization ratio of its own physical and vir-tual channels.
3.4Design and Implementation
娃娃菜的家常做法
We show an implementation method of the MMLRU -lection function.The MMLRU lection function logically requires two phas—MM and LRU—in the lection of physical channel.It might em the MMLRU lection function introduces longer routing time than that of the other dynamic LRU,load-dependent,MM)in router.
However,it can be done in a single pha,that only compares log2(V+1)+log2P bits value of ports as shown in Fig.4,where V and P are the number of virtual channels and the number of ports,respectively.Each port independently ts the number of free virtual channels to log2(V+1)left-side bits of its counter for the MM function.Also,the LRU function controls log2P right-side bits of the counter at each port like the LRU algorithm ud in traditional cache-memory controls of processor.T
hat is,it rets the right-side bits of the counter for lected port as zero,and it incre-ments value of the right-side bits of another counters tho value is smaller than the(previous)counter value for -lected port.Notice that the MM and LRU operations can be parallelly done as background tasks.
When a packet header arrives at the MMLRU router, feasible output ports are checked in parallel and their counter value is got.After the output ports are checked, the MMLRU function compares counters of feasible ports. Then,the port with the highest number is lected,thus re-quiring a single pha to compare.Figure4is the same as the ca of tie of the MM values shown in Fig.3.As shown in thisfigure,the right direction,which has the higher num-ber,is also lected along this implementation.
All dynamic OSFs will require counters,which store information of local traffic,at implementation.When counter size is huge,routing time may be incread in router.However,Table1demonstrates that all dynamic OSFs except load-dependent lection function require few-bit-width counters to lect an output port.Load-dependent lection function employs a simple algorithm[15].How-ever,when the network accepts large message transfer unit (MTU),it requires large counters,which may introduce
a Fig.4Combined counters of the MMLRU lection function.long routing time.
Although hardware amount would be slightly incread in the MMLRU lection function,each module can work in parallel.Thus,we consider that routing time for lect-ing an output port is not incread compared with the other dynamic lection functions.Its evaluation results will be opened in Sect.5.
Notice that the step to lect an output virtual channel us the same mechanism as the OSFs introduced in Sect.2.
4.Evaluation of Throughput and Latency
In this ction,the performance of the MMLRU lection function is evaluated with aflit-level simulator,and com-pared with others(dimension order,random,zigzag,load-dependent,LRU,and MM.)
4.1Simulation Environment
4.1.1Network Model
Aflit-level simulator written in C++is developed.Network size and packet length are lected just by changing param-eters.Each node consists of a processor,requesting queues and a router which provides bidirectional links for connect-ing with neighboring nodes.A simple model consisting of channel buffers,crossbar,link controller and control circuits is ud for the switching fabric in the router.Parameters are t as shown in Table2.
We ignored thefirst5,000clocks for the evaluation be-cau the network is not stable in that period.A header-flit transfer requires at least three clocks in any OSF,that is,one for routing,one for transferring aflit from an input channel to output channel through a crossbar,and the rest for trans-ferring theflit to the next node or processor in any OSF.The model is simple compared with real routers.Nevertheless, it is uful for larger systems and complicated topologies becau more exact modeling of modern router fabrics con-sumes a huge simulation time.
冻的组词Table1Counter size(bit/port).
MMLRU log2(V+1)+log2P
LRU log2P
MM log2(V+1)
集成吊顶品牌load-dependent log2(MTU[flits])
Table2Simulation parameters.
Simulation time50,000clocks
(ignore thefirst5,000clocks)我的童年作文600字
Topology2D torus or3D torus
Network size16×16(256nodes),
32×32(1,024nodes),
8×8×8(512nodes),or
8×8×8×8(4,096nodes)
The number of3
virtual channels
Switching technique wormhole
KOIBUCHI et al.:MMLRU SELECTION FUNCTION:A SIMPLE AND EFFICIENT OUTPUT SELECTION FUNCTION IN ADAPTIVE ROUTING
113
4.1.2Adaptive Routing Algorithm
In this simulation,we u a fully adaptive routing algorithm called Duato’s protocol [3].Here,we will describe the adap-tive routing algorithm details especially on the usage of vir-tual channels in this simulation.
Duato states a general theorem defining a criterion for deadlock-free and then us the theorem to propo a fully adaptive,profitable,and progressive protocol called *-channel.Since the theorem states that by parating virtual channels on a physical channel into escape and adaptive par-titions.
Simple description of Duato’s protocol is as follows.a.Provide that every packet can always find a path toward its destination who virtual channels are not involved in cyclic dependencies (escape path.)
b.Guarantee that every packet can be nt to any destina-tion node using an escape path and the other path on which cyclic dependency is broken by the escape path (fully adaptive path.)By lecting the two routes (escape path and fully adaptive path)dynamically,deadlocks are prevented using minimal paths.
In the simulation,three virtual channels are provided on each physical channel as shown in Fig.5,becau Duato’s protocol requires three virtual channels on torus.In Fig.5,two virtual channels,CA and CH,are ud for dimension-order routing [4],and a packet which needs to u wraparound path us only CA channel and a packet which doesn’t need to u wraparound path is allowed to u the both CH and CA channels.Bad on the above restrictions,the channels provide an escape path,while CF channel is ud for a fully adaptive routing.In Fig.5,CH channel has the most strict restriction of packet transfer,and routing restriction of CA channel is more strict than CF channel.Notice that Duato’s protocol performs deadlock-free operation in any
OSF.
茄子红烧肉Fig.5Virtual channels required by Duato’s protocol on 2D torus.
4.1.3Output Selection Function
We evaluated dimension order,random,zigzag,load-dependent,LRU,MM,and MMLRU lection functions with Duato’s protocol on torus.Here,“LRU”lection func-tion [15]is similar to the VC-LRU lection function except that LRU policy is only applied to the lection step among physical channels.Then,it lects the output virtual channel which has the most vere restriction among available ones on the physical channel.The MM lection function,which was propod for Silla’s minimal routing in SANs,is also implemented with the following extension:it lects an out-put virtual
channel with the most vere restriction of packet transfer after it lects an output physical channel.4.1.4Tra ffic Pattern
The tra ffic pattern establishes the destination distribution of the packets.Tra ffic pattern will be an important perfor-mance factor in OSFs,becau access locality a ffects the performance of them.We u synthetic tra ffic and access traces from NAS Parallel Benchmarks 2.3[16]executed on the RHiNET-2cluster with two-dimensional torus [17].
We u bit reversal and matrix transpo packet desti-nation as the synthetic tra ffic.
•matrix transpo tra ffic
A node (x ,y )nds a packet to the node (k −y −1,k −x −1)(k is the number of nodes in each dimension)or (k −x −1,k −y −1)when x +y =k −1.•bit reversal tra ffic
A node with the identifier (a 0,a 1,···,a n −1)nds a packet to the node who identifier is the bit reversal (a n −1,···,a 1,a 0)of the source node.
Hosts inject a packet independently of each other.Each packet length is randomly lected from 128and 512flits.
In order to collect the traces,we run NAS Parallel Benchmarks on the RHiNET-2cluster.This cluster is a pro-totype in a rearch project called RHiNET (Real World Computing Partnership High Performance Network)[18],and it consists of hosts with specially designed network interfaces (RHiNET-2/NI)and switches (RHiNET-2/SW)connected with 8Gbit /c optical interconnects [19].Host specifications are Intel Pentium III 933MHz ×2(SMP),Serverworks ServerSet III HE-SL,PC133SDRAM 1GByte Memory,PCI 64bit /66MHz,and RedHat Linux 7.2(ker-nel 2.4.18.)We implement dimension-order routing in two-dimensional torus in order to collect the traces.NAS Parallel Benchmarks run under low software-latency via a low-level communication library PMv2and the SCore cluster sys-tem software [20].We collected traces of IS (Integer Sort),and MG (Multi-Grid solver)from NAS Parallel Benchmarks 2.3[16].The overall number of hosts (process)is 64.A small problem,class S,is ud,becau the RHiNET-2clus-ter includes a few unstable optical modules.However,the
114
IEICE TRANS.INF.&SYST.,VOL.E88–D,NO.1JANUARY 2005
communication ratio against computation is relatively in-cread under small-sized parallel benchma
rks,thus short-ening the packet interval.From the viewpoint of the OSF evaluation,this is valuable.
Since access traces are collected using the MPE pro-filing library,we calculate the raw packet-length from them by adding 44-Bytes for MPICH additional information,16-Bytes for PM packet header and tail,and 40-Bytes for the RHiNET packet header and tail.In the RHiNET-2clus-ter,each packet is transferred by splitting into 8-Byte flits,and minimum packet length is 17flits.When data-size is smaller than 17flits,a hardware padding supplies the re-maining flits.On the other hand,since maximum transfer unit (MTU)is 2K-Bytes,large-sized data is divided into v-eral packets.Although network interfaces and switches in this simulation are simplified for achieving enough simula-tion speed,the evaluation using the traces is valuable taking account into testing under various kinds of spatial and tem-poral access locality.4.2Simulation Results 4.2.1Synthetic Tra ffic
Figure 6shows simulation results under matrix transpo tra ffic on 2D torus (16×16,32×32),3D torus (8×8×8),and 4D torus (8×8×8×8)respectively.Each x and y axis reprents accepted tra ffic and network latency
respec-(a)16×162D torus.(b)32×322D
torus.
(c)8×8×83D torus.
(d)8×8×8×84D torus.
Fig.6
Matrix transpo tra ffic.
tively [9].Accepted tra ffic is the flit reception rate in a node in each clock cycle,and throughput is defined as the max-imum amount of accepted tra ffic [9].Figure 6shows that the latency of dimension order lection function is drasti-cally incread under the heavy tra ffic.This shows that the visiting order of directions greatly a ffects the latency under non-uniform tra ffic even with Duato’s protocol which e ffec-tively us physical channels.Random lection function decides an output channel at r
andom and thus,it ems that the tra ffic tends to be uniformly distributed.However,Fig.6shows that random lection function is poor performance compared with four OSFs that dynamically judge the tra ffic congestion.Zigzag lection function,which tries to make a largest number of available packet’s output directions per switch,is also poor performance compared with such four OSFs.Thus,it can be said that the methods to dynamically judge the tra ffic congestion are more e fficient than the sim-ple methods.
Here,we compare OSFs that dynamically consider the congestion.Unlike load-dependent lection function,the MMLRU lection function doesn’t u the information of packet’s length as a pointer.However,even in multi-ple packet lengths (128and 512flits),the MMLRU lec-tion function achieves better performance than LRU,load-dependent,and MM lection functions.The improvement tends to grow as the number of dimensions and the number of nodes per dimension become larger respectively.
Figure 7shows simulation results under the bit reversal