Network and Fault Tolerance*
Deepankar Medhi
Department of Computer Networking
University of Missouri–Kansas City
5100Rockhill Road
Kansas City,MO64110
1.INTRODUCTION
When we make a telephone call,the call is connected through a communication network to the receiving party. Similarly,when we nd an e-mail using the Internet,the message is nt through a communication network to the recipient.Such communication networks are made up of nodes and links that connect the nodes by hardware as well as the software components that allow for the to communicate through such networks.Network reliability
refers to the reliability of the overall provide communication in the event of failure of a component or components in the network.The term fault-tolerant is usually ud to refer to how reliable a particular component (element)of a network ,a switch or a router).The term fault-tolerant network,on the other hand,refers to how resilient the network is against the failure of a component.
Communication network reliability depends on the sustainability of both hardware and software.A variety of network failures,lasting from a few to days depending on the failure,is possible.Traditionally,such failures
were primarily from hardware that result in downtime(or“outage period")of a network element(a node or a link).Thus,the emphasis was on the element-level network availability and,in turn,the determination of overall network availability.However,other types of major outages have received much attention in recent years.Such incidents include accidentalfiber cable cut,natural disasters,and malicious attack(both hardware and software).The major failures need more than what is tradition
ally addresd through network availability.For one,the types of failures cannot be addresd by congestion control schemes alone becau of their drastic impact on the network.Such failures can,for example,drop a significant number of existing network connections;thus,the network is required to have the ability to detect a fault and isolate it,and then either the network must reconnect the affected connections or the ur may try to reconnect it(if the network does not have reconnect capability).At the same time,the network may not have enough capacity and capability to handle such a major simultaneous“reconnect"pha.Likewi,becau of a software and/or protocol error,the network may appear very congested to the ur[4,9,16].Thus,network reliability nowadays encompass more than what was traditionally addresd through network availability.
*to appear in Wiley Encyclopedia of Electrical&Electronics Engineering,1999.
In this article,we will u the term network reliability in a broad n and cover veral subtopics.We will start with network availability and performability,and then discuss survivable network design,followed by fault detection,isolation,and restoration as well as preplanning.We will conclude with a short discussion on recent issues and literature.
2WORK A V AILABILITY AND PERFORMABILITY
冰心是什么家
Network availability refers to some measure of the reliability of a network.Thus,network availability analysis considers the problem of evaluating such a measure.[Note that in the current literature,this is often termed as the network reliability analysis [2]].Moore and Shannon did early work in this area [28].We discuss network availability through an example.Figure 1shows that two telephones are connected by distribution gments (“A")to local switches(“S"),while the switches are connected by the facility (“B").The following allocation of outage/downtime percentage is assumed for the different elements:S,0.01%,A,0.01%,B,0.03%.Then,the availability of this connection is
;this translates to the maximum downtime of 368minutes per year.A A
B S
S Figure 1:Network view for availability example
碟鱼In general,network availability computation address the availability of a network in operational states,and discrete probability models are often ud in analysis.Let
denote the t of elements of a network (for examples all the nodes and links).Each element may be in “up"or “down"state,where “up"refers to fully operational and “down"refers to total loss of the element.Let denote that probability that element
is “up"—this is also referred to as the availability of element .Now consider the subt
of consisting of the “up"elements of state .Then,the probability that the network is in up state is given by
Note that there are possible states (where denotes the cardinality of the t );thus,usually network availability computation needs to deal with the problem of this exponential growth in states.A variety of algorithms for efficient computation have been developed over the years for different availability measures;the interested reader is directed to [2]and the references therein for additional information.
A related issue to availability is the performability.Most availability measures deal only with the connectivity aspect of the network;for example,what is the availability of a path from a source node to a destination node.
牛的英语怎么读
However,when a failure occurs,the network may not be able to perform at the same level as when there were no failure.For example,the average network blocking in voice telephone networks(circuit-switched networks)is typically the measure for grade-of-rvice(GoS).A common value of GoS is1%blocking under the normal operational mode, but under a specific outage,this may increa to more than10%blocking;similarly,in a packet-switched network, the average packet delay may increas
e by an order of magnitude during a major failure compared to under the normal circumstances.Thus,the network failure performability address the performance of the network under various failure states.Consider a network with elements that can each be either in operational or in a completely failed state;then,the total number of states is.The performability measure is given by
where is the probability of state,and is the ,network blocking in circuit-switched networks or average delay in packet-switched networks)in state.Again,we face the issue of the exponential number of states. This can,however,be bounded by considering most probable states as wasfirst shown by Li and Silvester[22]. Often,with the proper choice of,the performability measure can be quite accurately computed.For example,if in a network,multiple simultaneous link failure scenarios are extremely unlikely,then the most probable states are the failure of each link independently.Accordingly,one may limit the computation to the states.
3.SURVIV ABLE NETWORK CAPACITY DESIGN
While network failure availability and performability address important measures for evaluating the reliability of a network,designing the networks for survivability is extremely important for overall netw
ork reliability.In this ction,we address this topic for the capacity design problem using parate examples for circuit-switched networks and packet-switched networks.
3.1Circuit-Switched Traffic Networks Example
Consider the three-node circuit-switched network(Figure2)for which we are given that the availability of each link is99.9%.We assume that the network has symmetric offered traffic(or load)and capacity.Offered load in circuit-switched networks is given in erlangs;this load is the product of the the average call arrival rate and the average call holding time.For example,if the average call arrival rate is200calls/h and the average call holding time is3min, then the offered load is10erlangs(=).
1
2
3 Figure2:Three node network
For the symmetric three-node network,offered load between any pair of nodes is assumed to be10erlangs,and the link capacity on each link is given to be21trunks(or circuits).We assume that the traffic between each pair of nodes is routed on the direct link that connects the end nodes of the
pair,and we would like to know the call-blocking probability.For an offered load of erlangs,and trunks,and under the assumption that call arrival folows a Poisson process,Erlang-B loss formula can be ud for computing the blocking probability,which is given by
育儿文章
Thus,in our example,we have.That is,the network is providing a rvice quality (grade-of-rvice)of0.1%blocking.(In actuality,the blocking for each pair of nodes is slightly different becau any traffic blocked on the direct link can try the alternate route.)
Now,suppo that the link2–3fails;in this ca,the network is still connected becau node1is connected to node3via node2.Assuming that the network still has the same amount of offered load,the load between node2 and node3is now required to be routed through node1;thus,the load offered to each link is20erlangs,whereas the capacity on each link is still21trunks.Thus,the blocking en by traffic on each link is,and the blocking en by pair2–3traffic going through node1is even higher.Under the link independence assumption, the blocking on a path consisting of two links is given by,where is the link blocking probability.Thus, in our example,the blocking for traffic between2–3going through node1is.
皮肤出油什么原因
Thus,we can e that,under no failure,the network provides a grade-of-rvice of0.1%,whereas unde
太阳光的颜色
r a single link failure the worst traffic pair blocking is24.558%although the network connectivity is still maintained.Recall that the link availability was assumed to be99.9%;this means that the link can possibly be down for as long as8 hours in a year.If we assume one event per link per year,then this link could conceivably be down for up to8hours straight!In some networks,this may be unacceptable given that the worst traffic pair blocking jumps to24.558%from 0.01%.If we assume that the network should still provide a0.1%blocking grade even under a single failure for every traffic pair,then to accommodate for the worst path blocking,we need link blocking on each of the remaining links to be such that the path blocking for traffic between node2and node3using links2–1and1–3needs to satisfy ;this translates to for each link.Becau,now we have an offered load of20erlangs on each link,we need tofind the smallest such that.Solving for integral,wefind that needs to be at ,we need to have36units of capacity on links1–2and1–3each).By the same argument,if we consider the failure of a different link independently,then the other two links each need36trunks.Thus,to cover for failure of each link independently,each link needs36trunks to provide the same level of blocking as was originally wanted for the network in the nonfailure mode.In other words,the network needs80%more capacity to cover for a link failure compared to the no-failure ca although network availability requirement was met.司有和
风雨无阻造句3.2Packet-Switched Networks Example
Consider this time a three-node packet-switched network.We will u Figure2again.In packet networks,the offered traffic is usually given by the average packet arrival rate(packets per cond,pps in short).If the average packet arrival rate to a network link is and follows a Poisson process,the average packet size is exponentially distributed with mean kilobits and the link speed is kilobits per cond(Kbps),then the average packet delay(caud by the queueing phenomenon)can be obtained from the M/M/1queueing system and is given by
For the three-node example,we assume unit mean packet ,),in addition to assuming that the average arrival traffic between each pair of nodes is10packets per cond,and that the capacity of each link is30Kbps.If all traffic between each node-pair is routed on the direct link,this provides an average delay of s, or50ms.Now suppo that the link2–3fails,then the traffic between node2and node3is routed through node1; this induces an offered traffic of20pps on each remaining link.Thus,the average delay on each link(1–2and1–3)is 100ms which is obrved by traffic between nodes1and2and between nodes1and3.On the other hand,the traffic between nodes2and3will go over two links and will thus experience a delay of ms;this delay is four times more than under the nofailure situation.
If the network goal is to provide the average delay for any pair to be less than or equal to50ms under
a single link failure,then to meet this condition we need link capacity such that which implies that needs to be60Kbps on each of the remaining links.Similarly,if we consider the independent failure of a different link,then the other two links will require60Kbps to provide the same level of rvice.Thus,in this network,we e that we need to double the capacity to provide the same level of rvice obtained under a single-link failure.
3.3Discussion
We can e from the examples that if the network is not provided with additional capacity,then the traffic blocking can be very high in circuit-switched networks,which can result in excessive retry by urs,or the packet backlog(queue)can build up in packet-switched networks.Thus,a transient effect can take place.From the two examples for two different networks,we can also e that,in some circumstances,the network capacity needs to be 80%to100%more to provide the same level of rvice under a single link failure.This of cour depends on the network objective(in our examples,we have ud the objective that worst-pair traffic blocking or delay is minimized). In some networks,this near doubling of capacity can be cost-prohibitive;thus,the network performance requirement under failure may be relaxed.For example,under a single-element failure,it may be acceptable to have5%blocking under a single link failure for the circuit-switched network ca,or the
average delay is acceptable to be100ms for the packet-switched network ca.It is easy to e that this will reduce the additional capacity requirements in both cas.
Even though additional capacity can meet GoS requirement under a failure,the actual network topology layout and routing are also critical for survivable design[24].Thus,we also need to understand the network connectivity