Structural Analysis of Network Traffic Flows
Anukool Lakhina,Konstantina Papagiannaki,Mark Crovella, Christophe Diot,Eric D.Kolaczyk,and Nina T aft
ABSTRACT
Network traffic aris from the superposition of Origin-Destination (OD)flows.Hence,a thorough understanding of ODflows is esn-tial for modeling network traffic,and for addressing a wide variety of problems including traffic engineering,traffic matrix estimation, capacity planning,forecasting and anomaly detection.However,to date,ODflows have not been cloly studied,and there is very little known about their properties.
We prent thefirst analysis of complete ts of ODflow time-ries,taken from two different backbone networks(Abilene and Sprint-Europe).Using Principal Component Analysis(PCA),we find that the t of ODflows has small intrinsic dimension.In fact, even in a network with over a hundred ODflows,theflows can be accurately modeled in time using a small number(10or less)of independent components or dimensions.
We also show how to u PCA to systematically decompo the structure of ODflow timeries into three main constituents:com-mon periodic trends,short-lived bursts,and noi.We provide in-sight into how the various constitutents contribute to the overall structure of ODflows and explore the extent to which this decom-position varies over time.
A.Lakhina and M.Crovella are with the Depart-ment of Computer Science,Boston University;email: anukool,crovella@cs.bu.edu.K.Papagiannaki and C.Diot are with Intel Rearch,Cambridge,UK;email: dina.papagiannaki,christophe.
E. D.Kolaczyk is with the Department of Mathe-matics and Statistics,Boston University;email:ko-laczyk@math.bu.edu.N.Taft is with Intel Rearch, Berkeley;email:nina.This work was performed while M.Crovella was at Laboratoire d’Informatique de Paris6(LIP6),with support from Centre National de la Recherche Scientifique(CNRS),France and Sprint Labs.Part of this work was also done while A.Lakhina,K.Papagiannaki and N.Taft were at Sprint Labs and A.Lakhina was at Intel Rearch,Cambridge. This work was supported in part by a grant from Sprint Labs, ONR award N000140310043and NSF grants ANI-9986397and CCR-0325701.
Permission to make digital or hard copies of all or part of this work for personal or classroom u is g
ranted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.To copy otherwi,to republish,to post on rvers or to redistribute to lists,requires prior specific permission and/or a fee.
SIGMETRICS/Performance’04,June12–16,2004,New York,NY,USA. Copyright2004ACM1-58113-664-1/$5.00.Categories and Subject Descriptors
C.2.3[Computer-Communication Networks]:Network Opera-tions;C.4.3[Performance of Systems]:Modeling Techniques General Terms
Measurement,Performance
Keywords
Network Traffic Analysis,Traffic Engineering,Principal Compo-nent Analysis
1.INTRODUCTION
Much of the work in network traffic analysis so far has fo-cusd on studying traffic on a single link in isolation.How-ever,a wide range of important problems faced by network re-archers today require
modeling and analysis of traffic on all links simultaneously,including traffic engineering,traffic matrix esti-mation[18,19,27,33,34],anomaly detection[1,6],attack detec-tion[32],traffic forecasting and capacity planning[21]. Unfortunately,whole-network traffic analysis–i.e.,modeling the traffic on all links simultaneously–is a difficult objective,am-plified by the fact that modeling traffic on a single link is itlf a complex task.Whole-network traffic analysis therefore remains an important and unmet challenge.
One way to address the problem of whole-network traffic anal-ysis is to recognize that the traffic obrved on different links of a network is not independent,but is in fact determined by a common t of underlying origin destination(OD)flows and a routing ma-trix.An origin destinationflow is the collection of all traffic that enters the network from a common ingress point and departs from a common egress point.The superposition of the point-to-point flows,as determined by routing,gives ri to all link traffic in a network.Thus,instead of studying traffic on all links,a more di-rect and fundamental focus for whole-network traffic study is the analysis of the network’s t of ODflows.
杉篙However,even though ODflows are conceptually a more funda-mental property of a network’s workload than link traffic,analyz-ing them suffers from similar difficulties.The principal challenge prented by ODflow analysis is that ODflows form a high dimen-sional multivariate structure.For ex
ample,even a moderate-sized network may carry hundreds of ODflows;the resulting t of time-ries has hundreds of dimensions.The high dimensionality of OD flows is in fact a prime source of difficulty in addressing the whole-network analysis problems listed above.Thus the central problem one confronts in ODflow analysis is the so-called“cur of dimen-sionality”[7].
梦见小孩拉屎In general,when prented with the need to analyze a high-dimensional structure,a commonly-employed and powerful ap-proach is to ek an alternate lower-dimensional approximation to the structure that prerves its important properties.It can often be the ca that a structure that appears to be complex becau of its high dimension may be largely governed by a small t of in-dependent variables and so can be well approximated by a lower-dimensional reprentation.Dimension analysis and dimension re-duction techniques attempt tofind the simple variables and can therefore be a uful tool to understand the original structures. The most commonly ud technique to analyze high dimensional structures is the method of Principal Component Analysis[11] (PCA,also known as the Karhunen-Lo`e ve procedure and singu-lar value decompositon[28]).Given a high dimensional object and its associated coordinate space,PCAfinds a new coordinate space which is the best one to u for dimension reduction of the given object.Once the object is placed into this new coordinate space, projecting the object onto a subt of the axes can be done in a way that mini
mizes error.When a high-dimensional object can be well approximated in this way in a smaller number of dimensions,we refer to the smaller number of dimensions as the object’s intrinsic dimensionality.
In this paper,we u PCA to explore the intrinsic dimensionality and structure of ODflows using data collected from two different backbone networks:Abilene and Sprint-Europe.Even though both the networks have over a hundred origin-destination pairs,we show that on long timescales(days to week),their structure can be well captured using remarkably few dimensions.In fact,we find that using between5and10dimensions,one can accurately approximate the enmble of ODflows in each network.
In order to explore the nature of this low dimensionality,we in-troduce the notion of eigenflows.An eigenflow,derived from a PCA of ODflows,is a timeries that captures a particular source of temporal variability(a“feature”)in the ODflows.Each ODflow can be expresd as a weighted sum of eigenflows;the weights cap-ture the extent to which each feature is prent in the given ODflow. We show that eigenflows fall into three natural class:(i)deter-ministic eigenflows,which capture the predictable periodic trends in the ODflow timeries,(ii)spike eigenflows,which capture the occasional short-lived bursts in ODflows,(iii)noi eigenflows, which account for trafficfluctuations ap
pearing to have relatively time-invariant properties across all ODflows.This taxonomy,sys-tematically and quantitatively unearthed by PCA,can be viewed as being parallel to characteristics obrved in various analys of network traffic in the literature:periodic trends[21,25],stochastic bursts[26]and fractional Gaussian(or other)noi[17,22].Thus, the systematic decomposition of a t of ODflows into its consti-tutent eigenflows sheds light on the intrinsic structure of ODflows, and conquently on the behavior of the network as a whole.
In fact,by categorizing eigenflows in this manner,wefind that we can obtain significant insight into the whole-network proper-ties of data traffic.First of all,wefind that each ODflow is well captured by only a small t of eigenflows.Thus,each ODflow has a certain small t of features.Furthermore,the features vary in a predictable manner as a function of the amount of traffic car-ried in the ODflow.In particular,we show quantitatively that the largest ODflows in both networks are primarily deterministic and periodic;ODflows of moderate strength are generally comprid of both bursts and noi comparatively;and the weakest ODflows are primarily bursty(for Sprint-Europe)and primarily noi(for Abilene).This broad characterization of the nature of ODflows provides a uful basis for organizing and interpreting studies of whole-network traffic.
Finally,from a broader perspective,an important methodologi-
cal contribution of our work is the application of a dimension anal-
ysis technique to analyze the structure of network traffic.Although
we concentrate on timeries of traffic counts,analogous problems
ari when studying delay or loss patterns in networks.Examining
intrinsic dimensionality and structure in the manner we outline in
this paper may be fruitful in studying other network properties as
well.
This paper is organized as follows.We begin in Section2with
a discussion of the high dimensionality of ODflows and provide董英杰
the necessary foundations of Principal Component Analysis.We
outline the steps taken to collect and construct ODflows from both
the Sprint-Europe and Abilene networks in Section3.We then
apply PCA to ODflow timeries from both networks and prent
evidence of their low dimensionality in Section4.We elaborate
on the notion of eigenflows and show how they can be interpreted,
understood and harnesd in Section5.In Section6,we examine
the temporal stability of the decomposition of ODflows into their
constitutent eigenflows.The low intrinsic dimensionality of OD
flows at long timescales suggests new approaches to a number of
network engineering problems.A discussion of the,our ongoing
work and related work is in Section7.Concluding remarks are
prented in Section8.
2.BACKGROUND
In order to facilitate discussion in subquent ctions,wefirst
introduce relevant notation.Let denote the number of ODflows
in a network and denote the number of successive time intervals of
interest.In this paper,we study networks which have on the order
of hundreds of OD Flows,over long timescales(days to weeks)and
over time intervals of5and10minutes so that.Let be
the measurement matrix,which denotes the timeries of all
ODflows in a network.Thus,each column denotes the timeries
of the-th ODflow and each row reprents an instance of all
the ODflows at time.We refer to individual columns of a matrix
using a single subscript,so ODflow is denoted.Note that
thus defined has rank at most.Finally,all vectors in this paper are
column vectors,unless otherwi noted.
2.1OD Flows
An ODflow consists of all traffic entering the network at a given
point,and exiting the network at some other point.Each network
ingress and egress point rves a distinct customer population.1
Thus,each ODflow aris from the activity of a distinct ur pop-
ulation.
The traffic actually obrved on a network link aris from the
superposition of ODflows.The relationship between link andflow
traffic can be concily captured in the routing matrix.The ma-
trix has size(#links)(#flows),where ifflow travers link,and is zero otherwi.Then the vector of traffic
counts on links()is related to the vector of traffic counts in OD
flows()by.Traffic engineering is the process of ad-
justing,given some ODflow traffic,so as to influence the link
traffic in some desirable way.Thus accurate traffic engineering
and link capacity planning depends on a good understanding of the
properties of the ODflow vector.
In a typical network with PoPs(points of prence where traf-
fic may enter or exit the network)there are PoP-pairs,and hence
ODflows.Thus even in a moderate sized network with tens of
PoPs,there are hundreds of ODflows,meaning that is a vector 1We assume for purpos of discussion that routing changes do not affect where traffic for a particular population enters or exits.
Figure1:Illustration of PCA on a correlated,2-D datat. residing in a high dimensional space.Successive ODflow traffic measurements over time()then become a high dimensional mul-tivariate timeries.
Becau each ODflow is the result of activity of distinct ur populations,it is not clear to what extent ODflows share common characteristics.That is,it is not clear whether we should expect the columns
of to be related(so that the effective rank of is less than).A particularly powerful approach to answering the questions quantitatively is dimension analysis via PCA.
2.2Principal Component Analysis
PCA is a coordinate transformation method that maps the mea-sured data onto a new t of axes.The axes are called the prin-cipal axes or components.Each principal component has the prop-erty that it points in the direction of maximum variation or energy (with respect to the Euclidean norm)remaining in the data,given the energy already accounted for in the preceding components.2As such,thefirst principal component captures the total energy of the original data to the maximal degree possible on a single axis.The next principal components then capture the maximum residual en-ergy among the remaining orthogonal directions.In this n,the principal axes are ordered by the amount of energy in the data they capture.
The method of PCA can be motivated by a geometric illustra-tion.An application of PCA on a two dimensional datat is shown in Figure1.Thefirst principal axis points in the direction of max-imum energy in the data.Generalization to higher dimensions,as in the ca of,take the rows of as points in Euclidean space, so that we have a datat of points in I R.Mapping the data onto thefirst principal axes places the data into an-dimensional hy-perplane.
Shifting from the geometric interpretation to a linear algebraic formulation,calculating the principal components is equivalent to solving the symmetric eigenvalue problem for the matrix. The matrix is a measure of the covariance betweenflows.
Each principal component is the-th eigenvector computed from the spectral decomposition of:
(1) where is the eigenvalue corresponding to.Furthermore,be-cau is symmetric positive definite,its eigenvectors are or-2We will u the terms variation and energy interchangably in the rest of the paper.thogonal and the corresponding eigenvalues are nonnegative real.
By convention,the eigenvectors have unit norm and the eigenval-教务员
ues are arranged from large to small,so that.
To e that calculating the principal components of is equiva-
lent to computing the eigenvectors of,consider thefirst prin-cipal component.Let denote the vector of size corresponding
to thefirst principal component of.As mentioned earlier,the
first principal axis,,captures the maximum energy of the data:
(2)
where is the energy of the data captured along.The above equation can be rewritten as:
The quantity being maximized in the last equation above is the Rayleigh Quotient of.It can be shown that the eigenvector corresponding to the largest eigenvalue of(or thefirst eigen-vector)maximizes its Rayleigh quotient(e,for instance[28]).In this way,maximizing the energy of along thefirst principal com-ponent is equivalent to computing thefirst eigenvector of. Proceeding recursively,once thefirst principal components have been determined,the-th principal component corresponds to the maximum energy of the residual.The residual is the difference between the original data and the data mapped onto thefirst principal axes.Thus,we can write the-th principal component as:
By a similar argument,computing the-th principal component is equivalent tofinding the-th eigenvector of.Thus,in this manner,computing the t of all principal components,is equivalent to computing the eigenvectors of.
Once the data have been mapped into principal component space, it can be uful to examine the tr
ansformed data one dimension at a time.Considering the data mapped onto the principal components, we e that the contribution of principal axis as a function of time is given by.This vector can be normalized to unit length by dividing by.Thus,we have for each principal axis,
(3)
The are vectors of size and orthogonal by construction.The above equation shows that all the ODflows,when weighted by ,produce one dimension of the transformed data.Thus vector captures the temporal variation common to allflows along principal axis.Since the principal axes are in order of contribution to the overall energy,captures the strongest temporal trend common to all ODflows,captures the next strongest,and so on.Becau the t of capture the time-varying trends common to the ODflows,we refer to them as the eigenflows of.
The t of principal components can be arranged in or-der as columns of a principal matrix,which has size. Likewi,we can form the matrix in which column is.
Figure2:An eigenflow and its corresponding principal compo-nent.
Then taken together,,,and can be arranged to write each ODflow as:
(4) where is the timeries of the-th ODflow and is the -th row of.Equation(4)makes clear that each ODflow
is in turn a linear combination of the eigenflows,with associated weights.
In Figure2we show typical examples of an eigenflow and its corresponding principal axis.The eigenfl
ow captures a pattern of temporal variation common to the t of ODflows,and the extent to which this particular temporal pattern is prent in each ODflow is given by the entries of.In this ca,we can e that this eigenflow’s feature is most strongly prent in ODflow84(the strongest peak in).
The elements of are called the singular values.Note that each singular value is the square root of the corresponding eigenvalue,which in turn is the energy attributable to the respec-tive principal component:
(5) where the cond equality holds from Equation1,and the last equality follows from the fact that has unit norm.Thus,the singular values are uful for gauging the potential for reduced di-mensionality in the data,often simply through their visual exam-ination in a scree plot.Specifically,finding that only singular values are non-negligible,implies that effectively resides on an -dimensional subspace of I R.In that ca,we can approximate the original as:
(6)
where is the effective intrinsic dimension of.
In the next ction,we introduce the complete-ts of ODflow timeries from both networks that we have collected.In the c-tion that follows it(Section4),we analyze theflows using PCA.3.DATA
3.1Networks Studied
This analysis of OD-pairflow properties is bad on measure-ments from two different backbone networks.However,it is not specific to backbone networks and can be applied to different types of networks.
Sprint-Europe(henceforth Sprint)is the European backbone of a US tier-1ISP.This network has13Points of prence(PoPs) and carries commercial traffic for large customers(companies,lo-cal ISPs,etc.).Abilene is the Internet2backbone network.It has 11PoPs and spans the continental USA.The traffic on Abilene is non-commercial,arising mainly from major universities in the US.
3.2Flow Data Collected
Measuringflow data by capturing every packet at high packet rates can overwhelm available processing power.Therefore,we collected sampledflow data from every router in both networks. On the Sprint network,we ud Cisco’s NetFlow[5]to collect ev-ery250th packet.Sampling is periodic,and results are aggregated inflows at the network prefix level,every5minutes.On Abilene, the sampling rate is random,capturing1%of all packets using Ju-niper’s traffic sampling tool[12].The monitoredflow granularity is at the5-tuple level(IP address and port number for both source and destination,along wi
天堂里的爸爸
th protocol type)and sampled measurements are reported every minute.We aggregated the Sprint and Abilene flow traffic counts into bins of size10minutes and5minutes re-spectively to avoid possible collection synchronization issues. Using sampledflow data has two major drawbacks.First,when a link is lightly utilized,sampling every-th packet undersamples someflows.However,we found excellent agreement(within1%-5%accuracy)between sampledflow bytecounts,adjusted for sam-pling rate,and the corresponding SNMP bytecounts on links with utlization more than1Mbps.Most of the links from both networks fall in this category,and so our sampledflow bytecounts are likely to be accurate.Another problem with measuringflows by sampling packets on any link is that someflows are not sampled altogether. As[8,10]show,the unsampledflows have a small number of packets,carry very few bytes and so will have negligible impact on our aggregatedflow bytecounts.
3.3From Raw Flows to OD Flows
To obtain Origin-Destinationflows from the rawflows collected, we have to identify the ingress and egress points of eachflow.The ingress points can be identified becau we collect data from each ingress link in both networks.For egress point resolution,we u BGP and ISIS routing tables as detailed in[2,9].3Using this pro-cedure,we obtained the datats summarized in Table1.
#Pairs Type Time Bin Period Sprint-1169Net.Prefix10min Jul07-Jul13 Sprint-2169Net.Prefix10min Aug04-Aug10 Sprint-3169Net.Prefix10min Aug11-Aug17 Abilene121IP5-Tuple5min Apr07-Apr13 Table1:Summary of datats studied.
3For Sprint,we supplemented routing tables with router configu-rationfiles to resolve customer IP address spaces.Also,Abilene anonymizes the last11bits of the destination IP.This is not a sig-nificant concern becau there are few prefixes less than11bits in the Abilene routing tables,and we found very little traffic destined to the prefixes.
4.ANALYZING OD FLOWS
As described in Section2,the foundation of our approach is to u PCA to decompo an enmble of ODflows into its con-stituent t of eigenflows.In this ction,we prent the results of that process.Wefirst show that only a small t of eigenflows is necessary for reasonably accurate construction of OD traffic–meaning that ODflows in fact form a multivariate timeries of low effective dimension.Then we examine the structure of OD flows,that is,how each ODflow is decompod into constituent eigenflows.
4.1Low Dimensionality of OD Flows
As described in Section2.2,the energy contributed by each eigenflow to aggregate network traffic is summarized in the scree plot.We form scree plots by applying PCA to the Sprint and Abi-lene datats.In Figure4we show the scree plots for each datat. Thefigure shows the surprising result that the vast majority of traffic variability is contributed by thefirst few eigenflows;further-more,this effect is consistent in both networks.Both curves have a very sharp knee,showing that a handful of eigenflows,between 5and10,contribute to most of the traffic variability.In different terms,this result shows that the ODflow timeries together form a structure with effective dimension between5and10–much lower than the number of OD pairs(over100in each ca).
As an illustration of this low dimensionality of ODflows,we plot a sample of ODflows using a low-dimensional reconstruction. We do so by reprenting each ODflow using only thefirstfive eigenflows.This construction is given by Equation6,with. The results are shown in Figure3.Thefigure shows that even if we omit over100dimensions from the original data,we can capture the temporal characteristics of the ODflows remarkably well. What is the reason for this low dimensionality in ODflow
data? There are at least two ways in which this sort of low-dimensionality can ari.First,if the magnit
ude of variation among dimensions in the original data differs greatly,then the data may have low effec-tive dimension for that reason alone.This is the ca if variation along a small t of dimensions in the original data is dominant. Second,a multivariate timeries may exhibit low dimensionality if there are common underlying patterns or trends across dimensions –in
other words,if dimensions show non-negligible correlation. We can distinguish the cas in ODflow analysis by normaliz-ing the ODflows before performing PCA.The standard approach is to normalize each dimension to zero mean and unit variance.For ODflow data we have:
辣椒肉丝
Figure5:Scree plot for Normalized ODflows. where is the sample mean of.If wefind that ODflows still exhibit low dimensionality after normalization,we can infer that the remaining effect is due to the common temporal patterns amongflows.
The results of applying PCA to normalized versions of all datats is shown in Figure5.The most striking feature of this figure is that the sharp knee from Figure4remains,in nearly the same location.It is also clear that the relative significance of the first few eigenflows has diminished somewhat.
Taken together,the obrvations suggest that while differences
Inspecting the rows of for a number of our datats yields
some surprising obrvations about how ODflows are structured in
terms of eigenflows.4Ourfirst obrvation is that each ODflow is
comprid of only a handful of significant eigenflows.We demon-
state this as follows.
Considering any given row of,we are interested in how many
北京流动人口entries are significantly different from zero.We can make this pre-
ci by tting a threshold and counting how many entries in the
row exceed this threshold in absolute value.A resonable threshold
is,since a perfectly equal mixture of all eigenflows would result in a row of with all entries equal and,applying this rea-
soning across all rows simultaneously,the constraints that columns
of have unit norm must be enforced.
In Figure6,we plot the CDF of the number of entries per row of that exceed this threshold for our Sprint-1and Abilene datats. Thefigure shows that,regardless of datat,most rows of have less than20significant entries,and no row has more than35sig-nificant entries.In terms of ODflows,this means that any given ODflow is compod of no more than35significant eigenflows, and generally many fewer.This surprising result means that we 4Exhaustive prentation of such voluminous data i
s impractical in the current context,but the reader is invited to inspect[16]which displays all rows of for both Sprint-1and Abilene datats.
Figure7:Indices of the eigenflows constituting each ODflow. can think of each ODflow as having only a small t of“features.”
Thus,we should expect different ODflows to differ considerably
in the nature of the temporal variation that they exhibit.
Our cond obrvation concerns how ODflows differ.We note
that,in general,there is a relationship between the size of an OD
flow(its mean rate)and the eigenflows that compri it.To examine
this relationship,we can inspect where the above-threshold entries
of the matrix occur.Figure7shows the above-threshold entries
of the matrix for the Sprint-1and Abilene datats.In thefigure,
there is a dot for each entry in the matrix that exceeds in absolute value.Note that the columns of the matrix are organized
by convention in decreasing singular value order,and we have or-
dered the rows in order of decreasing ODflow rate as well.Thus
the top row in each plot indicates the eigenflows that are significant
in forming the strongest ODflow,and the bottom row indicates the
significant eigenflows for the weakest ODflow.
Thefigure shows two things:first,in general,the significant en-
tries in most rows of are clustered in a restricted range(this ef-
fect is more pronounced in the Sprint data than in the Abilene data).
消防月