Discovering Spatio-Temporal Causal Interactions
in Traffic Data Streams
Wei Liu
University of Sydney,
Sydney,Australia wei.liu@sydney.edu.au
Yu Zheng
Microsoft Rearch Asia,
Beijing,China
Sanjay Chawla
University of Sydney,
Sydney,Australia
sanjay.chawla@sydney.edu.au
Jing Yuan
University of Science and
Technology of China yuanjing@mail.ustc.edu
Xing Xie Microsoft Rearch Asia, Beijing,China
ABSTRACT
The detection of outliers in spatio-temporal traffic data is an important rearch problem in the data mining and knowl-edge discovery community.However to the best of our knowl-edge,the discovery of relationships,especially causal inter-actions,among detected traffic outliers has not been inves-tigated before.In this paper we propo algorithms which construct outlier causality trees bad on temporal and spa-tial properties of detected outliers.Frequent substructures of the causality trees reveal not only recurring interac-tions among spatio-temporal outliers,but potentialflaws in the design
of existing traffic networks.The effectiveness and strength of our algorithms are validated by experiments on a very large volume of real taxi trajectories in an urban road network.
Categories and Subject Descriptors
H.2.8[Databa Applications]:Data mining
General Terms
Algorithms
Keywords
Spatio-temporal outliers;outlier causalities;frequent sub-structures;urban computing and planning;
1.INTRODUCTION
The increasing availability of location-acquisition tech-nologies including GPS and WIFI have resulted in huge volumes of spatio-temporal data,especially in the form of Permission to make digital or hard copies of all or part of this work for personal or classroom u is granted without fee provide
d that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on thefirst page.To copy otherwi,to republish,to post on rvers or to redistribute to lists,requires prior specific permission and/or a fee.
KDD’11,August21–24,2011,San Diego,California,USA.
Copyright2011ACM978-1-4503-0813-7/$ajectories[2,5,7,15,14,19,17].Unusual patterns of mov-ing objects’trajectories generally reflect abnormal traffic streams on road networks,which could be caud by aperi-odic events including celebrations,parades,large-scale busi-ness promotions,protests,traffic control and traffic jams. Therefore,the detection of outliers/anomalies from trajec-tory data can help in nsing abnormal events and plan for their impact on the smoothflow of traffic.In this study, we treat both known(planned)and unknown(unplanned) events that behave differently from daily network traffics as anomalies.
Challenges and Contributions
In order to successfully detect outliers and causal interac-tions among them,the following challenges need to be ad-dresd:(i)Heterogeneous traffic patterns:the traffic pat-terns on roads vary across days of a week and hours of a day. Different road gments have often distinct time-variant traf-fic p
atterns.It is difficult to u one model to detect outliers across the road network at different time periods.(ii)Data sparness and distribution skewness:even though we could have a large number of nsors(taxis)probing the traffic on roads,there are many roads that have only a small number of samples given a large size of road networks in a major city. Moreover,a few road gments are traveled by thousands of vehicles in a few hours,while some gments may be only driven on a few times in a day.The two properties to-gether result in unique challenges in detecting outliers from traffic data.(iii)Causality among outliers:we not only need to discover outliers from the traffic,but also infer causal re-lationships and interactions among them,especially given the large number of outliers that could be identified.So a challenge is how to detect the appearance,growth,disap-pearance and transformation of outliers by ,prop-agation of a traffic jam).
In this paper we design veral steps to address the above challenges and propo solutions to the problem of detect-ing spatio-temporal outliers and causal relationships among them from traffic data streams.We u contexts of road networks in this study,however,algorithms propod in this paper can be generally applied to domains of networking[12,
Figure 1:An example using traffic networks in the city of Beijing.Bad on major roads in the traffic network,the entire city (subfigure (a))is partitioned into regions (subfigure (b)).Trajectories of moving objects (such as a moving taxi shown by a blue trajectory in subfigure (c))connect neighboring regions,bad on which we create the notion of links (subfigure (d)).13]and climate change [18,23,8]etc.More specifically,the contributions we make in this paper are:
1.City-wide traffic modeling :we partition the urban area of a city into regions using the framework of the city’s road network.Then we build a region graph where a node is a region and a link captures the traffic flow among two regions.We formulate the outlier detec-tion problem as identifying the most outlying “links”from the region graph in terms of the spatio-temporal properties of a link.
2.Outlier tree construction :we propo the STOTree al-gorithm bad on both spatial and temporal properties of detected outliers (which are certain “links”in a time frame)to construct outlier trees,which uncover causal relationships among the outliers.
3.Frequent outlier subtree discovery :we propo the fre-qentSubtree algorithm,inspired by association rule mining,which generates the most frequent sub-structure (subtree)from all discovered outlier trees.The fre-quent subtrees reveal recurrent abnormalities in the data and suggest inherent problems in existing road networks.The rest of the paper is structured as follows.In Section 3we introduce the overall framework of our model,includ-ing preliminary concepts and notations that we u in this paper.In Section 4we propo algorithms for discovering spatio-temporal outliers and causal relationships.Exper-iments and their analysis are reported in Section 5.We conclude in Section 6with directions for future work.
2.RELATED WORK
To the best of our knowledge this is the first paper that propos the problem of discovering casual relationships among spatio-temporal outliers.However,there have been a num-ber of efforts on detecting only outliers from spatially and temporally distributed data.For example,principle compo-ne
nt analysis (PCA)has been ud for network-wide anomaly detection [13,26,20,1].However,PCA results (as we will show)cannot capture volume heterogeneity and are also very nsitive to parameter ttings which are highly data depen-dent.Under certain circumstances,large anomalies in turn can effect the PCA computation leading to both fal posi-tives and fal negatives [20].
The problem of outlier monitoring has also been studied in [2]which builds local clusters on trajectory streams and monitors outliers that are defined by a “trajectory”(instead of a spatial link as ours).Another method is Sun et al.[22]where a measure,spatial local outlier measure (SLOM),is propod to capture the local behavior of datum in their spa-tial neighborhood.This measure takes into account the local stability around a data point and suppress the reporting of outliers in highly unstable areas.A generalized local sta-tistical (GLS)framework is propod in [6]which studies the performance of local bad methods on detecting outliers in geo-statistical data with either linear or nonlinear trends,and compares them against global bad methods.Wu et al.[23]design algorithms detecting the most abnormal dis-crepancy regions in precipitation data,where they u four sweep lines to form grids which are treated as regions.How-ever,none of the approaches model and capture tempo-ral relations (causalities)among detected outlying regions.Lee et al.[15,14]have designed a “partition-and-group”framework for clustering and detecting trajectory outliers.In their app
roach,they first partition the trajectories into small gments and then u both distance and density to detect abnormal sub-trajectories.This is different from our work as we detect abnormal regions and links of the entire traffic network (instead of objects moving in the network).Moving objects are usually associated with periodic be-havioral patterns,and there have been veral methods pro-pod to address the problem of detecting such periodic movements [3,4,10,9,16].Cao et al.[3,4]propod abbre-viated list tables (ALT)to find subquences that appear periodically and frequently in data quences,but the pe-riodic patterns they detect are very nsitive to parameter ttings.Similarly,Elfeky et.al.[9]have propod specific definitions of periodicities and algorithms for identifying the periodic patterns.
名下房产查询3.OVERVIEW
In this ction,we introduce our notations,definitions and the main structure of the propod model.
农业供给侧改革3.1Preliminary Concepts
The overall traffic map is partitioned into regions (Rgn )bounded by high level (i.e.major)roads,each of which may consist of a number of road gments.Figure 1(a)and 1(b)demonstrate an example of region formations.
Definition 1.Trajectory :A trajectory T r is a trace cre-
Figure2:The overall structure of our model for detecting spatio-temporal outliers and their inter-causalities.
ated by a moving object in geographical space.A T r is rep-rented by of a t of time-ordered T r:p1→p2→...→p n,where each point consists of a geospatial co-ordinate t and a p=(longitude,latitude, timestamp).
Definition2.Transition:Given a trajectory T r:p1→p2→...→p n,there exists a transition S between Rgn1and Rgn2,if there exists adjacent points p i and p i+1(1≤i≤n+1)such that p i is in Rgn1and p i+1is in Rgn2,and Rgn1is not the same to Rgn2.A transition is associated with a leaving time(timestamp of p i)and an arriving time (timestamp of p i+1).
班主任系统Definition3.Link:A link(Lnk)is comprid of a pair of regions(<Rgn o,Rgn d>)indicating a virtual spatial con-nection between the origin region and the destination re-gion.There exists a link from one region(Rgn o)to another (Rgn d)if and only if there exists at least one object moving from Rgn o at timestamp ts i to Rgn d at ts i+1.Figure1(c) and1(d)give examples of links.
Definition4.Time frame:A time frame(tf)is a t of concutive time intervals1.Figure3(a)shows an example of a time frame.
Definition5.Spatio-temporal outlier:A spatio-tem-poral outlier(ST O)is a link who non-spatial and non-temporal attribute values are very different from the values of its spatio-temporal neighbors.Spatio-temporal neighbors of a link Lnk i are the links who locations and timestamps are clo to tho of Lnk i.
Definition6.Outlier causality:ST O2(with a region pair<Rgn o2,Rgn d2>and a time frame tf2)is cau
d by ST O1(containing a region pair<Rgn o1,Rgn d1>and a time frame tf1)if and only if the following conditions hold true: (i)The destination of ST O1is the same as the origin of ST Rgn d1=Rgn o2);(ii)Time frames tf1and tf2 are concutive to each other and tf1is ahead of tf2. During the construction of an outlier tree,an outlier ST O i is a a child of another outlier ST O j if ST O i is caud by ST O j.
3.2Framework
The main structure of our model is illustrated in Figure2. The three main steps are preprocessing traffic data to build a region graph,detecting outliers andfinally discover causal 1We call a“time interval”a“timebin”in the rest of lationships between the discovered outliers.The cond and the third step have three and two sub-steps respectively. Details of the(sub-)steps are described in the following ction.
黄未
4.METHODOLOGY
In this ction,we provide details of our model as shown in Figure2.Specifically,we focus on the detection of spatio-temporal outliers bad on each link’s“distort”,construct outlier causality trees bad on the outliers,and discover the most frequent causal trees which are indicative of recur-ren
t traffic abnormalities.
4.1Building the Graph of Regions
有你好幸福
In our study,we assume the map of traffic network,the t of major roads,and the trajectories of objects are all known. Although it is more straightforward to apply a simple“n×m”grid on maps to define regions,cells of a grid are equal-sized and do not reflect natural differences of regions in a traffic network.So instead of using equal-sized grids,we define regions of a traffic network by road gments as illustrated in Figure1.In detail,we build a graph of regions according to the following three steps.
1.Region Partitioning:As shown in Figure1(a)and Fig-
ure1(b),we partition a city into dis-jointed regions using the major roads of the city.Here we employ Connected Components Labeling(an image gment method)[21]to partition a map into regions effectively and efficiently,since the problem of subdivisions in a polygonal region is known to be NP-complete[11].
2.Formulating transitions:By scanning the trajectory
额手称庆什么意思
data t,we transfer each trajectory into a quence of transitions between pairs of regions in terms of defi-nition2.As demonstrated in Figure1(c),a trajectory passing three regions a,b,and c results in two transi-tions:a⇒b,and b⇒c.
3.Generating links:If there is a transition generated be-
tween two regions,we connect the two regions with a link(refer to Definition3).In timebin j,a link i(Lnk i
=<Rgn o,Rgn d>)is associated with a feature vector of three properties f i,j≡<#Obj,Pct o,Pct d>:
(a)#Obj:Total number of objects on the
objects moving from Rgn o to Rgn d in this time-
bin);
(b)Pct o:The proportion of#Obj among all objects
moving out of Rgn o in this timebin;
(c)Pct d:The proportion of#Obj among all objects
moving into Rgn d in this timebin;
Then,using Figure1(d)as an example(where the number shown on each link is the number of transitions pertaining to the link),the property of link a⇒b is
f
i,j=<#Obj=2,Pct o=
2
2+3
=0.4,Pct d=2
2+6
=0.25>.
4.2Detecting Outliers from Graph Links Assume each time frame is comprid of afixed number of q timebins.Given a time frame tf j,we denote the quence of feature values of a link Lnk i in this time frame by:
F i,j=< f i,j−q+1, f i,j−q+2,..., f i,j>.(1)
(a)An example of time frame defined by the gment between two vertical(green)lines.
(b)Time frames to be compared with the one in(a)
Figure3:An example of the number of taxis on a certain link in four adjacent days.Gaps between gre
en vertical lines reprent time frames,each of which is comprid of v-eral timebins.One unit of timebin in x-axis reprents30 48timebins compri a day).The value of minDistort of the time frame in subfigure(a)is obtained by calculating the smallest difference between the time frame
in(a)and the ones at the same time in adjacent days as shown in subfigure(b).
For each link(Lnk i)in each time frame tf j,we calculate the distortion between two time frames(denoted by minDistort i,j) by arching for the minimum difference between tf j and the same time frames of the same days on concutive weeks. With this approach minDistort is capable of capturing the special pattern of traffic data that similar behaviors are ob-rved during the same time of different days or the same day of different weeks etc.
Algorithm1shows the procedure of calculating distorts.
In line7of the algorithm,we obtain the difference between two time frames of a link by computing their Euclidean dis-
tance:
Distance(tf j,tf t,Lnk i)=
q−1
k=0
f i,j−k− f i,t−k 2(2)
We u minDistort i,j obtained from Algorithm1as the “non-spatial and non-temporal attributes”(e Definition5) of each link in each time frame.Extreme values among minDistort of all links are identified as temporal outliers. By subtracting the min and dividing by the max the feature values of the links are in the range of[0,1].The normal-ization removes the effect of different regions and volume sizes.Another advantage of using minDistort is that it prevents the examination of many repeating patterns(where minDistort≈0).
Then for each time frame,there is a corresponding three dimensional vector(formed by features<#Obj,Pct o,Pct d>) shown in Figure4.As each point reprents a link,we iden-tify the most extreme points as outliers in that time frame. To normalize the effect of variances along different direc-tions,we u the Mahalanobis distance(instead of Euclidean distance)to measure the the extremeness of data points.We Algorithm1minDistort:calculating minimum distort of time quences
Input:Lnk i:a link;tf j:a time frame;t:number of adjacent weeks to check
Output:minDistort i,j:the degree of distort for link Lnk i in time frame tf j菖蒲怎么读
——————————————–
1:minDist←+Infinity;
2:T←tf j±u weeks,u∈{−t,...,t}
3:for All time frames tf t in T do
4:if tf t overlap with tf j then
5:Continue;
6:end if
7:currentDist←Distance(tf j,tf t,Lnk i);
8:if currentDist<minDist then
9:minDist←currentDist;
10:end if
11:end for
12:Return minDist;
手拉着手
Figure4:An illustration of the three-dimensional unit cube formed by three features<#Obj,Pct o,Pct d>before nor-malization for computing the Mahalanobis distance.We la-bel the most“extreme”points among all points as outliers. For example,outlier points are the ones who distance are the farthest to the center of the data cluster.
u the Mahalanobis distance here in order to normalize the distance by the variance in different directions.
In this way,the outliers detected are links who features have the largest difference from both their temporal neighbors(for using“minDistort”)and spa-tial neighbors(for being detected“among all links”)–so they are spatio-temporal outliers(STOs).Another ad-vantage of identifying“extreme points”as outliers is that it can detect abnormal links with either too low volumes or too high volumes since extremeness of points are bad on their Mahalanobis distances.
Now each STO is a spatial link associated with a time frame.We reprent a STO by its link Lnk i(containing an original region and a destination region)and its time frame tf ,STO i,j=<Rgn i,o,Rgn i,d,tf j>.
4.3Constructing Outlier Trees
We propo an algorithm named STOTree thatfinds out-lier dependencies by looking at the relationship of outliers from the earliest time frame through the last.The main insight of STOTree is that an outlier ST O1is a par-ent of another outlier ST O2if ST O1occurred before ST O2in time and they are spatially correlated.Al-
Algorithm2STOTree:constructing all outlier trees Input:ST Outlier:a t of spatial-temporal outliers of size t×k
where t is the number of time frames,and k is the number of outliers to examine in a time frame.
Output:ST OT rees:a list of roots of spatial-temporal trees.——————————————–
1:ST OT rees←an empty t{};
2:for Each time frame i(i∈(1,...,t))do
3:for Each outlier j(j∈(1,...,k))in time frame i do
4:ST ORoot i,j←FindAllChildren(ST Outlier i,j,i);
5:ST OT rees←ST OT rees∪ST ORoot i,j;
6:end for
7:end for
8:Return ST OT rees;
——————————————–
Subroutine:FindAllChildren(ST Outlier i,j,i)
9:if Time frame i is the last time frame then
10:Return ST Outlier i,j;
11:end if
12:ST Outlier i,j.subnodes←an empty t{};
13:for Each outlier u(u∈(1,...,k))in time frame i+1do 14:if ST OT rees contains ST Outlier i+1,u then
15:continue;
16:end if
17:if ST Outlier i,j.Rgn d==ST Outlier i+1,u.Rgn o then 18:ST Outlier i,j.subnodes←ST Outlier i,j.subnodes∪FindAllChildren(ST Outlier i+1,u,i+1);
19:end if
20:end for
21:Return ST Outlier i,j;
gorithm2demonstrates the process of constructing outlier trees from discovered outliers.Note the algorithm results in a collection of trees(a forest).The subroutine(Line9 to21)is a recursive function ud to retrieve all possible descendants of a node.For each time frame,this recursive function is called on each outlier of the current time frame to compare with each outlier of next time frame,unless the “current”outlier tree already contains outliers of next time frame(Line14to16).So the overall time complexity of the outlier tree construction process on each time frame is upper bounded by O(k2),where k is the number of outliers in a time frame.
We do not place a restriction on the maximum size of outlier trees in the STOTree algorithm,under the assump-tion that abnormal events caud by one single accident are not expected to last for a long time and the size outlier trees should not grow infinitely.In Section5.3we provide empirical evidence that confirms the maximum size of trees is usually small.
Now we give an example by using Figure5to demonstrate the process of Algorithm2for building outlier trees.Fig-ure5us top3outliers in three concutive time frames,so the input parameters in Algorithm2in this ca are k=3 and t=3.The algorithm starts from time frame1(Line2 of Algorithm2),a
nd for each of the top three outlying links (Line3to6),i.e.,A⇒B,C⇒D and E⇒F,the algo-rithm arches in time frame2(Line13to20)and checks whether there is any following link that can be a child of a previous link(Line17to19).This allows the algorithm to find outlying links B⇒G and B⇒E as children of A⇒B; and similarly it identifies link H⇒K in time frame3as a child of J⇒H in time frame2.Therefore two outlier trees are built up as shown in the right side of Figure5.In this way,Algorithm2scans through all time frames of traffic data we have,and builds a forest of various outlier
trees.Figure5:A synthetic example for demonstrating the process of building a forest of two outlier trees.The subfigure on the left illustrates top3outlying links in three concutive time frames,and the one on the right shows two outlier trees obtained from the time frames.
4.4Causal Outlier Detection
Denote by T the forest containing all outlier trees.The most significant and recurring causal relationships corre-spond to the most frequent subtrees of T.The mechanism of discovering frequent subtrees from all outlier trees is in-spired by the process of mining frequent item ts,except that we design our own strategy to generate frequent subtree candidates(through node inrtion on trees).
The process of discovering frequent substructures from constructed outlier trees is shown in Algorithm3.Given a predefined support threshold ,wefirstfind all single nodes who supports exceed (Line3of the algorithm),then we u this t of frequent single nodes to form candidates of frequent subtrees.The“while”iteration(Line6to30)first generates candidates of subtrees(Line9to15),and then checks the support of each candidate and performsfiltering (Line18to29)according to .
When generating subtree candidates,new subtrees(who sizes are incread by one)are created by inrting a frequent single node into previous frequent subtrees.This node in-rtion process is gi
ven in Algorithm4.The single node to be inrted isfirst compared with the root of the tree,and is inrted as a subnode of the root(Line1to3of Algorithm4) if the root can be a parent of the single node and its existing children do not contain the single node.Otherwi,the sin-gle node is compared and checked whether it can be inrted into branches below the a recursive process shown in Line8to12).When counting the support of a candidate subtree,we increa the frequency of the candidate by one if all nodes(with their immediate subnodes)of the candidate have an exact match with a discovered outlier tree(Line21 to23).
The effectiveness and strengths of our algorithms,STOTree and frequentSubtree,are evaluated in the next ction. 5.EXPERIMENTS AND ANALYSIS
In this ction we report on the experiments carried out on taxi trajectory data on the road network of Beijing city. Our experiments are conducted on a64bit rver with3.2 GHz CPU and8GB memory.We note that although we are using road traffic data,our methods and algorithms can be