The VLDB Journal (2005) 14: 211–221 / Digital Object Identifier (DOI) 10.1007/s00778-004-0125-5

An effective and efficient algorithm for high-dimensional outlier detection

Charu C. Aggarwal, Philip S. Yu

IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA

Edited by R. Ng. Received: November 19, 2002 / Accepted: February 6, 2004
Published online: August 19, 2004 – © Springer-Verlag 2004
Abstract. The outlier detection problem has important applications in the fields of fraud detection, network robustness analysis, and intrusion detection. Such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse, and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.
Keywords: Data mining – High-dimensional spaces – Outlier detection
1 Introduction
An outlier is defined as a data point that is very different from the rest of the data based on some measure. Such a data point often contains useful information on abnormal behavior in the system characterized by the data. Outlier detection techniques find applications in credit card fraud, network intrusion detection, financial applications, and marketing. This problem typically arises in the context of very high-dimensional data sets. Much of the recent work on finding outliers uses methods that make implicit assumptions of relatively low dimensionality of the data. These methods work quite poorly when the dimensionality is high and the data become sparse.
Many data-mining algorithms in the literature find outliers as a by-product of clustering algorithms [2,3,6,16,24]. However, these techniques define outliers as points that do not lie in clusters; thus, they implicitly define outliers as the background noise in which the clusters are embedded. Starting with the work in [8], recent literature [10,20,21,23] defines outliers as points that are neither a part of a cluster nor a part of the background noise; rather, they are specifically points that behave very differently from the norm. Such outliers are more useful because of their value in determining behavior that deviates significantly from average behavior. In many applications (e.g., network intrusion detection), such records may provide guidance in discovering important anomalies in the data. Such points are also referred to as strong outliers in the work discussed in [21]. In this paper, we will develop algorithms that generate only outliers that are based on their deviation value.
Many algorithms have been proposed in recent years for outlier detection [10,20,21,23], but they are mostly either distance based or density based; these are generally not methods specifically designed to deal with the curse of high dimensionality. Two interesting distance-based algorithms are discussed in [20,23], which define outliers by using the distribution of (full-dimensional) distances of the other points to a given point. This kind of measure is naturally susceptible to the dimensionality curse. For example, consider the definition by Knorr and Ng [20]: A point p in a data set is an outlier with respect to the parameters k and λ if no more than k points in the data set are at a distance λ or less from p.
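For concreteness, this definition can be checked naively as follows. The sketch below is our own illustration (not the implementation of [20]); the function name and the use of NumPy for the full-dimensional distances are our own choices.

import numpy as np

def db_outliers(X, k, lam):
    # A point p is an outlier w.r.t. (k, lam) if no more than k other
    # points of the data set lie within distance lam of p (Knorr and Ng [20]).
    outliers = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)   # full-dimensional distances
        if np.sum(dists <= lam) - 1 <= k:          # exclude the point itself
            outliers.append(i)
    return outliers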
As pointed out in [23], this method is sensitive to the use of the parameter λ, which is hard to figure out a priori. In addition, when the dimensionality increases, it becomes increasingly difficult to pick λ, since most of the points are likely to lie in a thin shell about any point [9]. Thus, if we pick too small a λ, then all points are outliers, whereas if we pick too large a λ, then no point is an outlier. This means that a user would need to pick λ to a very high degree of accuracy in order to find a modest number of points that can then be defined as outliers. Aside from this, the data in real applications are very noisy, and the abnormal deviations may be embedded in some lower-dimensional subspace that cannot be determined by the spreading-out behavior [9] of the data in full dimensionality. The algorithm also does not scale well to high dimensions. Consequently, the work in [23] discusses the following definition for an outlier: Given k and n, a point p is an outlier if the distance to its k-th nearest neighbor is smaller than the corresponding value for no more than n − 1 other points.
Although the definition in [23] has some advantages over that provided in [20], it is again not specifically designed to work for high-dimensional problems. In fact, it has been indicated in [23] that by using fewer features for a given run, more interesting outliers were obtained on the NBA98 basketball statistics database. This was again because the data often got spread out uniformly with increasing dimensionality. Another interesting recent technique finds outliers based on their local density [10], particularly with respect to the densities of local neighborhoods. This technique has some advantages in accounting for local levels of skew and abnormality in data collections. To compute the outlier factor of a point, the method in [10] computes the local reachability density of a point o by using the average smoothed distances to a certain number of points in the locality of o. Unfortunately, this is again a problem in high dimensionality, where the concept of locality becomes difficult to define because of the sparsity of the data. In order to use the concept of local density, we need a meaningful concept of distance for sparse high-dimensional data; if this does not exist, then the outliers found are unlikely to be very useful.
Thus the techniques proposed in [10,20,23] all try to define outliers based on distances in full-dimensional space in one way or another. The sparsity of the data in high dimensionality [9] can be interpreted slightly differently to infer that each point is as good an outlier as any other in high-dimensional space. This is because if all pairs of points are almost equidistant [9], then meaningful clusters cannot be found in the data [2,3,6,11]; similarly, it is difficult to detect abnormal deviations.
For problems such as clustering and similarity search, it has been shown [1–4,6,11,17] that by examining the behavior of the data in subspaces, it is possible to design more meaningful clusters that are specific to the particular subspace in question. This is because different localities of the data are dense with respect to different subsets of attributes. By defining clusters that are specific to particular projections of the data, it is possible to design more effective techniques for finding clusters. The same insight holds for outliers, because in typical applications such as credit card fraud, only the subset of attributes actually affected by the abnormality of the activity is likely to be useful in detecting the behavior.
In order to explain our point a little better, let us consider the example illustrated in Fig. 1, which shows several two-dimensional cross sections of a very high-dimensional data set. It is quite likely that for high-dimensional data, many of the cross sections may be structured, whereas others may be more noisy. For example, the points A and B show abnormal behavior in views 1 and 4 of the data. In the other views, these points show average behavior. In the context of a credit card fraud application, the points A and B may correspond to different kinds of fraudulent behavior, yet they may show average behavior when distances are measured in all the dimensions. Thus, by using full-dimensional distance measures, it would be more difficult to detect points that are outliers, because of the averaging behavior of the noisy and irrelevant dimensions. Furthermore, it is impossible to prune off specific features a priori, since different points (such as A and B) may show different kinds of abnormal patterns, each of which uses different features or views.
Fig. 1. Illustrations of outliers in various views of the data. The two-dimensional views 1 and 4 expose the outliers A and B; views 2 and 3 do not. Full-dimensional measures become increasingly susceptible to the sparsity and noise effects in high dimensionality
Thus the problem of outlier detection becomes increasingly difficult for very high-dimensional data sets, just as any of the other problems in the literature such as clustering, indexing, classification, or similarity search. Previous work on outlier detection has not focused on the high-dimensionality aspect of the problem and has used methods that are more applicable to low-dimensional problems, relying on relatively straightforward proximity measures [10,20,23]. On the other hand, we note that most practical data-mining applications are likely to arise in the context of a very large number of features. In this paper, we focus on the effects of high dimensionality on the problem of outlier detection. Recent work [21] has discussed some of the concepts of defining the intensional knowledge that characterizes distance-based outliers in terms of subsets of attributes. Unfortunately, this technique was not intended for high-dimensional data, and its complexity increases exponentially with dimensionality. As the empirical results in [21] show, even for the relatively small dimensionality of 4, the technique is highly computation intensive. For even slightly higher dimensionalities, the technique is infeasible from a computational standpoint.
In this paper, we discuss a new technique for outlier detection that finds outliers by observing the density distributions of projections from the data. Intuitively speaking, this new definition considers a point to be an outlier if it is present in a local region of abnormally low density in some lower-dimensional projection.
1.1 Defining outliers in lower-dimensional projections
The idea is to define outliers for the data by looking at those projections of the data that have abnormally low density. Thus our first step is to identify and mine those patterns whose presence is so abnormally low that it cannot be justified by randomness. This is important since we value outlier patterns not for their noise value but for their deviation value. Once such patterns have been identified, the outliers are defined as those records in which such patterns are present. An interesting observation is that such lower-dimensional projections can be mined even in data sets that have missing attribute values [22]. This is quite useful for many real applications in which feature extraction is a difficult process and full feature descriptions often do not exist. In such cases, the only difference is that, for a given projection, only those points that are fully specified in that projection need to be used.
1.2 Defining abnormal lower-dimensional projections

In order to find such abnormal lower-dimensional projections, we need to define and characterize what we mean by an abnormal lower-dimensional projection. An abnormal lower-dimensional projection is one in which the density of the data is exceptionally lower than average.
In order to find such projections, we first perform a grid discretization of the data. Each attribute of the data is divided into φ ranges. The ranges are created on an equidepth basis; thus each range contains a fraction f = 1/φ of the records. The reason for using equidepth ranges as opposed to equiwidth ranges is that different localities of the data have different densities; we would like to find outliers while normalizing for this factor. These ranges form the units of locality that we will use to define low-dimensional projections that have unreasonably sparse regions.
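For illustration, such an equidepth grid can be built from empirical quantiles. The following sketch is our own rendering (function name and NumPy conventions are our choices, not the paper's implementation); with ties in the data, the ranges are only approximately equidepth.

import numpy as np

def equidepth_grid(X, phi):
    # Discretize each attribute into phi equidepth ranges, so each range
    # holds roughly a fraction f = 1/phi of the records.  Returns an
    # integer grid of the same shape as X with range indices 0..phi-1.
    N, d = X.shape
    grid = np.zeros((N, d), dtype=int)
    for j in range(d):
        # Quantile boundaries normalize for locally varying densities
        cuts = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        grid[:, j] = np.searchsorted(cuts, X[:, j], side='right')
    return grid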
Let us consider a k-dimensional cube that is created by picking grid ranges from k different dimensions. If the attributes were statistically independent, the expected fraction of the records in that region would be equal to f^k. Of course, the data are far from statistically independent; therefore, the actual distribution of points in a cube will often differ significantly from this average behavior. It is precisely those deviations that fall abnormally below the average that are useful for the purpose of outlier detection.
Let us assume that there are a total of N points in the database and that the dimensionality is d. If the data were uniformly distributed, then the presence or absence of any point in a k-dimensional cube would be a Bernoulli random variable with probability f^k. The expected number and standard deviation of the points in a k-dimensional cube are then given by N·f^k and √(N·f^k·(1 − f^k)), respectively. Also, under the assumption of uniformly distributed data, the number of points in a cube can be approximated by a normal distribution. Let n(D) be the number of points in a k-dimensional cube D. Then we calculate the sparsity coefficient S(D) of the cube D as follows:

S(D) = (n(D) − N·f^k) / √(N·f^k·(1 − f^k)).    (1)
Only sparsity coefficients that are negative indicate cubes in which the presence of points is significantly lower than expected. Note that if n(D) is assumed to fit a normal distribution, then the normal distribution tables can be used to quantify the probabilistic level of significance of a point's deviation from average behavior under the a priori assumption of uniformly distributed data. In general, the uniform distribution assumption does not hold; however, the sparsity coefficient still provides a reasonable approximation to the level of significance for a given projection. We also note that we search only cubes that are nonempty in order to find outliers; cubes that are empty are considered infeasible. From an implementation perspective, this is achieved by setting the sparsity coefficients of such cubes to the very high value of 10^6.
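To make Eq. 1 concrete: with N = 100,000 records, φ = 10 (so f = 0.1), and k = 3, a cube is expected to hold N·f^k = 100 points with standard deviation √(100·(1 − 0.001)) ≈ 10, so a cube holding only 40 points has S(D) ≈ −6. The sketch below (our own naming) computes Eq. 1 directly, including the 10^6 sentinel for empty cubes described above.

import math

def sparsity_coefficient(n_D, N, f, k):
    # Sparsity coefficient S(D) of Eq. 1 for a k-dimensional cube D
    # containing n_D points, out of N records with range fraction f = 1/phi.
    if n_D == 0:
        return 1e6  # empty cubes are infeasible for outlier detection
    expected = N * f**k
    std = math.sqrt(N * f**k * (1 - f**k))
    return (n_D - expected) / std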
1.3 A note on the nature of the problem
At this stage we would like to make a comment on the nature of the problem of finding the most sparse k-dimensional cubes in the data. The nature of this problem is such that there are no upward- or downward-closed properties in the set of dimensions (along with the associated ranges) that are unusually sparse.¹ This is not unexpected: unlike problems such as large itemset detection [7], where one is looking for large aggregate patterns, the problem of finding subsets of dimensions that are sparsely populated is akin to finding a needle in a haystack, since one is looking for patterns that rarely exist. Furthermore, it may often be the case that even though particular regions may be well populated on certain sets of dimensions, they may be very sparsely populated when these dimensions are combined. (For example, there may be a large number of people below the age of 20 and a large number of people with diabetes, but very few with both.) From the perspective of an outlier detection technique, a person below the age of 20 with diabetes is a very interesting record; however, it is very difficult to find such a pattern using structured search methods. Therefore, the best projections are often created by an a priori unknown combination of dimensions, which cannot be determined by looking at any lower-dimensional projection. One solution is to change the measure in order to force better closure or pruning properties; however, this can substantially worsen the quality of the solution by forcing the choice of the measure to be driven by algorithmic considerations. In general, it is not possible to predict the behavior of the data when two sets of dimensions are combined; therefore, the best qualitative option is to develop search methods that can identify such hidden combinations of dimensions. In order to search the exponentially increasing space of possible projections, we borrow ideas from a class of evolutionary search methods to create an efficient and effective algorithm for the outlier detection problem.
2 Evolutionary algorithms for outlier detection
In this section, we will discuss the algorithms that are useful for outlier detection in high-dimensional problems. A natural class of methods for outlier detection is the class of naive brute-force techniques in which all subsets of dimensions are examined for possible patterns that are sparse. These patterns are then used to determine the points that are possibly outliers. We discuss two algorithms for outlier detection: a naive brute-force algorithm that is very slow at finding the best patterns because of its exhaustive search of the entire space, and a much faster evolutionary algorithm that is able to quickly find hidden combinations of dimensions in which the data are sparse. We assume that one of the inputs to the algorithm is the dimensionality k of the projections that are used to determine the outliers. Aside from this, the algorithm uses the number m of projections to be determined as an input parameter.
The brute-force algorithm is illustrated in Fig. 2. The algorithm works by examining all possible sets of k-dimensional candidate projections (together with corresponding grid ranges) and retaining the m projections that have the most negative sparsity coefficients.
¹ An upward-closed pattern is one in which all supersets of the pattern are also valid patterns. A downward-closed set of patterns is one in which all subsets of the pattern are also members of the set.
Algorithm BruteForce(Number: m, Dimensionality: k)
begin
  R1 = Q1 = Set of all d·φ ranges;
  for i = 2 to k do
  begin
    Ri = Ri−1 ⊕ Q1;
  end;
  Determine sparsity coefficients of all elements in Rk;
  Rk = Set of m elements in Rk with most negative sparsity coefficients;
  O = Set of points covered by Rk;
  return(Rk, O);
end
Fig. 2. The brute-force technique
In order to actually determine the candidate projections, the method uses a bottom-up recursive algorithm in which the (k+1)-dimensional candidate cubes are determined by concatenating the candidate k-dimensional projections with all d·φ possible sets of one-dimensional projections and their grid ranges (denoted by Q1). The concatenation operation is illustrated in Fig. 2 by ⊕. Note that, for a given cube, it only makes sense to concatenate with grid ranges from dimensions not included in the current projection in order to create a higher-dimensional projection. The candidate set of dimensionality i is denoted by Ri. At termination, the projections with the most negative sparsity coefficients in Rk are retained. The set of points in the data that match the corresponding ranges for these projections is reported as the final set of outliers.
As we shall see in later sections, the algorithm discussed in Fig. 2 is computationally untenable for problems of even modest complexity. This is because of the exponentially increasing search space of the outlier detection problem. To overcome this, we will illustrate an innovative use of evolutionary search techniques for the outlier detection problem.
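For reference, a direct rendering of the brute-force enumeration is sketched below (our own illustration, reusing sparsity_coefficient from the sketch in Sect. 1.2 and a NumPy grid as before). It scores all C(d,k)·φ^k candidate cubes over the discretized grid, which is exactly the exponential cost that motivates the evolutionary approach.

from itertools import combinations, product

def brute_force(grid, phi, k, m):
    # grid: N x d integer array of equidepth range indices (0..phi-1)
    N, d = grid.shape
    f = 1.0 / phi
    scored = []
    for dims in combinations(range(d), k):            # every k-subset of dimensions
        for ranges in product(range(phi), repeat=k):  # every grid range per dimension
            n_D = sum(1 for row in grid
                      if all(row[dim] == r for dim, r in zip(dims, ranges)))
            scored.append((sparsity_coefficient(n_D, N, f, k), dims, ranges))
    scored.sort(key=lambda t: t[0])  # most negative sparsity coefficients first
    return scored[:m]                # the m sparsest cubes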
2.1 An overview of evolutionary search techniques

Evolutionary algorithms [18] are methods that imitate the process of organic evolution [12] in order to solve parameter optimization problems. The fundamental idea underlying the Darwinian view of evolution is that, in nature, resources are scarce, and this leads to a competition among the various species. As a result, all the species undergo a selection mechanism in which only the fittest survive. Consequently, the fitter individuals tend to mate with each other more often, resulting in still better individuals. At the same time, once in a while, nature also throws in a variant by the process of mutation, so as to ensure a sufficient amount of diversity among the species, and hence also a greater scope for improvement. The basic idea behind an evolutionary search technique is similar; every solution to an optimization problem can be "disguised" as an individual in an evolutionary system. The measure of fitness of this "individual" is equal to the objective function value of the corresponding solution, and the other species that this individual has to compete with are a group of other solutions to the problem; thus, unlike optimization methods such as hill climbing or simulated annealing [19], evolutionary algorithms work with an entire population of current solutions rather than a single solution. This is one of the reasons why evolutionary algorithms are more effective as search methods than hill-climbing, random search, or simulated annealing techniques; they use the essence of all these methods in conjunction with the recombination of multiple solutions in a population. Appropriate operations are defined in order to imitate the recombination and mutation processes, and the simulation is complete.
Each feasible solution to the problem being solved by an evolutionary technique is defined as an individual. This feasible solution is in the form of a string and is the genetic representation of the individual. The process of converting feasible solutions of the problem into strings that the algorithm can use is called coding. For example, a possible coding for a feasible solution to the traveling salesman problem could be a string containing a sequence of numbers representing the order in which the salesman visits the cities. The genetic material at each locus on the string is referred to as a gene, and the possible values that it can take on are the alleles. The measure of fitness of an individual is evaluated by the fitness function, which takes as its argument the string representation of the individual and returns a value indicating its fitness. The fitness value of an individual is analogous to the objective function value of the solution; the better the objective function value, the better the fitness value.
As the process of evolution progresses, all the individuals in the population tend to become more and more similar to each other genetically. This phenomenon is referred to as convergence. De Jong [13] defined convergence of a gene as the stage at which 95% of the population has the same value for that gene. The population is said to have converged when all genes have converged.
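A minimal check of this criterion for a population of equal-length strings might look as follows (an illustrative helper of our own, not from the paper).

def has_converged(population, threshold=0.95):
    # De Jong convergence: a gene has converged when at least 95% of the
    # population shares the same allele at that position; the population
    # has converged when this holds at every position.
    size = len(population)
    for genes in zip(*population):
        if max(genes.count(a) for a in set(genes)) < threshold * size:
            return False
    return True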
The application of such evolutionary search procedures should be based on a good understanding of the problem at hand. Typically, black-box GA software on straightforward string encodings does not work very well [5], and it is often a nontrivial task to design the proper search methods, such as recombinations, selections, and mutations, that work well for a given problem. In the next section, we will discuss the details of the evolutionary search procedures that work effectively for the outlier detection problem.
2.2 A description of the outlier detection technique
In this section, we will discuss the application of the search technique to the outlier detection problem. Let us assume that the grid range for the i-th dimension is denoted by m_i. Then the value of m_i can take on any of the values 1 through φ, or it can take on the value ∗, which denotes a "don't care". Thus, there are a total of φ + 1 values that the dimension m_i can take on. For example, consider a four-dimensional problem with φ = 10. Then, one possible solution to the problem is given by the string *3*9. In this case, the ranges for the second and fourth dimensions are identified, whereas the first and third are left as "don't cares". The fitness of the corresponding solution may be computed using the sparsity coefficient discussed earlier.
Algorithm EvolutionaryOutlierSearch(Number: m, Dimensionality: k)
begin
  S = Initial seed population of p strings;
  BestSet = null;
  while not (termination criterion) do
  begin
    S = Selection(S);
    S = CrossOver(S);
    S = Mutation(S, p1, p2);
    Update BestSet to be the m solutions in BestSet ∪ S with most negative sparsity coefficients;
  end;
  O = Set of data points covered by BestSet;
  return(BestSet, O);
end
Fig. 3. Evolutionary framework for outlier detection
The evolutionary search technique starts with a population of p random solutions and iteratively uses the processes of selection, crossover, and mutation to perform a combination of hill climbing, solution recombination, and random search over the space of possible projections. The process continues until the population converges to a global optimum; we use the De Jong [13] convergence criterion to determine the termination condition. At each stage of the algorithm, the m best projection solutions (those with the most negative sparsity coefficients) are tracked. At the end of the algorithm, these solutions are reported as the best projections in the data. The overall procedure for the genetic algorithm is illustrated in Fig. 3. The population of solutions in any given iteration is denoted by S. This set S is refined in subsequent iterations of the algorithm, and the best set of projections found so far is always maintained by the evolutionary algorithm.
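To make the string encoding and its fitness concrete, the sketch below (our own illustration) evaluates a candidate such as *3*9 against the discretized grid, reusing sparsity_coefficient from Sect. 1.2. More negative values indicate fitter, i.e., more abnormally sparse, solutions.

def fitness(solution, grid, phi):
    # solution: one symbol per dimension, e.g. ['*', 3, '*', 9], where '*'
    # is the don't-care allele and ranges are numbered 1..phi as in the text.
    N = len(grid)
    fixed = [(dim, r) for dim, r in enumerate(solution) if r != '*']
    n_D = sum(1 for row in grid
              if all(row[dim] == r - 1 for dim, r in fixed))  # 0-based grid indices
    return sparsity_coefficient(n_D, N, 1.0 / phi, len(fixed))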
• Selection: Several alternatives are possible [15] for selection in an evolutionary algorithm; the most popularly known ones are rank selection and fitness-proportional selection. The idea is to replicate copies of a solution by ordering the solutions by rank and biasing the population in favor of the higher-ranked solutions. This is called rank selection and is often more stable than straightforward fitness-proportional methods that sample the set of solutions in proportion to the actual value of the objective function. This strategy of biasing the population in favor of fitter strings, in conjunction with effective solution recombination, creates newer sets of children strings that are more likely to be fit. This results in a global hill climbing of an entire population of solutions. For the particular case of our implementation, we used a roulette wheel mechanism, where the probability of sampling a string from the population was proportional to p − r(i), where p is the total number of strings and r(i) is the rank of the i-th string. Note that the strings are ordered in such a way that the strings with the most negative sparsity coefficients occur first. Thus, the selection mechanism ensures that the new population is biased in such a way that the most abnormally sparse solutions are likely to have a greater number of copies. The overall selection algorithm is illustrated in Fig. 4.
Algorithm Selection(S)
begin
  Compute the sparsity coefficient of each solution in the population S;
  Let r(i) be the rank of solution i in order of sparsity coefficient (most negative occurs first);
  S' = null;
  for i = 1 to p do
  begin
    Roll a die whose i-th side has probability proportional to p − r(i);
    Add the solution corresponding to the side that comes up to S';
  end;
  Replace S by S';
  return(S);
end
Fig. 4. Selection criterion for the genetic algorithm
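A sketch of this roulette wheel in Python follows (our own rendering of Fig. 4, with random.choices standing in for the die roll).

import random

def selection(population, scores):
    # Rank-based roulette wheel: sort so the most negative sparsity
    # coefficients occur first, then sample the string of rank r(i)
    # with probability proportional to p - r(i).
    p = len(population)
    ranked = [s for _, s in sorted(zip(scores, population), key=lambda t: t[0])]
    weights = [p - rank for rank in range(1, p + 1)]  # rank 1 gets weight p-1
    return random.choices(ranked, weights=weights, k=p)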
• Crossover: Since the crossover technique is a key method in evolutionary algorithms for finding optimum combinations of solutions, it is important to implement this operation carefully in order for the overall method to work effectively. We will first discuss the natural two-point crossover mechanism used in evolutionary algorithms and show how to suitably modify it for the outlier detection problem.
Unbiased two-point crossover: The standard procedure in evolutionary algorithms is to use uniform two-point crossover in order to create the recombinant children strings. The two-point crossover mechanism works by determining a point in the string at random, called the crossover point, and exchanging the segments to the right of this point. For example, consider the strings 3*2*1 and 1*33*. If the crossover is performed after the third position, then the two resulting strings are 3*23* and 1*3*1. Note that in this case, both the parent and children strings correspond to three-dimensional projections in five-dimensional data. However, if the crossover occurred after the fourth position, then the two resulting children strings would be 3*2** and 1*331. These correspond to two-dimensional and four-dimensional projections, respectively. In general, since the evolutionary algorithm only finds projections of a given dimensionality in a run, this kind of crossover mechanism often creates infeasible solutions after the crossover process. Such solutions are discarded in subsequent iterations, since they are assigned very low fitness values. In general, evolutionary algorithms work very poorly when the recombination process cannot create solutions that are of high quality or that are viable in terms of feasibility. To address this, we create an optimized crossover process that takes both factors into account.
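The feasibility problem described above can be seen directly in a short sketch (again our own illustration, not the paper's code).

import random

def unbiased_crossover(s1, s2):
    # Exchange the segments to the right of a random crossover point.
    pos = random.randint(1, len(s1) - 1)
    return s1[:pos] + s2[pos:], s2[:pos] + s1[pos:]

# unbiased_crossover('3*2*1', '1*33*') with the point after position 4
# returns '3*2**' and '1*331': a two-dimensional and a four-dimensional
# projection from two three-dimensional parents, i.e., infeasible
# solutions when the run is searching for k = 3.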
Since it is clear that the dimensionality of the projection needs to be kept in mind while performing a crossover operation, it is desirable that the two children obtained after solution recombination also correspond to a k-dimensional projection. In order to achieve this goal, we need to classify the different positions in the string into three types. This classification is specific to a given pair of strings s1 and s2.
武汉重启Type I:Both strings have a“don’t care”.
Type II:Neither has a“don’t care”.Let us assume that there are k ≤k positions of this type.