The VLDB Journal (2005) 14: 211–221 / Digital Object Identifier (DOI) 10.1007/s00778-004-0125-5

An effective and efficient algorithm for high-dimensional outlier detection

Charu C. Aggarwal, Philip S. Yu

IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532, USA

Edited by R. Ng. Received: November 19, 2002 / Accepted: February 6, 2004
Published online: August 19, 2004 – © Springer-Verlag 2004
Abstract. The outlier detection problem has important applications in the fields of fraud detection, network robustness analysis, and intrusion detection. Such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse, and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.
Keywords: Data mining – High-dimensional spaces – Outlier detection
1 Introduction
An outlier is defined as a data point that is very different from the rest of the data based on some measure. Such a data point often contains useful information on abnormal behavior in the system characterized by the data. Outlier detection techniques find applications in credit card fraud, network intrusion detection, financial applications, and marketing. This problem typically arises in the context of very high-dimensional data sets. Much of the recent work on finding outliers uses methods that make implicit assumptions of relatively low dimensionality of the data. These methods work quite poorly when the dimensionality is high and the data become sparse.
Many data-mining algorithms in the literature find outliers as a by-product of clustering algorithms [2,3,6,16,24]. However, these techniques define outliers as points that do not lie in clusters; thus, they implicitly define outliers as the background noise in which the clusters are embedded. Starting with the work in [8], recent literature [10,20,21,23] defines outliers as points that are neither a part of a cluster nor a part of the background noise; rather, they are specifically points that behave very differently from the norm. Such outliers are more useful because of their value in determining behavior that deviates significantly from average behavior. In many applications (e.g., network intrusion detection), such records may provide guidance in discovering important anomalies in the data. Such points are also referred to as strong outliers in the work discussed in [21]. In this paper, we will develop algorithms that generate only outliers that are based on their deviation value.
Many algorithms have been proposed in recent years for outlier detection [10,20,21,23], but they are mostly either distance based or density based; these are generally not methods specifically designed to deal with the curse of high dimensionality. Two interesting distance-based algorithms are discussed in [20,23], which define outliers by using the distribution of (full-dimensional) distances of the other points to a given point. This kind of measure is naturally susceptible to the dimensionality curse. For example, consider the definition by Knorr and Ng [20]: A point p in a data set is an outlier with respect to the parameters k and λ if no more than k points in the data set are at a distance λ or less from p.
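For concreteness, this definition can be checked naively as follows. The sketch below is our own illustration (not the implementation of [20]); the function name and the use of NumPy for the full-dimensional distances are our own choices.

import numpy as np

def db_outliers(X, k, lam):
    # A point p is an outlier w.r.t. (k, lam) if no more than k other
    # points of the data set lie within distance lam of p (Knorr and Ng [20]).
    outliers = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)   # full-dimensional distances
        if np.sum(dists <= lam) - 1 <= k:          # exclude the point itself
            outliers.append(i)
    return outliers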
As pointed out in [23], this method is sensitive to the use of the parameter λ, which is hard to figure out a priori. In addition, when the dimensionality increases, it becomes increasingly difficult to pick λ, since most of the points are likely to lie in a thin shell about any point [9]. Thus, if we pick too small a λ, then all points are outliers, whereas if we pick too large a λ, then no point is an outlier. This means that a user would need to pick λ to a very high degree of accuracy in order to find a modest number of points that can then be defined as outliers. Aside from this, the data in real applications are very noisy, and the abnormal deviations may be embedded in some lower-dimensional subspace that cannot be determined by the spreading-out behavior [9] of the data in full dimensionality. The algorithm also does not scale well to high dimensions. Consequently, the work in [23] discusses the following definition for an outlier: Given k and n, a point p is an outlier if the distance to its k-th nearest neighbor is smaller than the corresponding value for no more than n − 1 other points.
Although the definition in [23] has some advantages over that provided in [20], it is again not specifically designed to work for high-dimensional problems. In fact, it has been indicated in [23] that by using fewer features for a given run, more interesting outliers were obtained on the NBA98 basketball statistics database. This was again because the data often got spread out uniformly with increasing dimensionality. Another interesting recent technique finds outliers based on their local density [10], particularly with respect to the densities of local neighborhoods. This technique has some advantages in accounting for local levels of skew and abnormality in data collections. To compute the outlier factor of a point, the method in [10] computes the local reachability density of a point o by using the average smoothed distances to a certain number of points in the locality of o. Unfortunately, this is again a problem in high dimensionality, where the concept of locality becomes difficult to define because of the sparsity of the data. In order to use the concept of local density, we need a meaningful concept of distance for sparse high-dimensional data; if this does not exist, then the outliers found are unlikely to be very useful.
Thus the techniques proposed in [10,20,23] all try to define outliers based on distances in full-dimensional space in one way or another. The sparsity of the data in high dimensionality [9] can be interpreted slightly differently to infer that each point is as good an outlier as any other in high-dimensional space. This is because if all pairs of points are almost equidistant [9], then meaningful clusters cannot be found in the data [2,3,6,11]; similarly, it is difficult to detect abnormal deviations.
For problems such as clustering and similarity search, it has been shown [1–4,6,11,17] that by examining the behavior of the data in subspaces, it is possible to design more meaningful clusters that are specific to the particular subspace in question. This is because different localities of the data are dense with respect to different subsets of attributes. By defining clusters that are specific to particular projections of the data, it is possible to design more effective techniques for finding clusters. The same insight holds for outliers, because in typical applications such as credit card fraud, only the subset of attributes actually affected by the abnormality of the activity is likely to be useful in detecting the behavior.
In order to explain our point a little better, let us consider the example illustrated in Fig. 1, which shows several two-dimensional cross sections of a very high-dimensional data set. It is quite likely that for high-dimensional data, many of the cross sections may be structured, whereas others may be more noisy. For example, the points A and B show abnormal behavior in views 1 and 4 of the data. In the other views, these points show average behavior. In the context of a credit card fraud application, the points A and B may correspond to different kinds of fraudulent behavior, yet they may show average behavior when distances are measured in all the dimensions. Thus, by using full-dimensional distance measures, it would be more difficult to detect points that are outliers, because of the averaging behavior of the noisy and irrelevant dimensions. Furthermore, it is impossible to prune off specific features a priori, since different points (such as A and B) may show different kinds of abnormal patterns, each of which uses different features or views.
Fig. 1. Illustrations of outliers in various views of the data. The two-dimensional views 1 and 4 expose the outliers A and B; views 2 and 3 do not. Full-dimensional measures become increasingly susceptible to the sparsity and noise effects in high dimensionality
Thus the problem of outlier detection becomes increasingly difficult for very high-dimensional data sets, just as any of the other problems in the literature such as clustering, indexing, classification, or similarity search. Previous work on outlier detection has not focused on the high-dimensionality aspect of the problem and has used methods that are more applicable to low-dimensional problems, relying on relatively straightforward proximity measures [10,20,23]. On the other hand, we note that most practical data-mining applications are likely to arise in the context of a very large number of features. In this paper, we focus on the effects of high dimensionality on the problem of outlier detection. Recent work [21] has discussed some of the concepts of defining the intensional knowledge that characterizes distance-based outliers in terms of subsets of attributes. Unfortunately, this technique was not intended for high-dimensional data, and its complexity increases exponentially with dimensionality. As the empirical results in [21] show, even for the relatively small dimensionality of 4, the technique is highly computation intensive. For even slightly higher dimensionalities, the technique is infeasible from a computational standpoint.
In this paper, we discuss a new technique for outlier detection that finds outliers by observing the density distributions of projections from the data. Intuitively speaking, this new definition considers a point to be an outlier if it is present in a local region of abnormally low density in some lower-dimensional projection.
1.1 Defining outliers in lower-dimensional projections
The idea is to define outliers for the data by looking at those projections of the data that have abnormally low density. Thus our first step is to identify and mine those patterns whose presence is so abnormally low that it cannot be justified by randomness. This is important since we value outlier patterns not for their noise value but for their deviation value. Once such patterns have been identified, the outliers are defined as those records in which such patterns are present. An interesting observation is that such lower-dimensional projections can be mined even in data sets that have missing attribute values [22]. This is quite useful for many real applications in which feature extraction is a difficult process and full feature descriptions often do not exist. In such cases, the only difference is that, for a given projection, only those points that are fully specified in that projection need to be used.
1.2 Defining abnormal lower-dimensional projections

In order to find such abnormal lower-dimensional projections, we need to define and characterize what we mean by an abnormal lower-dimensional projection. An abnormal lower-dimensional projection is one in which the density of the data is exceptionally lower than average.
In order to find such projections, we first perform a grid discretization of the data. Each attribute of the data is divided into φ ranges. The ranges are created on an equidepth basis; thus each range contains a fraction f = 1/φ of the records. The reason for using equidepth ranges as opposed to equiwidth ranges is that different localities of the data have different densities; we would like to find outliers while normalizing for this factor. These ranges form the units of locality that we will use to define low-dimensional projections that have unreasonably sparse regions.
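For illustration, such an equidepth grid can be built from empirical quantiles. The following sketch is our own rendering (function name and NumPy conventions are our choices, not the paper's implementation); with ties in the data, the ranges are only approximately equidepth.

import numpy as np

def equidepth_grid(X, phi):
    # Discretize each attribute into phi equidepth ranges, so each range
    # holds roughly a fraction f = 1/phi of the records.  Returns an
    # integer grid of the same shape as X with range indices 0..phi-1.
    N, d = X.shape
    grid = np.zeros((N, d), dtype=int)
    for j in range(d):
        # Quantile boundaries normalize for locally varying densities
        cuts = np.quantile(X[:, j], np.linspace(0, 1, phi + 1)[1:-1])
        grid[:, j] = np.searchsorted(cuts, X[:, j], side='right')
    return grid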
Let us consider a k-dimensional cube that is created by picking grid ranges from k different dimensions. If the attributes were statistically independent, the expected fraction of the records in that region would be equal to f^k. Of course, the data are far from statistically independent; therefore, the actual distribution of points in a cube will often differ significantly from this average behavior. It is precisely those deviations that fall abnormally below the average that are useful for the purpose of outlier detection.
Let us assume that there are a total of N points in the database and that the dimensionality is d. If the data were uniformly distributed, then the presence or absence of any point in a k-dimensional cube would be a Bernoulli random variable with probability f^k. The expected number and standard deviation of the points in a k-dimensional cube are then given by N·f^k and √(N·f^k·(1 − f^k)), respectively. Also, under the assumption of uniformly distributed data, the number of points in a cube can be approximated by a normal distribution. Let n(D) be the number of points in a k-dimensional cube D. Then we calculate the sparsity coefficient S(D) of the cube D as follows:

S(D) = (n(D) − N·f^k) / √(N·f^k·(1 − f^k)).    (1)
Only sparsity coefficients that are negative indicate cubes in which the presence of points is significantly lower than expected. Note that if n(D) is assumed to fit a normal distribution, then the normal distribution tables can be used to quantify the probabilistic level of significance of a point's deviation from average behavior under the a priori assumption of uniformly distributed data. In general, the uniform distribution assumption does not hold; however, the sparsity coefficient still provides a reasonable approximation to the level of significance for a given projection. We also note that we search only cubes that are nonempty in order to find outliers; cubes that are empty are considered infeasible. From an implementation perspective, this is achieved by setting the sparsity coefficients of such cubes to the very high value of 10^6.
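To make Eq. 1 concrete: with N = 100,000 records, φ = 10 (so f = 0.1), and k = 3, a cube is expected to hold N·f^k = 100 points with standard deviation √(100·(1 − 0.001)) ≈ 10, so a cube holding only 40 points has S(D) ≈ −6. The sketch below (our own naming) computes Eq. 1 directly, including the 10^6 sentinel for empty cubes described above.

import math

def sparsity_coefficient(n_D, N, f, k):
    # Sparsity coefficient S(D) of Eq. 1 for a k-dimensional cube D
    # containing n_D points, out of N records with range fraction f = 1/phi.
    if n_D == 0:
        return 1e6  # empty cubes are infeasible for outlier detection
    expected = N * f**k
    std = math.sqrt(N * f**k * (1 - f**k))
    return (n_D - expected) / std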
1.3 A note on the nature of the problem
At this stage we would like to make a comment on the nature of the problem of finding the most sparse k-dimensional cubes in the data. The nature of this problem is such that there are no upward- or downward-closed properties in the set of dimensions (along with the associated ranges) that are unusually sparse.¹ This is not unexpected: unlike problems such as large itemset detection [7], where one is looking for large aggregate patterns, the problem of finding subsets of dimensions that are sparsely populated is akin to finding a needle in a haystack, since one is looking for patterns that rarely exist. Furthermore, it may often be the case that even though particular regions may be well populated on certain sets of dimensions, they may be very sparsely populated when these dimensions are combined. (For example, there may be a large number of people below the age of 20 and a large number of people with diabetes, but very few with both.) From the perspective of an outlier detection technique, a person below the age of 20 with diabetes is a very interesting record; however, it is very difficult to find such a pattern using structured search methods. Therefore, the best projections are often created by an a priori unknown combination of dimensions, which cannot be determined by looking at any lower-dimensional projection. One solution is to change the measure in order to force better closure or pruning properties; however, this can substantially worsen the quality of the solution by forcing the choice of the measure to be driven by algorithmic considerations. In general, it is not possible to predict the behavior of the data when two sets of dimensions are combined; therefore, the best qualitative option is to develop search methods that can identify such hidden combinations of dimensions. In order to search the exponentially increasing space of possible projections, we borrow ideas from a class of evolutionary search methods to create an efficient and effective algorithm for the outlier detection problem.
2 Evolutionary algorithms for outlier detection
In this section, we will discuss the algorithms that are useful for outlier detection in high-dimensional problems. A natural class of methods for outlier detection is the class of naive brute-force techniques in which all subsets of dimensions are examined for possible patterns that are sparse. These patterns are then used to determine the points that are possibly outliers. We discuss two algorithms for outlier detection: a naive brute-force algorithm that is very slow at finding the best patterns because of its exhaustive search of the entire space, and a much faster evolutionary algorithm that is able to quickly find hidden combinations of dimensions in which the data are sparse. We assume that one of the inputs to the algorithm is the dimensionality k of the projections that are used to determine the outliers. Aside from this, the algorithm uses the number m of projections to be determined as an input parameter.
The brute-force algorithm is illustrated in Fig. 2. The algorithm works by examining all possible sets of k-dimensional candidate projections (together with corresponding grid ranges) and retaining the m projections that have the most negative sparsity coefficients.
¹ An upward-closed pattern is one in which all supersets of the pattern are also valid patterns. A downward-closed set of patterns is one in which all subsets of the pattern are also members of the set.
Algorithm BruteForce(Number: m, Dimensionality: k)
begin
  R1 = Q1 = Set of all d·φ ranges;
  for i = 2 to k do
  begin
    Ri = Ri−1 ⊕ Q1;
  end;
  Determine sparsity coefficients of all elements in Rk;
  Rk = Set of m elements in Rk with most negative sparsity coefficients;
  O = Set of points covered by Rk;
  return(Rk, O);
end
Fig. 2. The brute-force technique
In order to actually determine the candidate projections, the method uses a bottom-up recursive algorithm in which the (k+1)-dimensional candidate cubes are determined by concatenating the candidate k-dimensional projections with all d·φ possible sets of one-dimensional projections and their grid ranges (denoted by Q1). The concatenation operation is illustrated in Fig. 2 by ⊕. Note that, for a given cube, it only makes sense to concatenate with grid ranges from dimensions not included in the current projection in order to create a higher-dimensional projection. The candidate set of dimensionality i is denoted by Ri. At termination, the projections with the most negative sparsity coefficients in Rk are retained. The set of points in the data that match the corresponding ranges for these projections is reported as the final set of outliers.
As we shall see in later sections, the algorithm discussed in Fig. 2 is computationally untenable for problems of even modest complexity. This is because of the exponentially increasing search space of the outlier detection problem. To overcome this, we will illustrate an innovative use of evolutionary search techniques for the outlier detection problem.
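For reference, a direct rendering of the brute-force enumeration is sketched below (our own illustration, reusing sparsity_coefficient from the sketch in Sect. 1.2 and a NumPy grid as before). It scores all C(d,k)·φ^k candidate cubes over the discretized grid, which is exactly the exponential cost that motivates the evolutionary approach.

from itertools import combinations, product

def brute_force(grid, phi, k, m):
    # grid: N x d integer array of equidepth range indices (0..phi-1)
    N, d = grid.shape
    f = 1.0 / phi
    scored = []
    for dims in combinations(range(d), k):            # every k-subset of dimensions
        for ranges in product(range(phi), repeat=k):  # every grid range per dimension
            n_D = sum(1 for row in grid
                      if all(row[dim] == r for dim, r in zip(dims, ranges)))
            scored.append((sparsity_coefficient(n_D, N, f, k), dims, ranges))
    scored.sort(key=lambda t: t[0])  # most negative sparsity coefficients first
    return scored[:m]                # the m sparsest cubes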
2.1 An overview of evolutionary search techniques

Evolutionary algorithms [18] are methods that imitate the process of organic evolution [12] in order to solve parameter optimization problems. The fundamental idea underlying the Darwinian view of evolution is that, in nature, resources are scarce, and this leads to a competition among the various species. As a result, all the species undergo a selection mechanism in which only the fittest survive. Consequently, the fitter individuals tend to mate with each other more often, resulting in still better individuals. At the same time, once in a while, nature also throws in a variant by the process of mutation, so as to ensure a sufficient amount of diversity among the species, and hence also a greater scope for improvement. The basic idea behind an evolutionary search technique is similar; every solution to an optimization problem can be "disguised" as an individual in an evolutionary system. The measure of fitness of this "individual" is equal to the objective function value of the corresponding solution, and the other species that this individual has to compete with are a group of other solutions to the problem; thus, unlike optimization methods such as hill climbing or simulated annealing [19], evolutionary algorithms work with an entire population of current solutions rather than a single solution. This is one of the reasons why evolutionary algorithms are more effective as search methods than hill-climbing, random search, or simulated annealing techniques; they use the essence of all these methods in conjunction with the recombination of multiple solutions in a population. Appropriate operations are defined in order to imitate the recombination and mutation processes, and the simulation is complete.
Each feasible solution to the problem being solved by an evolutionary technique is defined as an individual. This feasible solution is in the form of a string and is the genetic representation of the individual. The process of converting feasible solutions of the problem into strings that the algorithm can use is called coding. For example, a possible coding for a feasible solution to the traveling salesman problem could be a string containing a sequence of numbers representing the order in which the salesman visits the cities. The genetic material at each locus on the string is referred to as a gene, and the possible values that it can take on are the alleles. The measure of fitness of an individual is evaluated by the fitness function, which takes as its argument the string representation of the individual and returns a value indicating its fitness. The fitness value of an individual is analogous to the objective function value of the solution; the better the objective function value, the better the fitness value.
As the process of evolution progresses, all the individuals in the population tend to become more and more similar to each other genetically. This phenomenon is referred to as convergence. De Jong [13] defined convergence of a gene as the stage at which 95% of the population has the same value for that gene. The population is said to have converged when all genes have converged.
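A minimal check of this criterion for a population of equal-length strings might look as follows (an illustrative helper of our own, not from the paper).

def has_converged(population, threshold=0.95):
    # De Jong convergence: a gene has converged when at least 95% of the
    # population shares the same allele at that position; the population
    # has converged when this holds at every position.
    size = len(population)
    for genes in zip(*population):
        if max(genes.count(a) for a in set(genes)) < threshold * size:
            return False
    return True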
The application of such evolutionary search procedures should be based on a good understanding of the problem at hand. Typically, black-box GA software on straightforward string encodings does not work very well [5], and it is often a nontrivial task to design the proper search methods, such as recombinations, selections, and mutations, that work well for a given problem. In the next section, we will discuss the details of the evolutionary search procedures that work effectively for the outlier detection problem.
2.2 A description of the outlier detection technique
In this section, we will discuss the application of the search technique to the outlier detection problem. Let us assume that the grid range for the i-th dimension is denoted by m_i. Then the value of m_i can take on any of the values 1 through φ, or it can take on the value ∗, which denotes a "don't care". Thus, there are a total of φ + 1 values that the dimension m_i can take on. For example, consider a four-dimensional problem with φ = 10. Then, one possible solution to the problem is given by the string *3*9. In this case, the ranges for the second and fourth dimensions are identified, whereas the first and third are left as "don't cares". The fitness of the corresponding solution may be computed using the sparsity coefficient discussed earlier.
Algorithm EvolutionaryOutlierSearch(Number: m, Dimensionality: k)
begin
  S = Initial seed population of p strings;
  BestSet = null;
  while not (termination criterion) do
  begin
    S = Selection(S);
    S = CrossOver(S);
    S = Mutation(S, p1, p2);
    Update BestSet to be the m solutions in BestSet ∪ S with most negative sparsity coefficients;
  end;
  O = Set of data points covered by BestSet;
  return(BestSet, O);
end
Fig. 3. Evolutionary framework for outlier detection
The evolutionary search technique starts with a population of p random solutions and iteratively uses the processes of selection, crossover, and mutation to perform a combination of hill climbing, solution recombination, and random search over the space of possible projections. The process continues until the population converges to a global optimum; we use the De Jong [13] convergence criterion to determine the termination condition. At each stage of the algorithm, the m best projection solutions (those with the most negative sparsity coefficients) are tracked. At the end of the algorithm, these solutions are reported as the best projections in the data. The overall procedure for the genetic algorithm is illustrated in Fig. 3. The population of solutions in any given iteration is denoted by S. This set S is refined in subsequent iterations of the algorithm, and the best set of projections found so far is always maintained by the evolutionary algorithm.
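To make the string encoding and its fitness concrete, the sketch below (our own illustration) evaluates a candidate such as *3*9 against the discretized grid, reusing sparsity_coefficient from Sect. 1.2. More negative values indicate fitter, i.e., more abnormally sparse, solutions.

def fitness(solution, grid, phi):
    # solution: one symbol per dimension, e.g. ['*', 3, '*', 9], where '*'
    # is the don't-care allele and ranges are numbered 1..phi as in the text.
    N = len(grid)
    fixed = [(dim, r) for dim, r in enumerate(solution) if r != '*']
    n_D = sum(1 for row in grid
              if all(row[dim] == r - 1 for dim, r in fixed))  # 0-based grid indices
    return sparsity_coefficient(n_D, N, 1.0 / phi, len(fixed))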
• Selection: Several alternatives are possible [15] for selection in an evolutionary algorithm; the most popularly known ones are rank selection and fitness-proportional selection. The idea is to replicate copies of a solution by ordering the solutions by rank and biasing the population in favor of the higher-ranked solutions. This is called rank selection and is often more stable than straightforward fitness-proportional methods that sample the set of solutions in proportion to the actual value of the objective function. This strategy of biasing the population in favor of fitter strings, in conjunction with effective solution recombination, creates newer sets of children strings that are more likely to be fit. This results in a global hill climbing of an entire population of solutions. For the particular case of our implementation, we used a roulette wheel mechanism, where the probability of sampling a string from the population was proportional to p − r(i), where p is the total number of strings and r(i) is the rank of the i-th string. Note that the strings are ordered in such a way that the strings with the most negative sparsity coefficients occur first. Thus, the selection mechanism ensures that the new population is biased in such a way that the most abnormally sparse solutions are likely to have a greater number of copies. The overall selection algorithm is illustrated in Fig. 4.
Algorithm Selection(S)
begin
  Compute the sparsity coefficient of each solution in the population S;
  Let r(i) be the rank of solution i in order of sparsity coefficient (most negative occurs first);
  S' = null;
  for i = 1 to p do
  begin
    Roll a die whose i-th side has probability proportional to p − r(i);
    Add the solution corresponding to the side that comes up to S';
  end;
  Replace S by S';
  return(S);
end
Fig. 4. Selection criterion for the genetic algorithm
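A sketch of this roulette wheel in Python follows (our own rendering of Fig. 4, with random.choices standing in for the die roll).

import random

def selection(population, scores):
    # Rank-based roulette wheel: sort so the most negative sparsity
    # coefficients occur first, then sample the string of rank r(i)
    # with probability proportional to p - r(i).
    p = len(population)
    ranked = [s for _, s in sorted(zip(scores, population), key=lambda t: t[0])]
    weights = [p - rank for rank in range(1, p + 1)]  # rank 1 gets weight p-1
    return random.choices(ranked, weights=weights, k=p)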
• Crossover: Since the crossover technique is a key method in evolutionary algorithms for finding optimum combinations of solutions, it is important to implement this operation carefully in order for the overall method to work effectively. We will first discuss the natural two-point crossover mechanism used in evolutionary algorithms and show how to suitably modify it for the outlier detection problem.
Unbiased two-point crossover: The standard procedure in evolutionary algorithms is to use uniform two-point crossover in order to create the recombinant children strings. The two-point crossover mechanism works by determining a point in the string at random, called the crossover point, and exchanging the segments to the right of this point. For example, consider the strings 3*2*1 and 1*33*. If the crossover is performed after the third position, then the two resulting strings are 3*23* and 1*3*1. Note that in this case, both the parent and children strings correspond to three-dimensional projections in five-dimensional data. However, if the crossover occurred after the fourth position, then the two resulting children strings would be 3*2** and 1*331. These correspond to two-dimensional and four-dimensional projections, respectively. In general, since the evolutionary algorithm only finds projections of a given dimensionality in a run, this kind of crossover mechanism often creates infeasible solutions after the crossover process. Such solutions are discarded in subsequent iterations, since they are assigned very low fitness values. In general, evolutionary algorithms work very poorly when the recombination process cannot create solutions that are of high quality or that are viable in terms of feasibility. To address this, we create an optimized crossover process that takes both factors into account.
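The feasibility problem described above can be seen directly in a short sketch (again our own illustration, not the paper's code).

import random

def unbiased_crossover(s1, s2):
    # Exchange the segments to the right of a random crossover point.
    pos = random.randint(1, len(s1) - 1)
    return s1[:pos] + s2[pos:], s2[:pos] + s1[pos:]

# unbiased_crossover('3*2*1', '1*33*') with the point after position 4
# returns '3*2**' and '1*331': a two-dimensional and a four-dimensional
# projection from two three-dimensional parents, i.e., infeasible
# solutions when the run is searching for k = 3.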
Since it is clear that the dimensionality of the projection needs to be kept in mind while performing a crossover operation, it is desirable that the two children obtained after solution recombination also correspond to a k-dimensional projection. In order to achieve this goal, we need to classify the different positions in the string into three types. This classification is specific to a given pair of strings s1 and s2.
武汉重启Type I:Both strings have a“don’t care”.
Type II:Neither has a“don’t care”.Let us assume that there are k ≤k positions of this type.