Mining Data Streams:A Review
Mohamed Medhat Gaber,Arkady Zaslavsky and Shonali Krishnaswamy
Centre for Distributed Systems and Software Engineering,Monash University
900Dandenong Rd,Caulfield East,VIC3145,Australia
{Mohamed.Medhat.Gaber,Arkady.Zaslavsky,Shonali.Krishnaswamy}@ash.edu.au抗皱护肤品排行
好玩的课堂游戏
Abstract
The recent advances in hardware and software have enabled the capture of different measurements of data in a wide range of fields.The measurements are generated continuously and in a very high fluctuating data rates.Examples include nsor networks,web logs, and computer network traffic.The storage,querying and mining of such data ts are highly computationally challenging tasks.Mining data streams is concerned with extracting knowledge structures reprented in models and patterns in non stopping streams of information.The rearch in data stream mining has gained a high attraction due to the importance of its applications and the increasing generation of streaming information.Applications of data stream analysis can vary from critical scientific and astronomical applic
ations to important business and financial ones. Algorithms,systems and frameworks that address streaming challenges have been developed over the past three years.In this review paper,we prent the state-of-the-art in this growing vital field.
1-Introduction
The intelligent data analysis has pasd through a number of stages.Each stage address novel rearch issues that have arin.Statistical exploratory data analysis reprents the first stage.The goal was to explore the available data in order to test a specific hypothesis.With the advances in computing power, machine learning field has arin.The objective was to find computationally efficient solutions to data analysis problems.Along with the progress in machine learning rearch,new data analysis problems have been addresd.Due to the increa in databa sizes,new algorithms have been propod to deal with the scalability issue.Moreover machine learning and statistical analysis techniques have been adopted and modified in order to address the problem of very large databas.Data mining is that interdisciplinary field of study that can extract models and patterns from large amounts of information stored in data repositories[30, 31,34].
Advances in networking and parallel computation have lead to the introduction of distributed and par
allel data mining.The goal was how to extract knowledge from different subts of a datat and integrate the generated knowledge structures in order to gain a global model of the whole datat. Client/rver,mobile agent bad and hybrid models have been propod to address the communication overhead issue.Different variations of algorithms have been developed in order to increa the accuracy of the generated global model.More details about distributed data mining could be found in[47].
Recently,the data generation rates in some data sources become faster than ever before.This rapid generation of continuous streams of information has challenged our storage,computation and communication capabilities in computing systems. Systems,models and techniques have been propod and developed over the past few years to address the challenges[5,44].
是非得失
In this paper,we review the theoretical foundations of data stream analysis.Mining data stream systems,techniques are critically reviewed.Finally,we outline and discuss rearch problems in streaming mining field of study.The rearch issues should be addresd in order to realize robust systems that are capable of fulfilling the needs of data stream mining applications.吃苹果减肥
The paper is organized as follows.Section2 prents the theoretical background of data stream anal
ysis.Mining data stream techniques and systems are reviewed in ctions3and4respectively.Open and addresd rearch issues in this growing field are discusd in ction5.Finally ction6summarizes this review paper.
2-Theoretical Foundations
Rearch problems and challenges that have been arin in mining data streams have its solutions using well-established statistical and computational approaches. We can categorize the solutions to data-bad and task-bad ones.In data-bad solutions,the idea is to examine only a subt of the whole datat or to transform the data vertically or horizontally to an approximate smaller size data reprentation.At the other hand,in task-bad solutions,techniques from computational theory have been adopted to achieve time
and space efficient solutions.In this ction we review the theoretical foundations.
爱你无条件2.1Data-bad Techniques
Data-bad techniques refer to summarizing the whole datat or choosing a subt of the incoming stream to be analyzed.Sampling,load shedding and sketching techniques reprent the former one.
学校推广Synopsis data structures and aggregation reprent the later one.Here is an outline of the basics of the techniques with pointers to its applications in the context of data stream analysis.
2.1.1Sampling
Sampling refers to the process of probabilistic choice of a data item to be procesd or not.Sampling is an old statistical technique that has been ud for a long time. Boundaries of the error rate of the computation are given as a function of the sampling rate.Very Fast Machine Learning techniques[16]have ud Hoeffding bound to measure the sample size according to some derived loss functions.
The problem with using sampling in the context of data stream analysis is the unknown datat size.Thus the treatment of data stream should follow a special analysis to find the error bounds.Another problem with sampling is that it would be important to check for anomalies for surveillance analysis as an application in mining data streams.Sampling may not be the right choice for such an application.Sampling also does not address the problem of fluctuating data rates.It would be worth investigating the relationship among the three parameters:data rate,sampling rate and error bounds.
2.1.2Load Shedding
Load shedding refers[6,52]to the process of dropping a quence of data streams.Load shedding has been ud successfully in querying data streams.It has the same problems of sampling.Load shedding is difficult to be ud with mining algorithms becau it drops chunks of data streams that could be ud in the structuring of the generated models or it might reprent
a pattern of interest in time ries analysis.
2.1.3Sketching
Sketching[5,44]is the process of randomly project a subt of the features.It is the process of vertically sample the incoming stream.Sketching has been applied in comparing different data streams and in aggregate queries.The major drawback of sketching is that of accuracy.It is hard to u it in the context of data stream mining.Principal Component Analysis(PCA)would be a better solution that has been applied in streaming applications[38].
2.1.4Synopsis Data Structures
Creating synopsis of data refers to the process of applying summarization techniques that are capab
le of summarizing the incoming stream for further analysis. Wavelet analysis[25],histograms,quantiles and frequency moments[5]have been propod as synopsis data structures.Since synopsis of data does not reprent all the characteristics of the datat, approximate answers are produced when using such data structures.
2.1.5Aggregation
Aggregation is the process of computing statistical measures such as means and variance that summarize the incoming stream.Using this aggregated data could be ud by the mining algorithm.The problem with aggregation is that it does not perform well with highly fluctuating data distributions.Merging online aggregation with offline mining has been studies in[1,2, 3].
2.2Task-bad Techniques
Task-bad techniques are tho methods that modify existing techniques or invent new ones in order to address the computational challenges of data stream processing.Approximation algorithms,sliding window and algorithm output granularity reprent this category. In the following subctions,we examine each of the techniques and its application in the context of data stream analysis.
2.2.1Approximation algorithms Approximation algorithms[44]have their roots in algorithm design.It is concerned with design algorithms for computationally hard problems.The algorithms can result in an approximate solution with error bounds. The idea is that mining algorithms are considered hard computational problems given its features of continuality and speed and the generating environment that is featured by being resource constrained. Approximation algorithms have attracted rearchers as a direct solution to data stream mining problems. However,the problem of data rates with regard with the available resources could not be solved using approximation algorithms.Other tools should be ud along with the algorithms in order to adapt to the
available resources.Approximation algorithms have been ud in[13]
2.2.2Sliding Window
The inspiration behind sliding window is that the ur is more concerned with the analysis of most recent data streams.Thus the detailed analysis is done over the most recent data items and summarized versions of the old ones.This idea has been adopted in many techniques in the undergoing comprehensive data stream mining system MAIDS[17].
2.2.3Algorithm Output Granularity
The algorithm output granularity(AOG)[21,22,23] introduces the first resource-aware data analysis approach that can cope with fluctuating very high data rates according to the available memory and the processing speed reprented in time constraints.The AOG performs the local data analysis on a resource constrained device that generates or receive streams of information.AOG has three main stages.Mining followed by adaptation to resources and data stream rates reprent the first two stages.Merging the generated knowledge structures when running out of memory reprents the last stage.AOG has been ud in clustering,classification and frequency counting[21].
Having discusd the different theoretical approaches to data stream analysis problems,the following ction is devoted to stream mining techniques that u the above theoretical approaches in different ways.
3-Mining Techniques
Mining data streams has attracted the attention of data mining community for the last three years.A number of algorithms have been propod for extracting knowledge from streaming information.In this ction, we review clustering,classification,frequency counting and time ries analysis techniques.
3.1Clustering
Guha et al.[27,28]have studied analytically clustering data streams using K-median technique.The propod algorithm makes a single pass over the data stream and us small space.It requires O(nk)time and O(nİ)space where“k”is the number of centers,“n”is the number of points andİ<1.They have proved that any k-median algorithm that achieves a constant factor approximation can not achieve a better run time than O(nk).The algorithm starts by clustering a calculated size sample according to the available memory into2k,and then at a cond level,the algorithm clusters the above points for a number of samples into2k and this process is repeated to a number of levels,and finally it clusters the2k clusters into k clusters.
Babcock et al.[7]have ud exponential histogram(EH)data structure to improve Guha et al. algorithm[27].They u the same method described above,however they address the problem of merging clusters when the two ts of cluster centers to be merged are far apart by maintaining the EH data structure.They have studied their propod algorithm analytically.
Charikar et al[11]have propod another k-median algorithm that overcomes the problem of increasing approximation factors in the Guha et al[27] algorithm with the increa in the number of le
vels ud to result in the final solution of the divide and conquer algorithm.The algorithm has also been studied analytically
池州人口Domingos et al.[15,16,35]have propod a general method for scaling up machine learning algorithms.They have termed this approach Very Fast Machine Learning VFML.This method depends on determining an upper bound for the learner’s loss as a function in number of data items to be examined in each step of the algorithm.They have applied this method to K-means clustering VFKM and decision tree classification VFDT techniques.The algorithms have been implemented and evaluated using synthetic data ts as well as real web data streams.VFKM us the Hoeffding bound to determine the number of examples needed in each step of K-means algorithm.The VFKM runs as a quence of K-means executions with each run us more examples than the previous one until a calculated statistical bound(Hoeffding bound)is satisfied.
Ordonez[46]has propod veral improvements to k-means algorithm to cluster binary data streams.He has developed an incremental k-means algorithm.The experiments were conducted on real data ts as well as synthetic ones.He has demonstrated experimentally that the propod algorithm outperforms the scalable k-means in the majority of cas.The propod algorithm is a one pass algorithm in O(Tkn) complexity,where T is the average transaction size,n is number of tran
sactions and k is number of centers.The u of binary data simplifies the manipulation of categorical data and eliminates the need for data normalization.The main idea behind the propod algorithm is that it updates the cluster centers and weights after examining a batch of transactions which equalizes square root of the number of transactions rather than updating them one by one.
O’Challaghan et al.[45]have propod STREAM and LOCALSEARCH algorithms for high quality data stream clustering.The STREAM algorithm
starts by determining the size of the sample and then applies the LOCALSEARCH algorithm if the sample size is larger than a pre-specified equation result.This process is repeated for each data chunk.Finally,the LOCALSEARCH algorithm is applied to the cluster centers generated in the previous iterations.
Aggarwal et al.[1]have propod a framework for clustering data steams called CluStream algorithm. The propod technique divides the clustering process into two components.The online component stores summarized statistics about the data streams and the offline one performs clustering on the summarized data according to a number of ur preferences such as the time frame and the number of clusters.A number of experiments on real datats have been conducted to prove the accuracy and efficiency of the propod algorithm.They[2]have recently propod HPStream;
a projected clustering for high dimensional data streams. HPStream has outperformed CluStream in recent results.
Keogh et al[39]have proved empirically that most highly cited clustering of time ries data streams algorithms propod so far in the literature come out with meaningless results in subquence clustering. They have propod a solution approach using k-motif to choo the subquences that the algorithm can work on to produce meaningful results.
Gaber et al.[21]have developed Lightweight Clustering LWC.It is an AOG-bad algorithm.AOG has been discusd in ction2.The algorithm adjusts a threshold that reprents the minimum distance measure between data items in different clusters.This adjustment is done regularly according to a pre-specified time frame.It is done according to the available resources by monitoring the input-output rate.This process is followed by merging clusters when the memory is full.
3.2Classification
Wang et al.[53]have propod a general framework for mining concept drifting data streams.They have obrved that data stream mining algorithms propod so far have not addresd the concept of drifting in the evolving data.The propod technique us weighted classifier enmbles to mine d
ata streams.The expiration of old data in their model depends on the data distribution.They u synthetic and real life data streams to test their algorithm and compare between the single classifier and classifier enmbles.The propod algorithm combines multiple classifiers weighted by their expected prediction accuracy.Also the lection of number of classifiers instead of using all is an option in the propod framework without loosing accuracy in the classification process.
Ganti et al.[18]have developed analytically an algorithm for model maintenance under inrtion and deletion of blocks of data records.This algorithm can be applied to any incremental data mining model.They have also described a generic framework for change detection between two data ts in terms of the data mining results they induce.They formalize the above two techniques into two general algorithms:GEMM and FOCUS.The algorithms have been applied to decision tree models and the frequent itemt model.GEMM algorithm accepts a class of models and an incremental model maintenance algorithm for the unrestricted window option,and outputs a model maintenance algorithm for both window-independent and window-dependent block lection quence.FOCUS framework us the difference between data mining models as the deviation in data ts.
Domingos et al.[15]have developed VFDT.It is a decision tree learning systems bad on Hoeffding trees.It splits the tree using the current best attribute taking into consideration that the number of exa
mined data items ud satisfies a statistical measure which is Hoeffding bound.The algorithm also deactivates the least promising leaves and drops the non-potential attributes.
Papadimitriou et al.[48]have propod AWSOM(Arbitrary Window Stream mOdeling Method)for interesting pattern discovery from nsors. They developed a one-pass algorithm to incrementally update the patterns.Their method requires only O(log N)memory where N is the length of the quence.They conducted experiments with real and synthetic data ts. They u wavelet coefficients as compact information reprentation and correlation structure detection,and then apply a linear regression model in the wavelet domain.
Aggarwal et al.have adopted the idea of micro-clusters introduced in CluStream in On-Demand classification[3]and it shows a high accuracy.The technique us clustering results to classify data using statistics of class distribution in each cluster.
Last[41]has propod an online classification system that can adapt to concept drift.The system re-builds the classification model with the most recent examples.Using the error rate as a guide to concept drift,the frequency of model building and the window size are adjusted.The system us info-fuzzy techniques for model building and information theory to calculate the window size.
Ding et al.[14]have developed a decision tree bad on Peano count tree data structure.It has been shown experimentally that it is a fast building algorithm that is suitable for streaming applications.
Gaber et al.[21]have developed Lightweight Classification LWClass.It is a variation of LWC.It is also an AOG-bad technique.The idea is to u K-nearest neighbors with updating the frequency of class occurrence given the data stream features.In ca of contradiction between the incoming stream and the
stored summary of the cas,the frequency is reduced. In ca of the frequency is equalized to zero,all the cas reprented by this class is relead from the memory.
3.3Frequency Counting
Giannella et al.[20]have developed a frequent itemts mining algorithm over data stream.They have propod the u of tilted windows to calculate the frequent patterns for the most recent transactions bad on the fact that urs are more interested in the most recent transactions.They u an incremental algorithm to maintain the FP-stream which is a tree data structure to reprent the frequent itemts.They conducted a number of experiments to prove the algorithm efficiency.
Manku and Motwani[43]have propod and implemented an approximate frequency counts in data streams.The implemented algorithm us all the previous historical data to calculate the frequent patterns incrementally.
Cormode and Muthukrishnan[13]have developed an algorithm for counting frequent items.The algorithm us group testing to find the hottest k items. The algorithm is ud with the turnstile data stream model which allows addition as well as deletion of data items.An approximation randomized algorithm has been ud to approximately count the most frequent items.It is worth mentioning that this data stream model is the hardest to analyze.Time ries and cash register models are computationally easier.The former does not allow increments and decrements and the later one allows only increments.
Gaber et al.[21]have developed one more AOG-bad algorithm:Lightweight frequency counting LWF.It has the ability to find an approximate solution to the most frequent items in the incoming stream using adaptation and releasing the least frequent items regularly in order to count the more frequent ones.
3.4Time Series Analysis
Indyk et al.[36]have propod approximate solutions with probabilistic error bounding to two problems in time ries analysis:relaxed periods and average trends. The algorithms u dimensionality reduction sketching techniques.The process starts with computing the sketches over an arbitrarily chon time window and creating what so called sketch pool.Using this pool of sketches,relaxed periods and average trends are computed.The algorithms have shown experimentally efficiency in running time and accuracy.
Perlman and Java[49]have propod a two pha approach to mine astronomical time ries streams.The first pha clusters sliding window patterns of each time ries.Using the created clusters,an association rule discovery technique is ud to create affinity analysis results among the created clusters of time ries.
Zhu and Shasha[54]have propod techniques to compute some statistical measures over time ries data streams.The propod techniques u discrete Fourier transform.The system is called StatStream and is able to compute approximate error bounded correlations and inner products.The system works over an arbitrarily chon sliding window.
Lin et al.[42]have propod the u of symbolic reprentation of time ries data streams. This repr
entation allows dimensionality/numerosity reduction.They have demonstrated the applicability of the propod reprentation by applying it to clustering, classification,indexing and anomaly detection.The approach has two main stages.The first one is the transformation of time ries data to Piecewi Aggregate Approximation followed by transforming the output to discrete string symbols in the cond stage.
Chen et al.[12]have propod the application of what so called regression cubes for data streams.Due to the success of OLAP technology in the application of static stored data,it has been propod to u multidimensional regression analysis to create a compact cube that could be ud for answering aggregate queries over the incoming streams.This rearch has been extended to be adopted in an undergoing project Mining Alarming Incidents in Data Streams MAIDS.
Himberg et al.[33]have prented and analyzed randomized variations of gmenting time ries data streams generated onboard mobile phone nsors.One of the applications of clustering time ries discusd:Changing the ur interface of mobile phone screen according to the ur context.It has been proven in this study that Global Iterative Replacement provides approximately an optimal solution with high efficiency in running time.
Guralnik and Srivastava[29]have developed a generic event detection approach of time ries streams. They have developed techniques for batch and online incremental processing of time ries data.The techniques have proven efficiency with real and synthetic data ts.
4-Systems
Several applications have stimulated the development of robust streaming analysis systems.The following reprents a list of such applications.寒菊宋郑思肖
•Burl et al.[9]have developed Diamond Eye for NASA and JPL.The aim of this project to enable remote computing systems as well as obrving