Data Min Knowl Disc
DOI 10.1007/s10618-014-0372-z
Leveraging the power of local spatial autocorrelation
in geophysical interpolative clustering
Annalisa Appice · Donato Malerba
Received: 16 December 2012 / Accepted: 22 June 2014
© The Author(s) 2014
Abstract Nowadays ubiquitous sensor stations are deployed worldwide, in order to measure several geophysical variables (e.g. temperature, humidity, light) for a growing number of ecological and industrial processes. Although these variables are, in general, measured over large zones and long (potentially unbounded) periods of time, stations cannot cover every location in space. On the other hand, due to their huge volume, the data produced cannot be entirely recorded for future analysis. In this scenario, the computation of aggregates of data can be used to reduce the amount of produced data stored on disk, while the estimation of unknown data at each location of interest can be used to supplement station records. We illustrate a novel data mining solution, named interpolative clustering, that has the merit of addressing both tasks in time-evolving, multivariate geophysical applications. It yields a time-evolving clustering model, in order to summarize geophysical data, and computes a weighted linear combination of cluster prototypes, in order to predict data. Clustering is done by accounting for the local presence of the spatial autocorrelation property in the geophysical data. Weights of the linear combination are defined so as to reflect the inverse distance of the unseen data to each cluster geometry. The cluster geometry is represented through shape-dependent sampling of the geographic coordinates of clustered stations. Experiments performed with several data collections investigate the trade-off between the summarization capability and predictive accuracy of the presented interpolative clustering algorithm.
Keywords Spatial autocorrelation · Clustering · Inverse distance weighting · Geophysical data stream
Responsible editors: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen and Filip Železný.
A. Appice (B) · D. Malerba
Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”,
Via Orabona 4, 70125 Bari, Italy
e-mail: annalisa.appice@uniba.it
D. Malerba
e-mail: donato.malerba@uniba.it
1 Introduction
The widespread use of sensor networks has paved the way for the rapid spread of geophysical data streams (i.e., streams of data that are measured repeatedly over a set of locations). In practice, remote sensors are installed worldwide. They gather information on a number of variables over large zones and long (potentially unbounded) periods of time. In this scenario, the spatial distribution of data sources, as well as the temporal distribution of measures, pose new challenges in the collection and querying of data. Much scientific and industrial interest has recently been focused on the deployment of data management systems that gather continuous multivariate data from several data sources, recognize and possibly adapt a behavioral model, and deal with queries that concern present and past data, as well as seen and unseen data. This poses specific issues that include storing (unbounded) data in their entirety on disks with finite memory (Chiky and Hébrail 2008), as well as looking for predictions (estimations) where no measured data are available (Li and Heap 2008).
Summarization is one solution for addressing storage limits, while interpolation is one solution for supplementing unseen data. So far, both tasks, namely summarization and interpolation, have been extensively investigated in the literature. However, most of the studies consider only one task at a time. Several summarization algorithms, e.g. Rodrigues et al. (2008), Chen et al. (2010), Appice et al. (2013a), have been defined in data mining, in order to compute fast, compact summaries of geophysical data as they arrive. Data storage systems store the computed summaries, while discarding the actual data. Various interpolation algorithms, e.g. Shepard (1968b), Krige (1951), have been defined in geostatistics, in order to predict unseen measures of a geophysical variable. They use predictive inferences based upon actual measures sampled at specific locations of space. In this paper, we investigate a holistic approach that links predictive inferences to data summaries. To this end, we introduce a summarization pattern of geophysical data, which can be computed to save memory space and make predictive inferences easier. We use predictive inferences that exploit the knowledge in data summaries, in order to yield accurate predictions covering any (seen and unseen) space location.
We begin by observing that the common factor of several summarization and interpolation algorithms is that they accommodate the spatial autocorrelation analysis in the learned model. Spatial autocorrelation is the cross-correlation of values of a variable strictly due to their relatively close locations on a two-dimensional surface. Spatial autocorrelation exists when there is systematic spatial variation in the values of a given property. This variation can exist in two forms, called positive and negative spatial autocorrelation (Legendre 1993). In the positive case, the value of a variable at a given location tends to be similar to the values of that variable in nearby locations. This means that if the value of some variable is low at a given location, the presence of positive spatial autocorrelation indicates that nearby values are also low. Conversely, negative spatial autocorrelation is characterized by dissimilar values at nearby locations. Goodchild (1986) remarks that, in geophysical variables, positive autocorrelation is seen much more frequently in practice than negative autocorrelation. This is justified by Tobler’s first law of geography, according to which “everything is related to everything else, but near things are more related than distant things” (Tobler 1979).
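To make the distinction between positive and negative spatial autocorrelation concrete, the following minimal sketch computes a global autocorrelation statistic for one variable. It uses Moran's I, a standard statistic that is not named in this section and is chosen here only for illustration; the binary distance-band neighborhood and the names `morans_i`, `coords` and `radius` are assumptions of this example, not part of the presented algorithm.

```python
import numpy as np

def morans_i(values, coords, radius):
    """Global Moran's I with binary distance-band weights (illustrative assumption).

    values : (n,) measurements of one variable
    coords : (n, 2) station coordinates
    radius : two distinct stations are neighbors if their distance is <= radius
    """
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)
    # Pairwise Euclidean distances between stations.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # w[i, j] = 1 for distinct stations within the radius, 0 otherwise.
    w = ((dist > 0) & (dist <= radius)).astype(float)
    z = values - values.mean()
    denom = w.sum() * (z ** 2).sum()
    if denom == 0:
        raise ValueError("no neighboring pairs, or the variable is constant")
    # Positive result: nearby values tend to be similar; negative: dissimilar.
    return n * (z @ w @ z) / denom
```

On a regular grid, a smooth gradient gives a positive value, while a checkerboard pattern, where each station differs from all of its adjacent neighbors, gives a negative one.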
As observed by LeSage and Pace (2001), the analysis of spatial autocorrelation is crucial for building a reliable spatial component into any (statistical) model for geophysical data. With the same viewpoint, we propose to: (i) model the property of spatial autocorrelation when collecting the data records of a number of geophysical variables, (ii) use this model to compute compact summaries of the actual data, which are then discarded, and (iii) inject the computed summaries into predictive inferences, in order to yield accurate estimations of geophysical data at any space location.
The paper is organized as follows. The next section clarifies the motivation and the actual contribution of this paper. In Sect. 3, related works regarding spatial autocorrelation, spatial interpolators and clustering are reported. In Sect. 4, we report the basics of the presented algorithm, while in Sect. 5, we describe the proposed algorithm. An experimental study with several data collections is presented in Sect. 6, and conclusions are then drawn.
2 Motivation and contributions
The analysis of the property of spatial autocorrelation in geophysical data poses specific issues.
One issue is that most of the models that represent and learn data with spatial autocorrelation are based on the assumption of spatial stationarity. Thus, they assume a constant mean and a constant variance (no outliers) across space (Stojanova et al. 2012). This means that possibly significant variabilities in autocorrelation dependencies throughout the space are overlooked. The variability could be caused by a different underlying latent structure of the space, which varies among its portions in terms of scale of values or density of measures. As pointed out by Angin and Neville (2008), when autocorrelation varies significantly throughout space, it may be more accurate to model the dependencies locally rather than globally.
Another issue is that the spatial autocorrelation analysis is frequently decoupled from the multivariate analysis. In this case, the learning process accounts for the spatial autocorrelation of univariate data, while dealing with distinct variables separately (Appice et al. 2013b). Bailey and Krzanowski (2012) observe that ignoring complex interactions among multiple variables may overlook interesting insights into the correlation of potentially related variables at any site. Based upon this idea, Dray and Jombart (2011) formulate a multivariate definition of the concept of spatial autocorrelation, which centers on the extent to which values for a number of variables observed at a given location show a systematic (more than likely under spatial randomness), homogeneous association with values observed at the “neighboring” locations.
In this paper, we develop an approach to modeling non-stationary spatial autocorrelation of multivariate geophysical data by using interpolative clustering. As in clustering, clusters of records that are similar to each other at nearby locations are identified, but a cluster description and a predictive model are associated with each cluster. Data records are aggregated through clusters based on the cluster descriptions. The associated predictive models, which provide predictions for the variables, are stored as summaries of the clustered data. On any future demand, the predictive models retrieved from the database are processed according to the request, in order to yield accurate estimates for the variables. Interpolative clustering is a form of conceptual clustering (Michalski and Stepp 1983) since, besides the clusters themselves, it also provides symbolic descriptions (in the form of conjunctions of conditions) of the constructed clusters. Thus, we can also plan to consider this description, in order to obtain clusters in different contexts of the same domain. However, in contrast to conceptual clustering, interpolative clustering is a form of supervised learning. On the other hand, interpolative clustering is similar to predictive clustering (Blockeel et al. 1998), since it is a form of supervised learning. However, unlike predictive clustering, where the predictive space (target variables) is typically distinguished from the descriptive one (explanatory variables),¹ the variables of interpolative clustering play, in principle, both target and explanatory roles.
Interpolative clustering trees (ICTs) are a class of tree-structured models where a split node is associated with a cluster and a leaf node with a single predictive model for the target variables of interest. The top node of the ICT contains the entire sample of training records. This cluster is recursively partitioned along the target variables into smaller sub-clusters. A predictive model (the mean) is computed for each target variable and then associated with each leaf. All the variables are predicted independently. In the context of this paper, an ICT is built by integrating the consideration of a local indicator of spatial autocorrelation, in order to account for significant variabilities of autocorrelation dependencies in training data. Spatial autocorrelation is coupled with a multivariate analysis by accounting for the spatial dependence of data and their multivariate variance simultaneously (Dray and Jombart 2011). This is done by maximizing the variance reduction of local indicators of spatial autocorrelation computed for multivariate data when evaluating the candidates for adding a new node to the tree. This solution has the merit of improving both the summarization and the predictive performance of the computed models. Memory space is saved by storing a single summary for data of multiple variables. Predictive accuracy is increased by exploiting the autocorrelation of data clustered in space.
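The split evaluation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the local indicator values are taken as a precomputed input (`indicator`, one column per variable), since their definition is not given in this section; the per-variable variances are simply summed; and the names `variance_reduction` and `best_split`, as well as the use of midpoint thresholds on a single variable `x`, are hypothetical choices of this example.

```python
import numpy as np

def variance_reduction(indicator, mask):
    """Variance reduction obtained by splitting records into mask / ~mask.

    indicator : (n,) or (n, m) precomputed local autocorrelation indicators
                (assumed given; one column per variable)
    mask      : (n,) boolean array, True for records sent to the left child
    """
    indicator = np.asarray(indicator, dtype=float)
    if indicator.ndim == 1:
        indicator = indicator[:, None]
    mask = np.asarray(mask, dtype=bool)
    n, n_left = len(indicator), int(mask.sum())
    n_right = n - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty reduces nothing
    # Variances are summed over the variables (an assumption of this sketch).
    total = indicator.var(axis=0).sum()
    left = indicator[mask].var(axis=0).sum()
    right = indicator[~mask].var(axis=0).sum()
    return float(total - (n_left / n) * left - (n_right / n) * right)

def best_split(x, indicator):
    """Pick the threshold on variable x that maximizes the variance reduction."""
    x = np.asarray(x, dtype=float)
    candidates = np.unique(x)
    best_t, best_gain = None, 0.0
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates[:-1], candidates[1:]):
        t = (lo + hi) / 2.0
        gain = variance_reduction(indicator, x <= t)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

Applied recursively to each resulting sub-cluster, a search of this kind yields a tree whose leaves group records with homogeneous indicator values.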
1 The predictive clustering framework is originally defined in Blockeel et al. (1998), in order to combine clustering problems and classification/regression problems. The predictive inference is performed by distinguishing between target variables and explanatory variables. Target variables are considered when evaluating similarity between training data, such that training examples with similar target values are grouped in the same cluster, while training examples with dissimilar target values are grouped in separate clusters. Explanatory variables are used to generate a symbolic description of the clusters. Although the algorithm presented in Blockeel et al. (1998) can, in principle, be run by considering the same set of variables for both explanatory and target roles, this case is not investigated in the original study.
From the summarization perspective, an ICT is used to summarize geophysical data according to a hierarchical view of the spatial autocorrelation. We can browse the generated clusters at different levels of the hierarchy. Predictive models on leaf clusters model spatial autocorrelation dependencies as stationary over the local geometry of the clusters. From the interpolation perspective, an ICT is used to compute knowledge that makes accurate predictive inferences easier. We can use Inverse Distance Weighting² (Shepard 1968b) to predict variables at a specific location by a weighted linear combination of the predictive models on the leaf clusters. Weights are inverse functions of the distance of the query point from the clusters.
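The inverse-distance-weighted combination of leaf-cluster models can be sketched as follows. This is a minimal illustration, not the authors' implementation: each cluster is assumed to be given as a pair of sampled station coordinates and a prototype (the per-variable means), the distance from the query point to a cluster is taken as the distance to its nearest sampled coordinate, and the exponent `power=2.0` is an assumed default. The function name `idw_predict` is hypothetical.

```python
import numpy as np

def idw_predict(query, clusters, power=2.0):
    """Predict all variables at `query` from leaf-cluster prototypes.

    query    : (d,) coordinates of the location of interest
    clusters : iterable of (points, prototype) pairs, where `points` is a
               (k, d) sample of station coordinates describing the cluster
               geometry and `prototype` holds the cluster means
    power    : exponent of the inverse distance (assumed value)
    """
    query = np.asarray(query, dtype=float)
    dists, protos = [], []
    for points, prototype in clusters:
        points = np.asarray(points, dtype=float).reshape(-1, query.size)
        # Distance to the cluster = distance to its nearest sampled point
        # (an assumption of this sketch).
        dists.append(np.linalg.norm(points - query, axis=1).min())
        protos.append(np.asarray(prototype, dtype=float))
    dists = np.array(dists)
    protos = np.vstack(protos)
    zero = dists == 0
    if zero.any():
        # The query coincides with a sampled point: use that cluster directly.
        return protos[zero].mean(axis=0)
    weights = 1.0 / dists ** power
    return weights @ protos / weights.sum()
```

Clusters close to the query location dominate the estimate, while distant clusters contribute little, which is the behavior the weighting scheme is meant to capture.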
Finally, we can observe that an ICT provides a static model of a geophysical phenomenon. Nevertheless, inferences based on static models of spatial autocorrelation require temporal stationarity of the statistical properties of variables. In the geophysical context, data are frequently subject to the temporal variation of such properties. This requires dynamic models that can be updated continuously as fresh data arrive (Gama 2010). In this paper, we propose an incremental algorithm for the construction of a time-adaptive ICT. When a new sample of records is acquired through the stations of a sensor network, a past ICT is modified, in order to model new data of the process, which may change their properties over time. In theory, a distinct tree can be learned for each time point and several trees can be subsequently combined using some general framework (e.g. Spiliopoulou et al. 2006) for tracking cluster evolution over time. However, this solution is prohibitively time-consuming when data arrive at a high rate. By taking into account that (1) geophysical variables are often slowly time-varying, and (2) a change of the properties of the data distribution of variables is often restricted to a delimited group of stations, more efficient learning algorithms can be derived. In this paper, we propose an algorithm that retains past clusters as long as they discriminate between surfaces of spatially autocorrelated data, while it mines novel clusters only if the past ones become inaccurate. In this way, we can save computation time and track the evolution of the cluster model by detecting changes in the data properties at no additional cost.
The specific contributions in this paper are: (1) the investigation of the property of spatial autocorrelation in interpolative clustering; (2) the development of an approach that uses a local indicator of the spatial autocorrelation property, in order to build an ICT by taking into account non-stationarity in autocorrelation and multivariate analysis; (3) the development of an incremental algorithm to yield a time-evolving ICT that accounts for the fact that the statistical properties of the geophysical data may change over time; (4) an extensive evaluation of the effectiveness of the proposed (incremental) approach on several real geophysical data collections.
3 Related works
This work has been motivated by the research literature on the property of spatial autocorrelation and its influence on interpolation theory and (predictive) clustering. In the following subsections, we report related works from these research lines.
2 Inverse distance weighting is a common interpolation algorithm. It has several advantages that endorse its widespread use in geostatistics (Li and Revesz 2002; Karydas et al. 2009; Li et al. 2011): simplicity of implementation; lack of tunable parameters; ability to interpolate scattered data and work on any grid without suffering from multicollinearity.