An Incremental Clustering Algorithm Based on Grid


Guohua Lei, Xiang Yu, Xianfei Yang, Shuang Chen
Department of Computer Science and Technology, Heilongjiang Institute of Technology
Department of Computer Science and Technology, Harbin Engineering University
Department of Economics and Management, Heilongjiang Nongken Vocational College
Harbin, Heilongjiang Province, China
Abstract: The existing grid-based clustering algorithms are analyzed. Grid-based clustering algorithms have the advantages of handling high-dimensional data and high efficiency; however, traditional grid-based algorithms are greatly influenced by the granularity of the grid partition. An incremental grid-based clustering algorithm, called IGrid, is proposed. IGrid retains the high efficiency of traditional grid-based clustering algorithms, and it also partitions the grid space by dimensional radius in a dynamic and incremental manner to improve the quality of clustering. Experiments on real and synthetic data sets show that IGrid outperforms traditional grid-based clustering algorithms in both speed and accuracy.
Key words: incremental; clustering; grid; data mining
I. INTRODUCTION
Clustering analysis is an important research topic in data mining. It can be used as an independent tool to discover the distribution of data, and some other algorithms use clustering analysis as a preprocessing step. Most traditional grid-based clustering algorithms run fast: their processing time is independent of the number of data objects and depends only on the number of units in each dimension of the data space [1,2].
Some representative grid-based clustering algorithms are STING, which processes statistical information stored in grid units; WaveCluster, which clusters data objects by a kind of wavelet transformation; and CLIQUE, which is a density- and grid-based clustering algorithm for high-dimensional spaces.
STING is a multi-resolution grid-based clustering technique that divides the data space into rectangular units. For different levels of resolution there exist rectangular units of different grades, which form a hierarchical structure, and the attribute statistics of each grid unit are computed and stored in advance. STING is highly efficient: for n data objects, once the hierarchical structure is formed with m grid units at the lowest level, the clustering time complexity of STING is O(n) and the query processing time is O(m). The clustering quality of STING mainly depends on the granularity of the lowest level of the grid structure. A fine granularity increases the processing cost, which becomes more pronounced as the dimensionality grows; however, if the granularity of the lowest level is too coarse, the clustering quality decreases.
WaveCluster is a multi-resolution clustering algorithm. It first imposes a multi-dimensional grid structure on the data space to summarize the data, then transforms the original feature space by wavelet transformation to find dense regions in the transformed space. WaveCluster can process large data sets efficiently and handles outliers well; it is not sensitive to the input order and does not require parameters such as a neighborhood radius.
CLIQUE integrates density-based and grid-based algorithms and can cluster high-dimensional data in large databases. Given a multi-dimensional data set, it distinguishes the dense units from the sparse units of the space in order to find the global distribution pattern of the data set. If the number of data points in a grid unit exceeds a certain input parameter, the unit is dense. CLIQUE is not sensitive to the input order of tuples and scales well as the dimensionality increases. However, when clustering in all dimensions, too many units appear, so the performance of CLIQUE may degrade, and the precision of the clustering result suffers from the method's oversimplification.
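As a toy illustration of the dense-unit test just described (not the original CLIQUE implementation), the following sketch maps points onto an equal-width grid and keeps the units whose point count exceeds an input parameter; the function and parameter names are illustrative assumptions:

```python
# Toy sketch of a CLIQUE-style dense-unit test: a grid unit is dense when
# the number of points that fall into it exceeds an input parameter.
# Names (dense_units, cell_size, threshold) are illustrative, not from CLIQUE.
from collections import Counter

def dense_units(points, cell_size, threshold):
    """Assign each point to an axis-aligned grid cell, then keep the dense cells."""
    counts = Counter(tuple(int(v // cell_size) for v in p) for p in points)
    return {cell for cell, count in counts.items() if count > threshold}

pts = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 5.0)]
print(dense_units(pts, cell_size=1.0, threshold=2))  # the cell (0, 0) holds 3 points
```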
However, traditional grid-based clustering algorithms are greatly influenced by the grid granularity: a coarse grid lowers the quality of clustering, while a fine grid lowers clustering efficiency. An incremental grid-based clustering algorithm, IGrid, is proposed in this paper. It retains the high efficiency of traditional grid-based clustering algorithms and partitions the data space dynamically and incrementally with the vector of dimensional radii in order to improve the quality of clustering.
In this paper, we first present the related definitions, then the idea of IGrid and its steps. The experimental results show that IGrid surpasses traditional grid-based clustering algorithms in both efficiency and clustering accuracy. Finally, we summarize the paper and point out directions for future research.
II. DEFINITIONS
A. Definitions
Suppose D = {x_1, x_2, ..., x_n} is a data set in d-dimensional data space S, where the subscript n represents the number of data points in the data set. Each data point x_i = {x_i1, x_i2, ..., x_ij, ..., x_id}, and x_ij (j = 1, 2, ..., d) denotes the j-th dimensional value of the i-th data point.
Definition 1 (δ neighborhood) Suppose x_ij is the j-th dimensional value of data point x_i in data set D. The j-th dimensional δ neighborhood of x_i is [x_ij − δ, x_ij + δ], a partition on dimension j whose center is x_ij, where δ is the dimensional radius of this dimension.
Definition 2 (δ neighborhood vector) δ = {δ_1, δ_2, ..., δ_d} is the δ neighborhood vector, which consists of the δ_i of each dimension; δ_i is the dimensional radius of dimension i.
Definition 3 (grid volume) V = ∏_{j=1}^{d} span_j is the product of all span_j, where span_j is the extent of the grid in dimension j.
Definition 4 (relative density of grid) The ratio of the number of data points n contained in a grid to the volume V of the grid, that is, n/V.
Definition 5 (grid common face) Suppose two grids g1 and g2 are identical in d − 1 dimensions and, in the remaining dimension, are different but adjacent to each other; then there is a grid common face between g1 and g2.
Definition 6 (grid adjacency) If there is a grid common face between g1 and g2, or there exists a grid g3 such that g1 and g3 share a grid common face and g2 and g3 share a grid common face, then g1 and g2 are grid adjacent. Following the definition of cluster in the CLIQUE algorithm [3], a micro-cluster is the largest set consisting of all adjacent grids whose relative density exceeds a certain threshold ρ_0.
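Definitions 3-5 can be made concrete with a small sketch. It assumes each grid is stored as axis-aligned per-dimension (low, high) bounds plus a point count; all names are illustrative, not part of IGrid:

```python
# Sketch of Definitions 3-5: grid volume, relative density, and grid common face.
# A grid is a dict with "bounds" (a list of per-dimension (low, high) pairs)
# and "n" (number of data points); these names are assumptions for illustration.

def volume(grid):
    """Definition 3: product of the grid's extent (span) over all dimensions."""
    v = 1.0
    for low, high in grid["bounds"]:
        v *= high - low
    return v

def relative_density(grid):
    """Definition 4: number of contained points divided by grid volume."""
    return grid["n"] / volume(grid)

def share_common_face(g1, g2):
    """Definition 5: equal in d-1 dimensions, adjacent in the remaining one."""
    differing = [j for j in range(len(g1["bounds"]))
                 if g1["bounds"][j] != g2["bounds"][j]]
    if len(differing) != 1:
        return False
    (l1, h1) = g1["bounds"][differing[0]]
    (l2, h2) = g2["bounds"][differing[0]]
    return h1 == l2 or h2 == l1  # the intervals touch end to end

g1 = {"bounds": [(0.0, 1.0), (0.0, 2.0)], "n": 8}
g2 = {"bounds": [(1.0, 2.0), (0.0, 2.0)], "n": 2}
print(volume(g1), relative_density(g1), share_common_face(g1, g2))  # 2.0 4.0 True
```

Grid adjacency (Definition 6) then follows by chaining `share_common_face` through an intermediate grid.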
B. Dynamic partitioning and incremental adjustment of the grid
Suppose x_i = {x_i1, x_i2, ..., x_id} is the i-th data point in data set D = {x_1, x_2, ..., x_n}; the data space is partitioned into grid units on each dimension. For the j-th dimensional value x_ij of data point x_i, if x_ij is not contained in any δ neighborhood partition on dimension j, we create the partition [x_ij − δ_j, x_ij + δ_j] according to the j-th dimensional radius δ_j, where x_ij − δ_j is the front part of the partition and x_ij + δ_j is the rear part. If the front part x_ij − δ_j intersects some existing partition on this dimension (say [x_kj − δ_j, x_kj + δ_j]), the partition [x_ij − δ_j, x_ij + δ_j] is adjusted to (x_kj + δ_j, x_ij + δ_j]. Similarly, if the rear part x_ij + δ_j intersects some existing partition (say [x_lj − δ_j, x_lj + δ_j]), the partition [x_ij − δ_j, x_ij + δ_j] is adjusted to [x_ij − δ_j, x_lj − δ_j).
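A one-dimensional sketch of this adjustment rule, assuming the partitions of a dimension are kept as a sorted list of (low, high) intervals (all names are illustrative):

```python
# Sketch of the dynamic partitioning rule above for a single dimension.
# Partitions are (low, high) tuples; open/closed endpoints are not tracked
# in this simplification. Names are illustrative, not from the paper.
def insert_point(partitions, x, delta):
    """If x is not covered, add [x - delta, x + delta], trimming the new
    interval wherever its front or rear part overlaps an existing partition."""
    for low, high in partitions:
        if low <= x <= high:
            return partitions            # x already covered: nothing to do
    new_low, new_high = x - delta, x + delta
    for low, high in partitions:
        if low <= new_low <= high:       # front part overlaps: start after `high`
            new_low = high
        if low <= new_high <= high:      # rear part overlaps: stop before `low`
            new_high = low
    partitions.append((new_low, new_high))
    partitions.sort()
    return partitions

parts = insert_point([], 5.0, 1.0)       # creates (4.0, 6.0)
parts = insert_point(parts, 6.5, 1.0)    # front part trimmed to start at 6.0
print(parts)  # [(4.0, 6.0), (6.0, 7.5)]
```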
III. IGRID
IGrid has two main steps. The first is to visit the data points in the data set and gradually form the grid structure by partitioning the data space dynamically and incrementally with the dimensional radii. The second is to execute grid clustering. IGrid can also adjust to the data set incrementally: for newly arriving data, it visits the data points in order, dynamically refreshes the grid structure, and then executes grid clustering.
In IGrid, each grid unit uses a six-tuple (CF1, CF2, T, T2, n, V) as its grid unit characteristics to record the statistics of the unit, and these characteristics can be adjusted according to requirements. In the six-tuple, CF1 and CF2 are both d-dimensional vectors which denote, for each dimension, the sum and the square sum of the data points respectively; T and T2 denote the sum and the square sum of the data points' arrival times; n denotes the number of data points in the grid unit; and V denotes the volume of the grid unit. The statistics of each grid unit can be expressed through this six-tuple. For example, the mean value of a grid unit is the d-dimensional vector CF1/n, and the relative density of the grid unit is n/V.
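A minimal sketch of such a grid-unit record, maintaining the six-tuple incrementally as points arrive (the class and field names are illustrative, assuming points are lists of floats):

```python
# Minimal sketch of the six-tuple grid-unit characteristics
# (CF1, CF2, T, T2, n, V); names are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GridUnit:
    d: int                      # dimensionality
    V: float                    # grid unit volume
    n: int = 0                  # number of data points
    T: float = 0.0              # sum of arrival times
    T2: float = 0.0             # square sum of arrival times
    CF1: List[float] = field(default_factory=list)  # per-dimension sum
    CF2: List[float] = field(default_factory=list)  # per-dimension square sum

    def __post_init__(self):
        self.CF1 = [0.0] * self.d
        self.CF2 = [0.0] * self.d

    def insert(self, x, t):
        """Absorb data point x arriving at time t into the statistics."""
        self.n += 1
        self.T += t
        self.T2 += t * t
        for j, xj in enumerate(x):
            self.CF1[j] += xj
            self.CF2[j] += xj * xj

    def mean(self):
        return [s / self.n for s in self.CF1]   # CF1 / n

    def relative_density(self):
        return self.n / self.V                  # n / V

g = GridUnit(d=2, V=4.0)
g.insert([1.0, 2.0], t=0.0)
g.insert([3.0, 4.0], t=1.0)
print(g.mean(), g.relative_density())  # [2.0, 3.0] 0.5
```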
A. Formation of grid structure
Each data point x_i = {x_i1, x_i2, ..., x_id} of data set D = {x_1, x_2, ..., x_n} is visited in order, one by one, and the grid structure is formed according to the given δ neighborhood radii. The δ neighborhood radii can be determined by experience or by a sampling method. The details are as follows:
Algorithm 1
Input: Data set D = {x_1, x_2, ..., x_n}, δ neighborhood radii {δ_1, δ_2, ..., δ_d}
Output: Grid structure G = {G_1, G_2, ..., G_d}
G = ∅;
M = D;
WHILE M ≠ ∅
{ read data point x_i from M in sequence;
  FOR each δ_j in δ
    IF x_ij is not contained in any existing partition on dimension j
    { P = [x_ij − δ_j, x_ij + δ_j];
      IF partition P intersects some existing partition
        IF P intersects the rear part of partition Q  // suppose Q is [x_kj − δ_j, x_kj + δ_j]
          adjust P to P = (x_kj + δ_j, x_ij + δ_j];
        ELSE IF P intersects the front part of partition Q
          adjust P to P = [x_ij − δ_j, x_kj − δ_j);
        ELSE  // P intersects both partition Q and partition L; suppose L is [x_lj − δ_j, x_lj + δ_j]
          adjust P to P = (x_kj + δ_j, x_lj − δ_j);
      G_j = G_j ∪ P;
      refresh the values of the grid characteristics; }
  M = M − x_i;
}
ENDWHILE
END
Algorithm 1 partitions the data space S dynamically with the δ neighborhood vector, which limits the number of grid units to a certain extent. However, because grid unit volumes differ, the relative density of a grid is used to decide whether a grid unit should be pruned, by comparing its relative density with the threshold ρ_0.
B. Grid clustering
When clustering the grid units of the grid structure formed by Algorithm 1, a cluster is the largest set of connected grid units whose relative density exceeds the threshold ρ_0. The clustering result is denoted by the set C = {C_1, C_2, ..., C_k}, which consists of the k clusters generated by the clustering process. The details are as follows:
Algorithm 2
Input: Grid structure G = {G_1, G_2, ..., G_d}, relative density threshold ρ_0
Output: Clustering result C = {C_1, C_2, ..., C_k}
G' = G;
cno = 0;  // cno denotes the number of clusters
WHILE G' ≠ ∅
{ Temp_G = ∅;
  get a grid unit g from G';
  IF (g.n / g.V) > ρ_0
  { cno++;
    Temp_G = Temp_G ∪ g;
    G' = G' − g;
    FOR each g' in G'
      IF (g'.n / g'.V) > ρ_0 AND g' is connected with some grid unit of Temp_G
      { Temp_G = Temp_G ∪ g';
        G' = G' − g';
        refresh the statistics information of Temp_G; }
    C_cno = Temp_G;  // C_cno is the cno-th cluster
  }
}
ENDWHILE
END
Before the clustering process begins, the grid units are sorted by relative density. Once the relative density of a grid unit falls below the threshold ρ_0, the grid units sorted after it need not be examined.
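The loop of Algorithm 2 amounts to growing each cluster from a dense seed unit by absorbing dense, connected neighbors, as in this sketch (the unit representation and all names are assumptions for illustration):

```python
# Sketch of Algorithm 2: grow each cluster from a dense grid unit by
# repeatedly absorbing dense, connected neighbors. Each unit is a dict with
# "n" (point count), "V" (volume), and "neighbors" (indices of adjacent
# units); these names are illustrative, not from the paper.
def grid_cluster(units, rho0):
    unassigned = set(range(len(units)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        if units[seed]["n"] / units[seed]["V"] <= rho0:
            continue  # sparse unit: cannot seed a cluster
        cluster, frontier = {seed}, [seed]
        while frontier:  # breadth-first growth over dense adjacent units
            g = frontier.pop()
            for nb in units[g]["neighbors"]:
                if nb in unassigned and units[nb]["n"] / units[nb]["V"] > rho0:
                    unassigned.remove(nb)
                    cluster.add(nb)
                    frontier.append(nb)
        clusters.append(cluster)
    return clusters

units = [
    {"n": 8, "V": 1.0, "neighbors": [1]},      # dense
    {"n": 6, "V": 1.0, "neighbors": [0, 2]},   # dense, adjacent to unit 0
    {"n": 1, "V": 1.0, "neighbors": [1]},      # sparse
]
print(grid_cluster(units, rho0=2.0))  # [{0, 1}]
```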
IV. ANALYSIS OF EXPERIMENTATION
The algorithms in the experiments are implemented in C++, and the experiments are conducted on a Pentium IV machine with a 2.0 GHz CPU, 512 MB of memory, an 80 GB hard disk, and the Windows XP operating system. The real-world wine recognition data set, collected by UCI, has 13 continuous attributes without missing values; the category of a wine is identified by analyzing its chemical attributes. We also run the algorithms on a synthetic data set in which each data point has 8 continuous attributes without missing values. In the IGrid experiments, the dimensional radii are determined by the sampling method. In STING, the equal partitioning interval of each dimension is determined by the mean distance between pairs of neighboring data points.
A. Quality evaluation
In this experiment, we cluster the wine recognition data set with the IGrid algorithm. As shown in Figure 1, the 178 instances are classified into 3 categories: C1: 59, C2: 71, C3: 48. On the wine recognition data set, the clustering result of IGrid is C1: 52, C2: 66, C3: 42.
Figure 1 Clustering accuracy (wine recognition)
Table 1 Clustering accuracy
Cluster  Wine  IGrid  Accuracy
C1       59    52     88.14%
C2       71    66     92.96%
C3       48    42     87.50%
The clustering accuracy of IGrid is shown in Table 1. As the IGrid algorithm executes, the grid structure forms gradually; IGrid becomes stable and maintains high accuracy. Furthermore, IGrid is not sensitive to outliers.
B. Runtime evaluation
In this experiment, we cluster the synthetic data set with both IGrid and STING. As shown in Figure 2, as the number of data points in the synthetic data set increases, both IGrid and STING need more time to cluster. For the same number of data points, the runtime of IGrid is less than that of STING.
C. Scalability evaluation
In this experiment, we cluster the synthetic data set at different dimensionalities. As the dimensionality of the data points increases, the runtime of IGrid increases linearly. As shown in Figure 3, IGrid scales well with the dimensionality of the data space.
Figure 2 Runtime comparison (synthetic data set)
Figure 3 Changes of dimensionality (synthetic data set)
V. CONCLUSION
In this paper, we propose an incremental grid-based clustering algorithm, IGrid. IGrid keeps the advantages of traditional grid-based clustering algorithms while partitioning the data space dynamically, and it has linear time complexity. Experimental results on both real and synthetic data sets show that IGrid surpasses traditional grid-based clustering algorithms in both clustering quality and efficiency.
IGrid can incrementally adjust the grid structure to implement incremental clustering as the number of data points in the data set grows. In IGrid, if the dimensional radius is too small, the efficiency of the algorithm suffers greatly; on the contrary, if the dimensional radius is too large, some small partitions arise which contain few data points, cannot exceed the threshold, and are ignored in the clustering process. How to determine the dimensional radii so as to form the grid structure and improve clustering quality is therefore a topic for future research.
REFERENCES
[1] Wang W, Yang J, Muntz R. STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd VLDB Conference. Athens, 1997: 186-195.
[2] Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the 24th VLDB Conference. New York, 1998: 428-439.
[3] Agrawal R, Gehrke J, Gunopulos D, et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD Conference. Seattle, WA, 1998: 94-105.
[4] Chen Ning, Chen An, Zhou Longxiang. An incremental grid density-based clustering algorithm. Journal of Software, 2002, 13(1): 1-7.
[5] Pilevar A H, Sukumar M. GCHL: a grid-clustering algorithm for high-dimensional very large spatial databases. Pattern Recognition Letters, 2005, 26: 999-1010.
[6] Fiolet V, Toursel B. Distributed data mining. Scalable Computing: Practice and Experience, 2005, 6(1): 99-109.
[7] Perez M S, Sanchez A, Robles V, et al. Design and implementation of a data mining grid-aware architecture. Future Generation Computer Systems, 2007, 23(1): 42-47.
