An Incremental Clustering Algorithm Based on Grid


Guohua Lei, Xiang Yu, Xianfei Yang, Shuang Chen
Department of Computer Science and Technology, Heilongjiang Institute of Technology
Department of Computer Science and Technology, Harbin Engineering University
Department of Economics and Management, Heilongjiang Nongken Vocational College
Harbin, Heilongjiang Province, China
Abstract: The existing grid-based clustering algorithms are analyzed. Grid-based clustering algorithms have the advantages of handling high-dimensional data and high efficiency; however, traditional grid-based algorithms are greatly influenced by the granularity of the grid partition. An incremental grid-based clustering algorithm, called IGrid, is proposed. IGrid retains the high efficiency of traditional grid-based clustering algorithms, and it also partitions the grid space by dimensional radius in a dynamic and incremental manner to improve the quality of clustering. Experiments on real and synthetic data sets show that IGrid outperforms traditional grid-based clustering algorithms in both speed and accuracy.
Key words: incremental; clustering; grid; data mining
I. INTRODUCTION
Clustering analysis is an important research topic in data mining. It can be used as an independent tool to discover the distribution of data, and some other algorithms use clustering analysis as a preprocessing step. Most traditional grid-based clustering algorithms run fast: their processing time is independent of the number of data objects and depends only on the number of units in each dimension of the data space [1,2].
Some representative grid-based clustering algorithms are STING, which processes statistical information stored in grid units; WaveCluster, which clusters data objects by a kind of wavelet transformation; and CLIQUE, which is a density- and grid-based clustering algorithm for high-dimensional spaces.
STING is a multi-resolution grid-based clustering technique that divides the data space into rectangular units. For different levels of resolution there exist rectangular units of different grades, which form a hierarchical structure, and the attribute statistics of each grid unit are computed and stored in advance. STING is highly efficient: for n data objects, once the hierarchical structure is formed with m grid units at the lowest level, the clustering time complexity of STING is O(n) and the query processing time is O(m). The clustering quality of STING mainly depends on the granularity of the lowest level of the grid structure. A fine granularity increases the processing cost, which becomes more pronounced as the dimensionality grows; however, if the granularity of the lowest level is too coarse, the clustering quality decreases.
WaveCluster is a multi-resolution clustering algorithm. It first imposes a multi-dimensional grid structure on the data space to summarize the data, then transforms the original feature space by wavelet transformation to find dense regions in the transformed space. WaveCluster can process large data sets efficiently and handles outliers well; it is not sensitive to the input order and does not require parameters such as a neighborhood radius.
CLIQUE integrates density-based and grid-based algorithms and can cluster high-dimensional data in large databases. Given a multi-dimensional data set, it distinguishes the dense units from the sparse units of the space in order to find the global distribution pattern of the data set. If the number of data points in a grid unit exceeds a certain input parameter, the unit is dense. CLIQUE is not sensitive to the input order of tuples and scales well as the dimensionality increases. However, when clustering in all dimensions, too many units appear, so the performance of CLIQUE may degrade, and the precision of the clustering result suffers from the method's oversimplification.
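As a toy illustration of the dense-unit test just described (not the original CLIQUE implementation), the following sketch maps points onto an equal-width grid and keeps the units whose point count exceeds an input parameter; the function and parameter names are illustrative assumptions:

```python
# Toy sketch of a CLIQUE-style dense-unit test: a grid unit is dense when
# the number of points that fall into it exceeds an input parameter.
# Names (dense_units, cell_size, threshold) are illustrative, not from CLIQUE.
from collections import Counter

def dense_units(points, cell_size, threshold):
    """Assign each point to an axis-aligned grid cell, then keep the dense cells."""
    counts = Counter(tuple(int(v // cell_size) for v in p) for p in points)
    return {cell for cell, count in counts.items() if count > threshold}

pts = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.1), (5.0, 5.0)]
print(dense_units(pts, cell_size=1.0, threshold=2))  # the cell (0, 0) holds 3 points
```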
However, traditional grid-based clustering algorithms are greatly influenced by the grid granularity: a coarse grid lowers the quality of clustering, while a fine grid lowers clustering efficiency. An incremental grid-based clustering algorithm, IGrid, is proposed in this paper. It retains the high efficiency of traditional grid-based clustering algorithms and partitions the data space dynamically and incrementally with the vector of dimensional radii in order to improve the quality of clustering.
In this paper, we first present the related definitions, then the idea of IGrid and its steps. The experimental results show that IGrid surpasses traditional grid-based clustering algorithms in both efficiency and clustering accuracy. Finally, we summarize the paper and point out directions for future research.
II. DEFINITIONS
A. Definitions
Suppose D = {x_1, x_2, ..., x_n} is a data set in d-dimensional data space S, where the subscript n represents the number of data points in the data set. Each data point x_i = {x_i1, x_i2, ..., x_ij, ..., x_id}, and x_ij (j = 1, 2, ..., d) denotes the j-th dimensional value of the i-th data point.
Definition 1 (δ neighborhood) Suppose x_ij is the j-th dimensional value of data point x_i in data set D. The j-th dimensional δ neighborhood of x_i is [x_ij − δ, x_ij + δ], a partition on dimension j whose center is x_ij, where δ is the dimensional radius of this dimension.
Definition 2 (δ neighborhood vector) δ = {δ_1, δ_2, ..., δ_d} is the δ neighborhood vector, which consists of the δ_i of each dimension; δ_i is the dimensional radius of dimension i.
Definition 3 (grid volume) V = ∏_{j=1}^{d} span_j is the product of all span_j, where span_j is the extent of the grid in dimension j.
Definition 4 (relative density of grid) The ratio of the number of data points n contained in a grid to the volume V of the grid, that is, n/V.
Definition 5 (grid common face) Suppose two grids g1 and g2 are identical in d − 1 dimensions and, in the remaining dimension, are different but adjacent to each other; then there is a grid common face between g1 and g2.
Definition 6 (grid adjacency) If there is a grid common face between g1 and g2, or there exists a grid g3 such that g1 and g3 share a grid common face and g2 and g3 share a grid common face, then g1 and g2 are grid adjacent. Following the definition of cluster in the CLIQUE algorithm [3], a micro-cluster is the largest set consisting of all adjacent grids whose relative density exceeds a certain threshold ρ_0.
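Definitions 3-5 can be made concrete with a small sketch. It assumes each grid is stored as axis-aligned per-dimension (low, high) bounds plus a point count; all names are illustrative, not part of IGrid:

```python
# Sketch of Definitions 3-5: grid volume, relative density, and grid common face.
# A grid is a dict with "bounds" (a list of per-dimension (low, high) pairs)
# and "n" (number of data points); these names are assumptions for illustration.

def volume(grid):
    """Definition 3: product of the grid's extent (span) over all dimensions."""
    v = 1.0
    for low, high in grid["bounds"]:
        v *= high - low
    return v

def relative_density(grid):
    """Definition 4: number of contained points divided by grid volume."""
    return grid["n"] / volume(grid)

def share_common_face(g1, g2):
    """Definition 5: equal in d-1 dimensions, adjacent in the remaining one."""
    differing = [j for j in range(len(g1["bounds"]))
                 if g1["bounds"][j] != g2["bounds"][j]]
    if len(differing) != 1:
        return False
    (l1, h1) = g1["bounds"][differing[0]]
    (l2, h2) = g2["bounds"][differing[0]]
    return h1 == l2 or h2 == l1  # the intervals touch end to end

g1 = {"bounds": [(0.0, 1.0), (0.0, 2.0)], "n": 8}
g2 = {"bounds": [(1.0, 2.0), (0.0, 2.0)], "n": 2}
print(volume(g1), relative_density(g1), share_common_face(g1, g2))  # 2.0 4.0 True
```

Grid adjacency (Definition 6) then follows by chaining `share_common_face` through an intermediate grid.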
B. Dynamic partitioning and incremental adjustment of the grid
Suppose x_i = {x_i1, x_i2, ..., x_id} is the i-th data point in data set D = {x_1, x_2, ..., x_n}; the data space is partitioned into grid units on each dimension. For the j-th dimensional value x_ij of data point x_i, if x_ij is not contained in any δ neighborhood partition on dimension j, we create the partition [x_ij − δ_j, x_ij + δ_j] according to the j-th dimensional radius δ_j, where x_ij − δ_j is the front part of the partition and x_ij + δ_j is the rear part. If the front part x_ij − δ_j intersects some existing partition on this dimension (say [x_kj − δ_j, x_kj + δ_j]), the partition [x_ij − δ_j, x_ij + δ_j] is adjusted to (x_kj + δ_j, x_ij + δ_j]. Similarly, if the rear part x_ij + δ_j intersects some existing partition (say [x_lj − δ_j, x_lj + δ_j]), the partition [x_ij − δ_j, x_ij + δ_j] is adjusted to [x_ij − δ_j, x_lj − δ_j).
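A one-dimensional sketch of this adjustment rule, assuming the partitions of a dimension are kept as a sorted list of (low, high) intervals (all names are illustrative):

```python
# Sketch of the dynamic partitioning rule above for a single dimension.
# Partitions are (low, high) tuples; open/closed endpoints are not tracked
# in this simplification. Names are illustrative, not from the paper.
def insert_point(partitions, x, delta):
    """If x is not covered, add [x - delta, x + delta], trimming the new
    interval wherever its front or rear part overlaps an existing partition."""
    for low, high in partitions:
        if low <= x <= high:
            return partitions            # x already covered: nothing to do
    new_low, new_high = x - delta, x + delta
    for low, high in partitions:
        if low <= new_low <= high:       # front part overlaps: start after `high`
            new_low = high
        if low <= new_high <= high:      # rear part overlaps: stop before `low`
            new_high = low
    partitions.append((new_low, new_high))
    partitions.sort()
    return partitions

parts = insert_point([], 5.0, 1.0)       # creates (4.0, 6.0)
parts = insert_point(parts, 6.5, 1.0)    # front part trimmed to start at 6.0
print(parts)  # [(4.0, 6.0), (6.0, 7.5)]
```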
III. IGRID
IGrid has two main steps. The first is to visit the data points in the data set and gradually form the grid structure by partitioning the data space dynamically and incrementally with the dimensional radii. The second is to execute grid clustering. IGrid can also adjust to the data set incrementally: for newly arriving data, it visits the data points in order, dynamically refreshes the grid structure, and then executes grid clustering.
In IGrid, each grid unit uses a six-tuple (CF1, CF2, T, T2, n, V) as its grid unit characteristics to record the statistics of the unit, and these characteristics can be adjusted according to requirements. In the six-tuple, CF1 and CF2 are both d-dimensional vectors which denote, for each dimension, the sum and the square sum of the data points respectively; T and T2 denote the sum and the square sum of the data points' arrival times; n denotes the number of data points in the grid unit; and V denotes the volume of the grid unit. The statistics of each grid unit can be expressed through this six-tuple. For example, the mean value of a grid unit is the d-dimensional vector CF1/n, and the relative density of the grid unit is n/V.
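A minimal sketch of such a grid-unit record, maintaining the six-tuple incrementally as points arrive (the class and field names are illustrative, assuming points are lists of floats):

```python
# Minimal sketch of the six-tuple grid-unit characteristics
# (CF1, CF2, T, T2, n, V); names are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class GridUnit:
    d: int                      # dimensionality
    V: float                    # grid unit volume
    n: int = 0                  # number of data points
    T: float = 0.0              # sum of arrival times
    T2: float = 0.0             # square sum of arrival times
    CF1: List[float] = field(default_factory=list)  # per-dimension sum
    CF2: List[float] = field(default_factory=list)  # per-dimension square sum

    def __post_init__(self):
        self.CF1 = [0.0] * self.d
        self.CF2 = [0.0] * self.d

    def insert(self, x, t):
        """Absorb data point x arriving at time t into the statistics."""
        self.n += 1
        self.T += t
        self.T2 += t * t
        for j, xj in enumerate(x):
            self.CF1[j] += xj
            self.CF2[j] += xj * xj

    def mean(self):
        return [s / self.n for s in self.CF1]   # CF1 / n

    def relative_density(self):
        return self.n / self.V                  # n / V

g = GridUnit(d=2, V=4.0)
g.insert([1.0, 2.0], t=0.0)
g.insert([3.0, 4.0], t=1.0)
print(g.mean(), g.relative_density())  # [2.0, 3.0] 0.5
```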
A. Formation of grid structure
Each data point x_i = {x_i1, x_i2, ..., x_id} of data set D = {x_1, x_2, ..., x_n} is visited in order, one by one, and the grid structure is formed according to the given δ neighborhood radii. The δ neighborhood radii can be determined by experience or by a sampling method. The details are as follows:
Algorithm 1
Input: Data set D = {x_1, x_2, ..., x_n}, δ neighborhood radii {δ_1, δ_2, ..., δ_d}
Output: Grid structure G = {G_1, G_2, ..., G_d}
G = ∅;
M = D;
WHILE M ≠ ∅
{ read data point x_i from M in sequence;
  FOR each δ_j in δ
    IF x_ij is not contained in any existing partition on dimension j
    { P = [x_ij − δ_j, x_ij + δ_j];
      IF partition P intersects some existing partition
        IF P intersects the rear part of partition Q  // suppose Q is [x_kj − δ_j, x_kj + δ_j]
          adjust P to P = (x_kj + δ_j, x_ij + δ_j];
        ELSE IF P intersects the front part of partition Q
          adjust P to P = [x_ij − δ_j, x_kj − δ_j);
        ELSE  // P intersects both partition Q and partition L; suppose L is [x_lj − δ_j, x_lj + δ_j]
          adjust P to P = (x_kj + δ_j, x_lj − δ_j);
      G_j = G_j ∪ P;
      refresh the values of the grid characteristics; }
  M = M − x_i;
}
ENDWHILE
END
Algorithm 1 partitions the data space S dynamically with the δ neighborhood vector, which limits the number of grid units to a certain extent. However, because grid unit volumes differ, the relative density of a grid is used to decide whether a grid unit should be pruned, by comparing its relative density with the threshold ρ_0.
B. Grid clustering
When clustering the grid units of the grid structure formed by Algorithm 1, a cluster is the largest set of connected grid units whose relative density exceeds the threshold ρ_0. The clustering result is denoted by the set C = {C_1, C_2, ..., C_k}, which consists of the k clusters generated by the clustering process. The details are as follows:
Algorithm 2
Input: Grid structure G = {G_1, G_2, ..., G_d}, relative density threshold ρ_0
Output: Clustering result C = {C_1, C_2, ..., C_k}
G' = G;
cno = 0;  // cno denotes the number of clusters
WHILE G' ≠ ∅
{ Temp_G = ∅;
  get a grid unit g from G';
  IF (g.n / g.V) > ρ_0
  { cno++;
    Temp_G = Temp_G ∪ g;
    G' = G' − g;
    FOR each g' in G'
      IF (g'.n / g'.V) > ρ_0 AND g' is connected with some grid unit of Temp_G
      { Temp_G = Temp_G ∪ g';
        G' = G' − g';
        refresh the statistics information of Temp_G; }
    C_cno = Temp_G;  // C_cno is the cno-th cluster
  }
}
ENDWHILE
END
Before the clustering process begins, the grid units are sorted by relative density. Once the relative density of a grid unit falls below the threshold ρ_0, the grid units sorted after it need not be examined.
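The loop of Algorithm 2 amounts to growing each cluster from a dense seed unit by absorbing dense, connected neighbors, as in this sketch (the unit representation and all names are assumptions for illustration):

```python
# Sketch of Algorithm 2: grow each cluster from a dense grid unit by
# repeatedly absorbing dense, connected neighbors. Each unit is a dict with
# "n" (point count), "V" (volume), and "neighbors" (indices of adjacent
# units); these names are illustrative, not from the paper.
def grid_cluster(units, rho0):
    unassigned = set(range(len(units)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        if units[seed]["n"] / units[seed]["V"] <= rho0:
            continue  # sparse unit: cannot seed a cluster
        cluster, frontier = {seed}, [seed]
        while frontier:  # breadth-first growth over dense adjacent units
            g = frontier.pop()
            for nb in units[g]["neighbors"]:
                if nb in unassigned and units[nb]["n"] / units[nb]["V"] > rho0:
                    unassigned.remove(nb)
                    cluster.add(nb)
                    frontier.append(nb)
        clusters.append(cluster)
    return clusters

units = [
    {"n": 8, "V": 1.0, "neighbors": [1]},      # dense
    {"n": 6, "V": 1.0, "neighbors": [0, 2]},   # dense, adjacent to unit 0
    {"n": 1, "V": 1.0, "neighbors": [1]},      # sparse
]
print(grid_cluster(units, rho0=2.0))  # [{0, 1}]
```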
IV. ANALYSIS OF EXPERIMENTATION
The algorithms in the experiments are implemented in C++, and the experiments are conducted on a Pentium IV machine with a 2.0 GHz CPU, 512 MB of memory, an 80 GB hard disk, and the Windows XP operating system. The real-world wine recognition data set, collected by UCI, has 13 continuous attributes without missing values; the category of a wine is identified by analyzing its chemical attributes. We also run the algorithms on a synthetic data set in which each data point has 8 continuous attributes without missing values. In the IGrid experiments, the dimensional radii are determined by the sampling method. In STING, the equal partitioning interval of each dimension is determined by the mean distance between pairs of neighboring data points.
A. Quality evaluation
In this experiment, we cluster the wine recognition data set with the IGrid algorithm. As shown in Figure 1, the 178 instances are classified into 3 categories: C1: 59, C2: 71, C3: 48. On the wine recognition data set, the clustering result of IGrid is C1: 52, C2: 66, C3: 42.
Figure 1 Clustering accuracy (wine recognition)
Table 1 Clustering accuracy
Cluster  Wine  IGrid  Accuracy
C1       59    52     88.14%
C2       71    66     92.96%
C3       48    42     87.50%
The clustering accuracy of IGrid is shown in Table 1. As the IGrid algorithm executes, the grid structure forms gradually; IGrid becomes stable and maintains high accuracy. Furthermore, IGrid is not sensitive to outliers.
B. Runtime evaluation
In this experiment, we cluster the synthetic data set with both IGrid and STING. As shown in Figure 2, as the number of data points in the synthetic data set increases, both IGrid and STING need more time to cluster. For the same number of data points, the runtime of IGrid is less than that of STING.
C. Scalability evaluation
In this experiment, we cluster the synthetic data set at different dimensionalities. As the dimensionality of the data points increases, the runtime of IGrid increases linearly. As shown in Figure 3, IGrid scales well with the dimensionality of the data space.
Figure 2 Runtime comparison (synthetic data set)
Figure 3 Changes of dimensionality (synthetic data set)
V. CONCLUSION
In this paper, we propose an incremental grid-based clustering algorithm, IGrid. IGrid keeps the advantages of traditional grid-based clustering algorithms while partitioning the data space dynamically, and it has linear time complexity. Experimental results on both real and synthetic data sets show that IGrid surpasses traditional grid-based clustering algorithms in both clustering quality and efficiency.
IGrid can incrementally adjust the grid structure to implement incremental clustering as the number of data points in the data set grows. In IGrid, if the dimensional radius is too small, the efficiency of the algorithm suffers greatly; on the contrary, if the dimensional radius is too large, some small partitions arise which contain few data points, cannot exceed the threshold, and are ignored in the clustering process. How to determine the dimensional radii so as to form the grid structure and improve clustering quality is therefore a topic for future research.
REFERENCES
[1] Wang W, Yang J, Muntz R. STING: a statistical information grid approach to spatial data mining. In: Proceedings of the 23rd VLDB Conference. Athens, 1997: 186-195.
[2] Sheikholeslami G, Chatterjee S, Zhang A. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proceedings of the 24th VLDB Conference. New York, 1998: 428-439.
[3] Agrawal R, Gehrke J, Gunopulos D, et al. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD Conference. Seattle, WA, 1998: 94-105.
[4] Chen Ning, Chen An, Zhou Longxiang. An incremental grid density-based clustering algorithm. Journal of Software, 2002, 13(1): 1-7.
[5] Pilevar A H, Sukumar M. GCHL: a grid-clustering algorithm for high-dimensional very large spatial databases. Pattern Recognition Letters, 2005, 26: 999-1010.
[6] Fiolet V, Toursel B. Distributed data mining. Scalable Computing: Practice and Experience, 2005, 6(1): 99-109.
[7] Perez M S, Sanchez A, Robles V, et al. Design and implementation of a data mining grid-aware architecture. Future Generation Computer Systems, 2007, 23(1): 42-47.
