ESOM-Maps:tools for clustering,visualization, and classification with Emergent SOM
Alfred Ultsch,Fabian M¨o rchen
Data Bionics Rearch Group,University of Marburg
D-35032Marburg,Germany
March17,2005
Abstract
An overview on the usage of emergent lf organizing maps is given.
U-Maps visualize the distance structures of high dimensional data ts.
P-Maps show their density structures and U*-Maps combine the advan-
tages of the mentioned maps to a visualization suitable to detect non-
trivial cluster structures.A conci summary on the usage of Emergent
Self-organizing Maps(ESOM)for data mining is given.The tasks of vi-
sualization,clustering,and classification as they can be performed with
the Databionics ESOM Tools are described.
1Introduction
The power of lf-organization that allows the emergence of structure in data is often neglected.We think this is in part due to a misusage of Self-organizing Maps that is widely spread in the scientific literature.The maps ud by most authors are usually very small,consisting of somewhere between a few and some tens of [1,2,3]).Also,the concept of boundless id maps[4])to avoid border effects is rarely ud.Using small SOM is almost identical to a k-Means clustering with k equal to the number of nodes in the map.The topology prervation of the SOM projection is of little u when using small maps.Emergent phenomena involve by definition a large number of individuums,where large means at least a few thousands.This is why we u large SOMs and called them Emergent Self-Organizing Maps(ESOM)to emphasize the distinction.It has been demonstrated,that using ESOM is a significantly different process from using k-Means[5].
The purpo of this article is to give a conci summary on the usage of Emergent Self-organizing Maps(ESOM)for data mining.We describe the tasks We thank Mario N¨o cker,Christian Stamm,Michael Thies,and Martin K¨u mmerer for programming most of the Databionics ESOM Tools.
1
of visualization,clustering,and classification as they can be performed with our new software-the Databionics ESOM Tools.
2Map Architecture
If M⊂R2the map has planar topology.In a quadgrid arrangement the number of immediate neighbors of a neuron is4,for hexgrids there are6immediate neighbors.
Note,that the number of immediate neighbors is less at the borders of the grid.In the map spaces border effects occur,enlargening the probability of topology errors[6].To avoid such border effects,grids can be embedded in a finite but boundless a sphere or the toroid PAC-man space(PMS)(e [4]).In PMS the top row is connected to the bottom row and the left column to the right column within the lattice.
If the number of rows and columns in a2D grid is equal,the map is called square,otherwi rectangular.The ratio of rows to columns is propod to be chon corresponding to the ratio of thefirst and cond eigenvalues of the covariance matrix in[7].An experimental analysis of topology errors showed, however,that the ratio of rows and columns should be chon different from unity even when no dominant direction of variance exists[6].
We recommend the following ESOM architecture:boundless toroid grids with at least4000neurons and a ratio of rows and columns different from unity. This avoids border effects,topology errors,and enables an intuitive undistorted visualization.
3Visualization
wolf
The result of ESOM training[7]is a low dimensional grid of high dimensional prototype vectors.The positions of the bestmatches for the data points alone does often not offer an intuitive visualization of the structures prent in the high dimensional space.Additional methods are needed to visualize the structures, the most common being height values forming a3D landscape on top of the grid.
The example to demonstrate the visualizations is taken from[8].It consists of a2dimensional datat typical for sonar applications.The dimensions are ENGY and TIME,the data t is a mixture of two G
aussians with2048points each(e Figure1(a)).
Distance-bad Visualization:The U-Matrix[9]is the canonical display of ESOM.The local distance structure is displayed at each neuron as a height value creating a3D landscape of the high dimensional data space.The height is calculated as the sum of the distances to all immediate neighbors normalized by the largest occurring height.This value will be large in areas where no or few data points reside,creating mountain ranges for cluster boundaries.The sum will be small in areas of high densities,thus clusters are depicted as val-
2
leys.Figure1(b)shows the U-Matrix of the EngyTime datat,darker colors correspond to large distances.
Density-bad Visualization:While distance-bad methods usually work well for clearly parated clusters,problems can occur with slowly changing den-sities and overlapping clusters.Density-bad methods more directly measure the density in the data space sampled at the prototype vectors.The P-Matrix[4] displays the local density measures with the Pareto Density Estimation(PDE) [10],an information optimal kernel density estimation.Figure1(c)shows the P-Matrix of the EngyTime datat,
darker colors correspond to larger densities.
Distance-and Density-bad Visualization:In den regions of the data space the local distances depicted in an U-Matrix are presumably distances measured inside a cluster.Such distances may be disregarded for the purpo of clustering.In thin populated regions of the data space,however,the distances do matter.In this ca the U-Matrix heights correspond to cluster boundaries. This leads to the definition of an U*-Matrix[11]which combines the distance bad U-Matrix and the density bad P-Matrix.The values of the U-Matrix are dampened in highly den regions,unchanged in regions of average density, and emphasized in spar regions.
pvp是什么意思
The advantage of the U*-Matrix over the U-Matrix in datats with clusters that are not clearly parated in the high dimensional space can be e from the example.The U*-Matrix in Figure1(d)shows clearly the two Gaussians of the EngyTime datat.The U-Matrix in Figure1(b),however,may mislead a clustering.
Topology-bad Visualizations:There are two kinds of topographical errors that have to be considered.The forward projection error(FPE)occurs when a pair of similar data points is assigned to a distant pair of positions on the map.The backward projection error(BPE)corresponds to a pair of cl
o grid positions being the image of two distant data points.The latter can be visualized using Zrehen’s Measure[12],for the former the Minimal-U-Path[6] can be ud.
Bestmatch Visualizations:A display of the bestmatches can be placed on top of the images created by the above mentioned methods.Each neuron that is a bestmatch for at least one data point can be marked with a point.The point can be colored indicating a known or chon cluster membership.
4From Matrix to Map大耳朵英语网
For boundless grids the visualizations should be viewed in tiled mode[4],dis-playing four adjacent copies of the grid.Clusters stretching over the edge of the grid image can then be en connected.Unfortunately every data point and cluster is shown in4different places.This can be compensated by extracting a map from the tiled view removing the redundancies[4].The obvious rem-blance with geographical landscapes led us to call the displays maps.For an U-Matrix we get an U-Map and similar P-Maps and U*-Maps.The t of all the maps is called ESOM-Maps.The remblance of ESOM-Maps to geo-
gas3
(a)Data (b)U-Matrix
(c)P-Matrix (d)U*-Matrix
Figure 1:ESOM visualizations for 2clusters in the EngyTime datat
graphical maps or landscapes may be enhanced by computer graphical means such as texturing,coloring and lightening.The Figure 2(a)shows a tiled view of an U*-Matrix with 3clusters.Figure 2(b)shows the non-redundant U*-Map display extracted from the center of the tiled view.
5Clustering
The clustering of the ESOM can be performed at two different levels.First,the bestmatches and thus the corresponding data points can be manually grouped into veral clusters.Not all points need to be labeled,outliers are usually eas-ily detected and can be removed.The cluster membership can be visualized by a coloring of the bestmatches.Secondly,the neurons can be clustered.This way regions on the map reprenting a cluster can be identified and ud for classification of new data (e Section 6).The regions can be visualized by mi-transparent colored regions in order not to overlay the background visu-alizations completely.The visualization can show whether there actually is a cluster structure.Outlying data points will also be easily detectable.In con-trast,the popular k -Means algorithm will always converge to the given number of clusters no matter whether the data supports this structure.Outliers will be handled like any other data point.
4
(a)Tiled U*-Matrix(b)Extracted U*-Map
Figure2:ESOM visualizations for3clusters in the Skating data
6Classification
Labeling all(or most)neurons instead of only the bestmatches creates a sub-symbolic classificator similar to a k-nearest neighbor(KNN)classifier with k=1 that can be applied to new data automatically.The main difference to KNN is,that the ur can u the visualization of the ESOM to create the labeling whereas KNN does not offer this convenience.Further KNN classification al-ways classifies a point,no matter how near(or rather far)the neighbors are. In contrast,ESOM classification offers a don’t know class by leaving neurons for sparly populated regions parating clusters.
keep upWe have successfully ud ESOM classification to detect movement phas in a multivariate time ries from Inline-Speed Skating[13,14].Bad on the results of a single skating speed we tried to extend the analysis to six different speeds.Each datat contained30K6-dimensional vectors.We ud a pooled sample from all speeds for training.Three clusters were detected,the corre-sponding regions on the map were labeled.The U*-Matrix and the U*-Map are shown in Figure2(a)and Figure2(b),respectively.The six full datats were then classified into the three movement phas by
bestmatch projection on the map.The resulting gmentation of the time ries was labeled by an expert into the movement phas glide,push,and swing.
7Databionics ESOM Tools
Most of the described methods for visualization,clustering,and classification have been implemented in a software tool called Databionics ESOM Tools1.The main part is a graphical front-end to perform tasks like visualization,outlier removal,clustering of bestmatches and neurons,creation of non-redundant map views.The visualization of the maps is designed in a modular way.The different background displays of the map can be applied to subts of the data features and can be combined with foreground displays of the bestmatches.Several color 1databionic-esom.sf
5
bca
Figure 3:U-Matrix during ESOM training for 7clusters (Epochs 0+k ∗5)gradients with an adjustable number of colors are available as well as contour rendering.Even the training process can be visualized by creating a slide show of the map unfolding in the data space (e Figure 3).The training,rendering,projection,and classification tasks can also be run from the command line to allow an automatic analysis of large data ts.
The programs are written in Java and the source code is available under the GPL[15].We welcome other scientists to get involved in the further development of the tool.
8Summary
Emergent SOMs are a powerful tool for clustering and classification.To exploit nontrivial emergent phenomena,however,large maps must be ud.Only large maps can adequately project complicated structures onto the two dimensional map space.Distance-bad and density-bad visualization of ESOM can be ud to analyze the data for possible clusters.A (partly)labeled ESOM rves as a sub-symbolic classificator in the spirit of KNN classification.Many ESOM related tasks can be performed with the freely available Databionics ESOM Tools .
rad
References
[1] C.Li,P.S.Yu,and V.Castelli.MALM:a framework for mining quence databa at multiple abstraction levels.In Proc.of the 7th ACM CIKM ,pages 267–272,Bethesda,MD,1998.country road take me home
[2]T.C.Fu,F.L.Chung,V.Ng,and R.Luk.Pattern Discovery from Stock Time Series Using Self-Organizing Maps.Workshop Notes of KDD2001Workshop on Temporal Data Mining,San Francisco ,pages 27–37,2001.
秘密花园2
[3] E.Pampalk,A.Rauber,and D.Merkl.Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps.In Proc.ICANN’02,Madrid,Spain,August 27-302002.Springer.
[4] A.Ultsch.Maps for the Visualization of high dimensional Data Spaces.In Proc.
WSOM’03,Japan ,2003.[5] A.Ultsch.Self Organizing Neural Networks perform different from statistical k-means clustering.In Proc.GfKl 1995,Bal,Swiss .
[6] A.Ultsch and L.Herrmann.Architecture of emergent lf-organizing maps to reduce projection errors.In Proc.ESANN,Bruges,Belgium ,2005.
[7]T.Kohonen.Self-Organizing Maps .Springer,1995.
[8]P.M.Baggenstoss.Statistical Modeling Using Gaussian Mixtures and HMMs with Mat-lab,www.npt.nuwc.navy.mil/Csf/htmldoc/pdf/pdf.html ,2002.
神探夏洛克第3季6