Fast algorithm for detecting community structure in networks

更新时间:2023-07-18 08:50:14 阅读: 评论:0

Fast algorithm for detecting community structure in networks
M.E.J.Newman
Department of Physics and Center for the Study of Complex Systems,University of Michigan,Ann Arbor,Michigan48109-1120,USA (Received22September2003;revid manuscript received22March2004;published18June2004)
Many networks display community structure—groups of vertices within which connections are den but
between which they are sparr—and nsitive computer algorithms have in recent years been developed for
detecting this structure.The algorithms,however,are computationally demanding,which limits their appli-
cation to small networks.Here we describe an algorithm which gives excellent results when tested on both
computer-generated and real-world networks and is much faster,typically thousands of times faster,than
previous algorithms.We give veral example applications,including one to a collaboration network of more
than50000physicists.
DOI:10.1103/PhysRevE.69.066133PACS number(s):89.75.Hc,05.10.Ϫa,87.23.Ge,89.20.Hh
I.INTRODUCTION
There has in recent years been a surge of interest within
the physics community in the properties of networks of many
kinds,including the Internet,the world wide web,citation
少先队员敬礼networks,transportation networks,software call graphs,
email networks,food webs,and social and biochemical net-
works[1–4].One property that has attracted particular atten-
tion is that of“community structure”:the vertices in net-
works are often found to cluster into tightly knit groups with
a high density of within-group edges and a lower density of
between-group edges.Girvan and Newman[5,6]propod a
computer algorithm bad on the iterative removal of edges
with high“betweenness”scores that appears to identify such
structure with some nsitivity,and this algorithm has been
employed by a number of authors in the study of such di-
ver systems as networks of email messages,social net-
works of animals,collaborations of jazz musicians,meta-
bolic networks,and gene networks[5–11].As pointed out by
Newman and Girvan[6],the principal disadvantage of their
algorithm is the high computational demands it makes.In its
simplest and fastest form,it runs in worst-ca time O͑m2n͒on a network with m edges and n vertices,or O͑n3͒on a spar network(one for which m scales with n in the limit of
large n,which covers esntially all networks of current sci-
entific interest,with the possible exception of food webs).
With typical computer resources available at the time of writ-
ing,this limits the algorithm’s u to networks of a few thou-
sand vertices at most,and substantially less than this for
interactive applications.Increasingly,however,there is inter-
est in the study of much larger networks;citation and col-
laboration networks can contain millions of vertices[12,13],
for example,while the world wide web numbers in the bil-
lions[14].
In this paper,therefore,we propo another algorithm for
detecting community structure.The algorithm operates on
different principles from that of Girvan and Newman(GN),
but,as we will show,gives qualitatively similar results.The
worst-ca running time of the algorithm is O(͑m+n͒n),or O͑n2͒on a spar graph.In practice,it runs to completion on current computers in reasonable times for networks of up to a million or so vertices,bringing within reach the study of communities in many systems that would previously have been considered intractable.
II.THE ALGORITHM
Our algorithm is bad on the idea of modularity.Given any network,the GN community structure algorithm always produces some division of the vertices into communities,re-gardless of whether the network has any natural such divi-sion.To test whether a particular division is meaningful,we define a quality function or“modularity”Q as follows[6].
Let e ij be one-half of the fraction of edges in the network that connect vertices in group i to tho in group j,so that the total fraction of such edges is e ij+e ji.The only exception will be the diagonal elements e ii,which are equal to the fraction of edges that fall within group i(with no factor of a half).Then͚i e ii is the total fraction of edges that fall within groups.All other edges fall between groups.The maximum value of this sum is1,and a division of the network into communities is good if this quantity is large,meaning it is of order1.On its own,however,the sum is not a good measure of community structure,since it takes its maximal value of1 if we put all vertices in a single group together,which is a trivial and not particularly uful form of community struc-ture.
A more uful measure of community structure is to cal-culate the sum͚i e ii and then subtract from it the value that it would take if edges were placed at random.Such a measure gives a score of zero to the trivial grouping with only a single community,but nonzero scores to nontrivial group-ings.
Let a i be the fraction of all ends of edges that are attached to vertices in group i.We can calculate a i straightforwardly by noting that a i=͚j e ij.If the ends of edges are connected together at random,the fraction of the resulting edges that connect vertices within group i is a i2.We define the modu-larity to be
Q=͚i͑e ii−a i2͒.͑1͒If a particular division gives no more within-community edges than would be expected by random chance,this modu-
PHYSICAL REVIEW E69,066133(2004)
larity is Q =0.Values other than 0indicate deviations from randomness,and in practice values greater than about 0.3appear to indicate significant community structure.A number of examples are given in Ref.[6].
But this now suggests an alternative approach to finding community structure.If a high value of Q reprents a good community division,why not simply optimize Q over all possible divisions to find the best one?By doing this,we can avoid the iterative removal of edges and cut straight to the cha.The problem is that true optimization of Q is very costly.The number of ways to divide n vertices into g non-empty groups is given by the Stirling number of the cond
kind S n ͑g ͒
,and hence the number of distinct community divi-sions is ͚g =1n S n ͑g ͒.This sum is not known in clod form,but we obrve that S n ͑1͒+S n ͑2͒=2
n −1for all n Ͼ1,so that the sum must increa at least exponentially in n .To carry out an exhaustive arch of all possible divisions for the optimal value of Q would therefore take at least an exponential amount of time,and is in practice infeasible for systems larger than 20or 30vertices.Various approximate optimiza-tion methods are available:simulated annealing,genetic al-gorithms,and so forth.Here we consider a scheme bad on a standard “greedy”optimization algorithm,which appears to perform well.
Our algorithm falls in the general category of agglomera-tive hierarchical clustering methods [15,16].Starting with a state in which each vertex is the sole member of one of n communities,we repeatedly join communities together in pairs,choosing at each step the join that results in the great-est increa (or smallest decrea )in Q .The progress of the algorithm can be reprented as a “dendrogram,”a tree that shows the order of the joins (e Fig.2for an example ).Cuts through this dendrogram at different levels give divisions of the network into larger or smaller numbers of communities and we can lect the best cut by looking for the maximal value of Q .
Since the joining of a pair of communities between which there are no edges at all can never result in an increa in Q ,we need only consider tho pairs between which there are edges,of which there will at any time be at most m ,where m is again the number of edges in the graph.The change in Q upon joining two communities is given by
渲组词⌬Q =e ij +e ji −2a i a j =2͑e ij −a i a j ͒,
͑2͒
which can clearly be calculated in constant time.The quan-tities e ij are initially equal to one-half of the corresponding
elements of the adjacency matrix of the ,to 1
2for vertex pairs that are joined by an edge and 0for tho that are not.Following a join,some of the matrix elements e ij must be updated by adding together the rows and columns corresponding to the joined communities,which takes worst-ca time O ͑n ͒.Thus each step of the algorithm takes worst-ca time O ͑m +n ͒.There are a maximum of n −1join op-erations necessary to construct the complete dendrogram and hence the entire algorithm runs in time O (͑m +n ͒n ),or O ͑n 2͒on a spar g
空调孔raph.The algorithm has the added advantage of calculating the value of Q as it goes along,making it espe-cially simple to find the optimal community structure.
It is worth noting that our algorithm can be generalized trivially to weighted networks in which each edge has a nu-meric strength associated with it,by making the initial values of the matrix elements e ij equal to (a half of )tho strengths;otherwi the algorithm is as above and has the same running time.The networks studied in this paper,however,are all unweighted.行政招聘
III.APPLICATIONS
As a first example of the working of our algorithm,we have generated using a computer a large number of random graphs with known community structure,which we then run through the algorithm to quantify its performance.Each graph consists of n =128vertices divided into four groups of 32.Each vertex has on average z in edges connecting it to members of the same group and z out edges to members of other groups,with z in and z out chon such that the total expected degree z in +z out =16,in this ca.As z out is incread from small values,the resulting community structure be-comes progressively weaker and the graphs po greater and greater challenges to the community-finding algorithm.In Fig.1we show the fraction of vertices correctly assigned to the four co
mmunities by the algorithm as a function of z out [19].As the figure shows,the algorithm performs well,cor-rectly identifying more than 90%of vertices for values of z out Շ6.Only when z out approaches the value 8at which the number of within-and between-community edges per vertex is the same does the algorithm begin to fail.On the same plot we also show the performance of the GN algorithm and,as we can e,that algorithm performs slightly but measurably better for smaller values of z out .For example,for z out =5our algorithm correctly identifies an average of 97.4(2)%of ver-tices,while the older algorithm correctly identifies 98.9(1)%.Both,however,clearly perform well.
Interestingly,for higher values of z out our algorithm per-forms better than the older one,and we have come across a few real-world networks in which this is the ca also.
苹果手机怎么强制开机
Nor-
FIG.1.The fraction of vertices correctly identified by our algo-rithms in the computer-generated graphs described in the text.The two curves show results for our algorithm (circles )and for the al-gorithm of Girvan and Newman [5](squares ).Each point is an average over 100graphs.
M.E.J.NEWMAN PHYSICAL REVIEW E 69,066133(2004)
mally,however,the GN algorithm ems to have the edge,and this should come as no great surpri.Our algorithm bas its decisions on purely local information about indi-vidual communities,while the GN algorithm us nonlocal information about the entire network—information derived from betweenness scores.Since community structure is itlf fundamentally a nonlocal quantity,it ems reasonable that one can do a better job of finding that structure if one has nonlocal information at one’s disposal.
For systems small enough that the GN algorithm is com-putationally tractable,therefore,we e no reason not to con-tinue using it—it appears to give the best results.For systems too large to make u of this approach,however,our algo-rithm gives uful community structure information with comparatively little effort.
We have applied our algorithm to a variety of real-world networks also.We have looked,for example,at the “karate club”network studied in [5],which reprents friendships between 34members of a club at a U.S.university,as re-corded over a two-year period by Zachary [17].During the cour of the study,the club split into two groups as a result of a dispute within the organization,and the members of one group left to start their own club.In Fig.2we show the dendrogram derived by feeding the friendship network into our algorithm.The peak modularity is Q =0.381and corre-sponds to a split into two groups of 17,as shown in the figure.The shapes of the vertices reprent the alignments of the club members following the dispute and,as we can e,the division found by the algorithm corresponds almost per-fectly to the alignments;only one vertex,number 10,is classified wrongly.The GN algorithm performs similarly on this task,but not better—it also finds the split but classifies one vertex wrongly (although a different one,vertex 3).In other tests,we find that our algorithm also successfully de-tects the main two-way division of the dolphin social net-work of Lusau [6,18],and the division between black and white musicians in the jazz network of Gleir and Danon [11].
As a demonstration of how our algorithm can sometimes miss some of the structure in a network,we take another example from Ref.[5],a network reprenting the schedule of games between American
用字母表示数优秀教案
college football teams in a single ason.Becau the teams are divided into groups or “conferences,”with intraconference games being more fre-quent than interconference games,we have a reasonable idea ahead of time about what communities our algorithm should find.The dendrogram generated by the algorithm is shown in Fig.3,and has an optimal modularity of Q =0.546,which is a little shy of the value 0.601for the best split reported in [5].As the dendrogram reveals,the algorithm finds six com-munities.Some of them correspond to single conferences,but most correspond to two or more.The GN algorithm,by contrast,finds all 11conferences,as well as accurately iden-tifying independent teams that belong to no conference.Nonetheless,it is clear that our algorithm is quite capable of picking out uful community structure from the network,and of cour it is much the faster algorithm.On the author’s desktop computer the algorithm ran to completion in an im-measurably small time—less than a hundredth of a cond.The algorithm of Girvan and Newman took a little over a cond.
A time difference of this magnitude will not prent a big problem in most practical situations,but performance rapidly becomes an issue when we look at larger networks;we ex-pect the ratio of running times to increa with the number of vertices.Thus,for example,in applying our algorithm to the 1275-node network of jazz musician collaborations men-tioned above,we found that it runs to co
mpletion in about one cond of CPU time.The GN algorithm by contrast takes more than three hours to reach very similar results.As an example of an analysis made possible by the speed of our algorithm,we have looked at a network of collabora-tions between physicists as documented by papers posted on the widely ud Physics E-print Archive at arxiv.The network is an updated version of the one described in Ref.[13],in which scientists are considered connected if they have coauthored one or more papers posted on the archive.We analyze only the largest component of the network,which contains n =56276scientists in all branches of phys-ics covered by the archive.Since two vertices that are un-connected by any path are never put in the same community by our algorithm,the small fraction of vertices that are not part of the largest component can safely be assumed to be in parate communities in the n of our algorithm.Our al-gorithm takes 42min to find the full community structure.Our best estimates indicate that the GN algorithm would
take
FIG.  2.Dendrogram of the communities found by our algo-rithm in the “karate club”network of Zachary [5,17].The shapes of the vertices reprent the two groups into which the club split as the result of an internal dispute.
FAST ALGORITHM FOR DETECTING COMMUNITY …PHYSICAL REVIEW E 69,066133(2004)
somewhere between three and five years to complete its ver-sion of the same calculation.
The analysis reveals that the network in question consists of about 600communities,with a high peak modularity of Q =0.713,indicating strong community structure in the phys-ics world.Four of the communities found are large,contain-ing between them 77%of all the vertices,while the others are small—e Fig.4,left panel.The four large communities correspond cloly to subject subareas:one to astrophysics,one to high-energy physics,and two to condend-matter physics.Thus there appears to be a strong correlation be-tween the structure found by our algorithm and the commu-nity divisions perceived by human obrvers.It is precily correlation of this kind that makes community structure analysis a uful tool in understanding the behavior of net-worked systems.
We can repeat the analysis with any of the subcommuni-ties to obrve how they break up.For example,feeding the smaller of the two condend-matter groups through the al-gorithm again,we find an even stronger peak modularity of Q =0.807—the strongest we have yet obrved in any network—corresponding to a split into about a 100commu-nities (Fig.4,center panel ).The communities have a broad distribution of sizes from 3to nearly 2000.The distribution is shown in cumulative form in Fig.5,and we obrve that it is approximately power law in form with exponent
about
FIG.3.Dendrogram of the communities found in the college football network descibed in the text.The real-world communities—conferences—are denoted by the different shapes as indicated in the
legend.
FIG.4.Left panel:Community structure in the collaboration network of physicists.The graph breaks down into four large groups,each compod primarily of physicists of one specialty,as shown.Specialties are determined by the subction (s )of the e-print archive in which individuals post papers:“C.M.”indicates condend matter;“H.E.P.”indicates high-energy physics including theory,phenomenology,and nuclear physics;“astro”indicates astrophysics.Middle panel:one of the condend matter communities is further broken down by the algorithm,revealing an approximate power-law distribution of community sizes.Right panel:one of the smaller communities is further analyzed to reveal individual rearch groups (different shades ),one of which (in the dashed box )is the author’s own.ほしの景子
M.E.J.NEWMAN PHYSICAL REVIEW E 69,066133(2004)
浅水洼里的小鱼−1.6,although this conclusion should be treated with caution as there is significant deviation from a perfect power law [20].
Narrowing our focus still further to the particular one of the communities that contains the prent author,we find the structure shown in the right panel of Fig.4.Feeding this one last time through the algorithm,it breaks apart into com-munities that correspond cloly to individual institutional rearch groups,the author’s group appearing in the corner of the figure,highlighted by the dashed box.One could pur-
sue this line of analysis further,identifying individual groups,iteratively breaking them down,and looking,for ex-ample,at the patterns of collaboration between them,but we leave this for later studies.
IV .CONCLUSIONS
In this paper we have described an algorithm for extract-ing community structure from networks,which has a consid-erable speed advantage over previous algorithms,running to completion in a time that scales as the square of the network size.This allows us to study much large
r systems than has previously been possible.Among other examples,we have applied the algorithm to a network of collaborations between more than 50000physicists,and found that the resulting community structure corresponds cloly to the traditional divisions between specialties and rearch groups in the field.
We believe that our method will not only allow for the extension of community structure analysis to some of the very large networks that are now being studied for the first time,but will also provide a uful tool for visualizing and understanding the structure of the networks,who daunt-ing size has hitherto made many of their structural properties obscure.
ACKNOWLEDGMENTS
The author thanks Leon Danon,Pablo Gleir,David Lus-au,and Douglas White for providing network data ud in the examples.This work was supported in part by the Na-tional Science Foundation under Grant No.DMS-0234188.
[1]S.H.Strogatz,Nature (London )410,268(2001).
[2]R.Albert and A.-L.Barabási,Rev.Mod.Phys.74,47(2002).[3]S.N.Dorogovtv and J.F.F.Mendes,Evo
lution of Networks:From Biological Nets to the Internet and WWW (Oxford Uni-versity Press,Oxford,2003).
[4]M.E.J.Newman,SIAM Rev.45,167(2003).
[5]M.Girvan and M.E.J.Newman,Proc.Natl.Acad.Sci.U.S.A.
99,7821(2002).
[6]M.E.J.Newman and M.Girvan,Phys.Rev.E 69,026113(2004).
[7]D.Wilkinson and B.A.Huberman,Proc.Natl.Acad.Sci.U.S.A.101,5241(2004).
[8]P.Holme,M.Huss,and H.Jeong,Bioinformatics 19,532(2003).
[9]R.Guimerà,L.Danon,A.Díaz-Guilera,F.Giralt,and A.Are-nas,Phys.Rev.E 68,065103(2003).
[10]J.R.Tyler,D.M.Wilkinson,and B.A.Huberman,in Pro-ceedings of the First International Conference on Communities and Technologies ,edited by M.Huysman,E.Wenger,and V .Wulf (Kluwer,Dordrecht,2003).
[11]P.Gleir and L.Danon,Adv.Complex Syst.6,565(2003).[12]S.Redner,Eur.Phys.J.B 4,131(1998).
[13]M.E.J.Newman,Proc.Natl.Acad.Sci.U.S.A.98,404
(2001).[14]J.Kleinberg and S.Lawrence,Science 294,1849(2001).[15]B.Everitt,Cluster Analysis (John Wiley,New York,1974).[16]J.Scott,Social Network Analysis:A Handbook ,2nd ed.(Sage,
London,2000).
[17]W.W.Zachary,J.Anthropol.Res.33,452(1977).
[18]D.Lusau,Proc.R.Soc.London,Ser.B 270,S186(2003).[19]The criterion for deciding correct classification is as follows.
We find the largest t of vertices that are grouped together by the algorithm in each of the four known communities.If the algorithm puts two or more of the ts in the same group,then all vertices in tho ts are considered incorrectly classi-fied.Otherwi,they are considered correctly classified.All other vertices not in the largest ts are considered incorrectly classified.This criterion is quite harsh—there are cas in which one might consider some of the vertices to have been identified correctly,where this method would not.Even with this harsh definition,however,our algorith
m performs well,and a laxer definition would only make its performance more impressive.
[20]This power law is different from the one obrved in an email
network by Guimeràet al.[9].They studied the histogram of community sizes over all levels of the dendrogram;we are looking only at the single level corresponding to the maximum value of Q
.
FIG.5.Cumulative distribution function of the sizes of commu-nities found in one of the subnetworks of the physics collaboration graph,as described in the text.The dotted line reprents the slope the plot would have if the distribution followed a power law with exponent −1.6.
FAST ALGORITHM FOR DETECTING COMMUNITY …PHYSICAL REVIEW E 69,066133(2004)

本文发布于:2023-07-18 08:50:14,感谢您对本站的认可!

本文链接:https://www.wtabcd.cn/fanwen/fan/82/1102791.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:苹果   表示   浅水
相关文章
留言与评论(共有 0 条评论)
   
验证码:
推荐文章
排行榜
Copyright ©2019-2022 Comsenz Inc.Powered by © 专利检索| 网站地图