Fast algorithm for detecting community structure in networks

更新时间:2023-07-18 08:50:14 阅读：评论：0

M.E.J.Newman

Department of Physics and Center for the Study of Complex Systems,University of Michigan,Ann Arbor,Michigan48109-1120,USA (Received22September2003;revid manuscript received22March2004;published18June2004)

Many networks display community structure—groups of vertices within which connections are den but

between which they are sparr—and nsitive computer algorithms have in recent years been developed for

detecting this structure.The algorithms,however,are computationally demanding,which limits their appli-

cation to small networks.Here we describe an algorithm which gives excellent results when tested on both

computer-generated and real-world networks and is much faster,typically thousands of times faster,than

previous algorithms.We give veral example applications,including one to a collaboration network of more

than50000physicists.

DOI:10.1103/PhysRevE.69.066133PACS number(s):89.75.Hc,05.10.Ϫa,87.23.Ge,89.20.Hh

I.INTRODUCTION

There has in recent years been a surge of interest within

the physics community in the properties of networks of many

kinds,including the Internet,the world wide web,citation

少先队员敬礼networks,transportation networks,software call graphs,

email networks,food webs,and social and biochemical net-

works[1–4].One property that has attracted particular atten-

tion is that of“community structure”:the vertices in net-

works are often found to cluster into tightly knit groups with

a high density of within-group edges and a lower density of

between-group edges.Girvan and Newman[5,6]propod a

computer algorithm bad on the iterative removal of edges

with high“betweenness”scores that appears to identify such

structure with some nsitivity,and this algorithm has been

employed by a number of authors in the study of such di-

ver systems as networks of email messages,social net-

works of animals,collaborations of jazz musicians,meta-

bolic networks,and gene networks[5–11].As pointed out by

Newman and Girvan[6],the principal disadvantage of their

algorithm is the high computational demands it makes.In its

simplest and fastest form,it runs in worst-ca time O͑m2n͒on a network with m edges and n vertices,or O͑n3͒on a spar network(one for which m scales with n in the limit of

large n,which covers esntially all networks of current sci-

entiﬁc interest,with the possible exception of food webs).

With typical computer resources available at the time of writ-

ing,this limits the algorithm’s u to networks of a few thou-

sand vertices at most,and substantially less than this for

interactive applications.Increasingly,however,there is inter-

est in the study of much larger networks;citation and col-

laboration networks can contain millions of vertices[12,13],

for example,while the world wide web numbers in the bil-

lions[14].

In this paper,therefore,we propo another algorithm for

detecting community structure.The algorithm operates on

different principles from that of Girvan and Newman(GN),

but,as we will show,gives qualitatively similar results.The

worst-ca running time of the algorithm is O(͑m+n͒n),or O͑n2͒on a spar graph.In practice,it runs to completion on current computers in reasonable times for networks of up to a million or so vertices,bringing within reach the study of communities in many systems that would previously have been considered intractable.

II.THE ALGORITHM

Our algorithm is bad on the idea of modularity.Given any network,the GN community structure algorithm always produces some division of the vertices into communities,re-gardless of whether the network has any natural such divi-sion.To test whether a particular division is meaningful,we deﬁne a quality function or“modularity”Q as follows[6].

Let e ij be one-half of the fraction of edges in the network that connect vertices in group i to tho in group j,so that the total fraction of such edges is e ij+e ji.The only exception will be the diagonal elements e ii,which are equal to the fraction of edges that fall within group i(with no factor of a half).Then͚i e ii is the total fraction of edges that fall within groups.All other edges fall between groups.The maximum value of this sum is1,and a division of the network into communities is good if this quantity is large,meaning it is of order1.On its own,however,the sum is not a good measure of community structure,since it takes its maximal value of1 if we put all vertices in a single group together,which is a trivial and not particularly uful form of community struc-ture.

A more uful measure of community structure is to cal-culate the sum͚i e ii and then subtract from it the value that it would take if edges were placed at random.Such a measure gives a score of zero to the trivial grouping with only a single community,but nonzero scores to nontrivial group-ings.

Let a i be the fraction of all ends of edges that are attached to vertices in group i.We can calculate a i straightforwardly by noting that a i=͚j e ij.If the ends of edges are connected together at random,the fraction of the resulting edges that connect vertices within group i is a i2.We deﬁne the modu-larity to be

Q=͚i͑e ii−a i2͒.͑1͒If a particular division gives no more within-community edges than would be expected by random chance,this modu-

PHYSICAL REVIEW E69,066133(2004)

larity is Q =0.Values other than 0indicate deviations from randomness,and in practice values greater than about 0.3appear to indicate signiﬁcant community structure.A number of examples are given in Ref.[6].

But this now suggests an alternative approach to ﬁnding community structure.If a high value of Q reprents a good community division,why not simply optimize Q over all possible divisions to ﬁnd the best one?By doing this,we can avoid the iterative removal of edges and cut straight to the cha.The problem is that true optimization of Q is very costly.The number of ways to divide n vertices into g non-empty groups is given by the Stirling number of the cond

kind S n ͑g ͒

,and hence the number of distinct community divi-sions is ͚g =1n S n ͑g ͒.This sum is not known in clod form,but we obrve that S n ͑1͒+S n ͑2͒=2

n −1for all n Ͼ1,so that the sum must increa at least exponentially in n .To carry out an exhaustive arch of all possible divisions for the optimal value of Q would therefore take at least an exponential amount of time,and is in practice infeasible for systems larger than 20or 30vertices.Various approximate optimiza-tion methods are available:simulated annealing,genetic al-gorithms,and so forth.Here we consider a scheme bad on a standard “greedy”optimization algorithm,which appears to perform well.

Our algorithm falls in the general category of agglomera-tive hierarchical clustering methods [15,16].Starting with a state in which each vertex is the sole member of one of n communities,we repeatedly join communities together in pairs,choosing at each step the join that results in the great-est increa (or smallest decrea )in Q .The progress of the algorithm can be reprented as a “dendrogram,”a tree that shows the order of the joins (e Fig.2for an example ).Cuts through this dendrogram at different levels give divisions of the network into larger or smaller numbers of communities and we can lect the best cut by looking for the maximal value of Q .

Since the joining of a pair of communities between which there are no edges at all can never result in an increa in Q ,we need only consider tho pairs between which there are edges,of which there will at any time be at most m ,where m is again the number of edges in the graph.The change in Q upon joining two communities is given by

渲组词⌬Q =e ij +e ji −2a i a j =2͑e ij −a i a j ͒,

͑2͒

which can clearly be calculated in constant time.The quan-tities e ij are initially equal to one-half of the corresponding

elements of the adjacency matrix of the ,to 1

2for vertex pairs that are joined by an edge and 0for tho that are not.Following a join,some of the matrix elements e ij must be updated by adding together the rows and columns corresponding to the joined communities,which takes worst-ca time O ͑n ͒.Thus each step of the algorithm takes worst-ca time O ͑m +n ͒.There are a maximum of n −1join op-erations necessary to construct the complete dendrogram and hence the entire algorithm runs in time O (͑m +n ͒n ),or O ͑n 2͒on a spar g

空调孔raph.The algorithm has the added advantage of calculating the value of Q as it goes along,making it espe-cially simple to ﬁnd the optimal community structure.

It is worth noting that our algorithm can be generalized trivially to weighted networks in which each edge has a nu-meric strength associated with it,by making the initial values of the matrix elements e ij equal to (a half of )tho strengths;otherwi the algorithm is as above and has the same running time.The networks studied in this paper,however,are all unweighted.行政招聘

III.APPLICATIONS

As a ﬁrst example of the working of our algorithm,we have generated using a computer a large number of random graphs with known community structure,which we then run through the algorithm to quantify its performance.Each graph consists of n =128vertices divided into four groups of 32.Each vertex has on average z in edges connecting it to members of the same group and z out edges to members of other groups,with z in and z out chon such that the total expected degree z in +z out =16,in this ca.As z out is incread from small values,the resulting community structure be-comes progressively weaker and the graphs po greater and greater challenges to the community-ﬁnding algorithm.In Fig.1we show the fraction of vertices correctly assigned to the four co

mmunities by the algorithm as a function of z out [19].As the ﬁgure shows,the algorithm performs well,cor-rectly identifying more than 90%of vertices for values of z out Շ6.Only when z out approaches the value 8at which the number of within-and between-community edges per vertex is the same does the algorithm begin to fail.On the same plot we also show the performance of the GN algorithm and,as we can e,that algorithm performs slightly but measurably better for smaller values of z out .For example,for z out =5our algorithm correctly identiﬁes an average of 97.4(2)%of ver-tices,while the older algorithm correctly identiﬁes 98.9(1)%.Both,however,clearly perform well.

Interestingly,for higher values of z out our algorithm per-forms better than the older one,and we have come across a few real-world networks in which this is the ca also.

苹果手机怎么强制开机

Nor-

FIG.1.The fraction of vertices correctly identiﬁed by our algo-rithms in the computer-generated graphs described in the text.The two curves show results for our algorithm (circles )and for the al-gorithm of Girvan and Newman [5](squares ).Each point is an average over 100graphs.

M.E.J.NEWMAN PHYSICAL REVIEW E 69,066133(2004)

mally,however,the GN algorithm ems to have the edge,and this should come as no great surpri.Our algorithm bas its decisions on purely local information about indi-vidual communities,while the GN algorithm us nonlocal information about the entire network—information derived from betweenness scores.Since community structure is itlf fundamentally a nonlocal quantity,it ems reasonable that one can do a better job of ﬁnding that structure if one has nonlocal information at one’s disposal.

For systems small enough that the GN algorithm is com-putationally tractable,therefore,we e no reason not to con-tinue using it—it appears to give the best results.For systems too large to make u of this approach,however,our algo-rithm gives uful community structure information with comparatively little effort.

We have applied our algorithm to a variety of real-world networks also.We have looked,for example,at the “karate club”network studied in [5],which reprents friendships between 34members of a club at a U.S.university,as re-corded over a two-year period by Zachary [17].During the cour of the study,the club split into two groups as a result of a dispute within the organization,and the members of one group left to start their own club.In Fig.2we show the dendrogram derived by feeding the friendship network into our algorithm.The peak modularity is Q =0.381and corre-sponds to a split into two groups of 17,as shown in the ﬁgure.The shapes of the vertices reprent the alignments of the club members following the dispute and,as we can e,the division found by the algorithm corresponds almost per-fectly to the alignments;only one vertex,number 10,is classiﬁed wrongly.The GN algorithm performs similarly on this task,but not better—it also ﬁnds the split but classiﬁes one vertex wrongly (although a different one,vertex 3).In other tests,we ﬁnd that our algorithm also successfully de-tects the main two-way division of the dolphin social net-work of Lusau [6,18],and the division between black and white musicians in the jazz network of Gleir and Danon [11].

As a demonstration of how our algorithm can sometimes miss some of the structure in a network,we take another example from Ref.[5],a network reprenting the schedule of games between American

用字母表示数优秀教案

college football teams in a single ason.Becau the teams are divided into groups or “conferences,”with intraconference games being more fre-quent than interconference games,we have a reasonable idea ahead of time about what communities our algorithm should ﬁnd.The dendrogram generated by the algorithm is shown in Fig.3,and has an optimal modularity of Q =0.546,which is a little shy of the value 0.601for the best split reported in [5].As the dendrogram reveals,the algorithm ﬁnds six com-munities.Some of them correspond to single conferences,but most correspond to two or more.The GN algorithm,by contrast,ﬁnds all 11conferences,as well as accurately iden-tifying independent teams that belong to no conference.Nonetheless,it is clear that our algorithm is quite capable of picking out uful community structure from the network,and of cour it is much the faster algorithm.On the author’s desktop computer the algorithm ran to completion in an im-measurably small time—less than a hundredth of a cond.The algorithm of Girvan and Newman took a little over a cond.

A time difference of this magnitude will not prent a big problem in most practical situations,but performance rapidly becomes an issue when we look at larger networks;we ex-pect the ratio of running times to increa with the number of vertices.Thus,for example,in applying our algorithm to the 1275-node network of jazz musician collaborations men-tioned above,we found that it runs to co

mpletion in about one cond of CPU time.The GN algorithm by contrast takes more than three hours to reach very similar results.As an example of an analysis made possible by the speed of our algorithm,we have looked at a network of collabora-tions between physicists as documented by papers posted on the widely ud Physics E-print Archive at arxiv.The network is an updated version of the one described in Ref.[13],in which scientists are considered connected if they have coauthored one or more papers posted on the archive.We analyze only the largest component of the network,which contains n =56276scientists in all branches of phys-ics covered by the archive.Since two vertices that are un-connected by any path are never put in the same community by our algorithm,the small fraction of vertices that are not part of the largest component can safely be assumed to be in parate communities in the n of our algorithm.Our al-gorithm takes 42min to ﬁnd the full community structure.Our best estimates indicate that the GN algorithm would

take

FIG. 2.Dendrogram of the communities found by our algo-rithm in the “karate club”network of Zachary [5,17].The shapes of the vertices reprent the two groups into which the club split as the result of an internal dispute.

FAST ALGORITHM FOR DETECTING COMMUNITY …PHYSICAL REVIEW E 69,066133(2004)

somewhere between three and ﬁve years to complete its ver-sion of the same calculation.

The analysis reveals that the network in question consists of about 600communities,with a high peak modularity of Q =0.713,indicating strong community structure in the phys-ics world.Four of the communities found are large,contain-ing between them 77%of all the vertices,while the others are small—e Fig.4,left panel.The four large communities correspond cloly to subject subareas:one to astrophysics,one to high-energy physics,and two to condend-matter physics.Thus there appears to be a strong correlation be-tween the structure found by our algorithm and the commu-nity divisions perceived by human obrvers.It is precily correlation of this kind that makes community structure analysis a uful tool in understanding the behavior of net-worked systems.

We can repeat the analysis with any of the subcommuni-ties to obrve how they break up.For example,feeding the smaller of the two condend-matter groups through the al-gorithm again,we ﬁnd an even stronger peak modularity of Q =0.807—the strongest we have yet obrved in any network—corresponding to a split into about a 100commu-nities (Fig.4,center panel ).The communities have a broad distribution of sizes from 3to nearly 2000.The distribution is shown in cumulative form in Fig.5,and we obrve that it is approximately power law in form with exponent

about

FIG.3.Dendrogram of the communities found in the college football network descibed in the text.The real-world communities—conferences—are denoted by the different shapes as indicated in the

legend.

FIG.4.Left panel:Community structure in the collaboration network of physicists.The graph breaks down into four large groups,each compod primarily of physicists of one specialty,as shown.Specialties are determined by the subction (s )of the e-print archive in which individuals post papers:“C.M.”indicates condend matter;“H.E.P.”indicates high-energy physics including theory,phenomenology,and nuclear physics;“astro”indicates astrophysics.Middle panel:one of the condend matter communities is further broken down by the algorithm,revealing an approximate power-law distribution of community sizes.Right panel:one of the smaller communities is further analyzed to reveal individual rearch groups (different shades ),one of which (in the dashed box )is the author’s own.ほしの景子

M.E.J.NEWMAN PHYSICAL REVIEW E 69,066133(2004)

浅水洼里的小鱼−1.6,although this conclusion should be treated with caution as there is signiﬁcant deviation from a perfect power law [20].

Narrowing our focus still further to the particular one of the communities that contains the prent author,we ﬁnd the structure shown in the right panel of Fig.4.Feeding this one last time through the algorithm,it breaks apart into com-munities that correspond cloly to individual institutional rearch groups,the author’s group appearing in the corner of the ﬁgure,highlighted by the dashed box.One could pur-

sue this line of analysis further,identifying individual groups,iteratively breaking them down,and looking,for ex-ample,at the patterns of collaboration between them,but we leave this for later studies.

IV .CONCLUSIONS

In this paper we have described an algorithm for extract-ing community structure from networks,which has a consid-erable speed advantage over previous algorithms,running to completion in a time that scales as the square of the network size.This allows us to study much large

r systems than has previously been possible.Among other examples,we have applied the algorithm to a network of collaborations between more than 50000physicists,and found that the resulting community structure corresponds cloly to the traditional divisions between specialties and rearch groups in the ﬁeld.

We believe that our method will not only allow for the extension of community structure analysis to some of the very large networks that are now being studied for the ﬁrst time,but will also provide a uful tool for visualizing and understanding the structure of the networks,who daunt-ing size has hitherto made many of their structural properties obscure.

ACKNOWLEDGMENTS

The author thanks Leon Danon,Pablo Gleir,David Lus-au,and Douglas White for providing network data ud in the examples.This work was supported in part by the Na-tional Science Foundation under Grant No.DMS-0234188.

[1]S.H.Strogatz,Nature (London )410,268(2001).

[2]R.Albert and A.-L.Barabási,Rev.Mod.Phys.74,47(2002).[3]S.N.Dorogovtv and J.F.F.Mendes,Evo

lution of Networks:From Biological Nets to the Internet and WWW (Oxford Uni-versity Press,Oxford,2003).

[4]M.E.J.Newman,SIAM Rev.45,167(2003).

[5]M.Girvan and M.E.J.Newman,Proc.Natl.Acad.Sci.U.S.A.

99,7821(2002).

[6]M.E.J.Newman and M.Girvan,Phys.Rev.E 69,026113(2004).

[7]D.Wilkinson and B.A.Huberman,Proc.Natl.Acad.Sci.U.S.A.101,5241(2004).

[8]P.Holme,M.Huss,and H.Jeong,Bioinformatics 19,532(2003).

[9]R.Guimerà,L.Danon,A.Díaz-Guilera,F.Giralt,and A.Are-nas,Phys.Rev.E 68,065103(2003).

[10]J.R.Tyler,D.M.Wilkinson,and B.A.Huberman,in Pro-ceedings of the First International Conference on Communities and Technologies ,edited by M.Huysman,E.Wenger,and V .Wulf (Kluwer,Dordrecht,2003).

[11]P.Gleir and L.Danon,Adv.Complex Syst.6,565(2003).[12]S.Redner,Eur.Phys.J.B 4,131(1998).

[13]M.E.J.Newman,Proc.Natl.Acad.Sci.U.S.A.98,404

(2001).[14]J.Kleinberg and S.Lawrence,Science 294,1849(2001).[15]B.Everitt,Cluster Analysis (John Wiley,New York,1974).[16]J.Scott,Social Network Analysis:A Handbook ,2nd ed.(Sage,

London,2000).

[17]W.W.Zachary,J.Anthropol.Res.33,452(1977).

[18]D.Lusau,Proc.R.Soc.London,Ser.B 270,S186(2003).[19]The criterion for deciding correct classiﬁcation is as follows.

We ﬁnd the largest t of vertices that are grouped together by the algorithm in each of the four known communities.If the algorithm puts two or more of the ts in the same group,then all vertices in tho ts are considered incorrectly classi-ﬁed.Otherwi,they are considered correctly classiﬁed.All other vertices not in the largest ts are considered incorrectly classiﬁed.This criterion is quite harsh—there are cas in which one might consider some of the vertices to have been identiﬁed correctly,where this method would not.Even with this harsh deﬁnition,however,our algorith

m performs well,and a laxer deﬁnition would only make its performance more impressive.

[20]This power law is different from the one obrved in an email

network by Guimeràet al.[9].They studied the histogram of community sizes over all levels of the dendrogram;we are looking only at the single level corresponding to the maximum value of Q

FIG.5.Cumulative distribution function of the sizes of commu-nities found in one of the subnetworks of the physics collaboration graph,as described in the text.The dotted line reprents the slope the plot would have if the distribution followed a power law with exponent −1.6.

FAST ALGORITHM FOR DETECTING COMMUNITY …PHYSICAL REVIEW E 69,066133(2004)

本文发布于:2023-07-18 08:50:14，感谢您对本站的认可！

本文链接：https://www.wtabcd.cn/fanwen/fan/82/1102791.html

上一篇：realflow发射器详解

下一篇：基于贝叶斯最优化的Xgboost算法的改进及应用

标签：苹果表示浅水

留言与评论（共有 0 条评论）