remarks

更新时间:2022-11-24 21:43:32 阅读: 评论:0


2022年11月24日发(作者:西安日语培训)

6thInternational

WorkshoponData

MininginBioinformatics

(BIOKDD06)

WorkshopChairs

GeorgeKarypis

JiongYang

HeldinconjunctionwiththeKDDconference,August20,2006

BIOKDD06:WorkshoponDataMininginBioinformatics

August20th,2006

Philadelphia,PA,USA

inconjunctionwith

12thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining

ComputerScience

Department

RenslaerPolytechnic

Institute

Troy,NY12180,USA

zaki@

GeorgeKarypis

DepartmentofComputer

Science

UniversityofMinnesota

Minneapolis,MN55455,USA

karypis@

JiongYang

ElectricalEngineeringand

ComputerScience

Department

CaWesternRerve

University

Cleveland,OH44106USA

jiong@

OpeningRemarks

DataMiningistheprocessofautomaticdiscoveryofnovel

andunderstandablemodelsandpatternsfromlargeamounts

ormaticsisthescienceofstoring,analyzing,

andutilizinginformationfrombiologicaldatasuchas-

quences,molecules,geneexpressions,-

opmentofnoveldataminingmethodswillplayafundamen-

talroleinunderstandingtherapidlyexpandingsourcesof

biologicaldata.

ThegoalofthisworkshopistoencourageKDDrearchers

totakeonthenumerouschallengesthatBioinformaticsof-

kshopwillfeatureinvitedtalksfromnoted

expertsinthefield,andthelatestdataminingrearchin

uragepapersthatproponovel

dataminingtechniquesfortaskslike:

•Geneexpressionanalysis

•Protein/RNAstructureprediction

•Phylogenetics

•Sequenceandstructuralmotifs

•GenomicsandProteomics

•Genefinding

•Drugdesign

•RNAiandmicroRNAAnalysis

•Textmininginbioinformatics

•Modelingofbiochemicalpathways

Theproceedingscontain6papers,outof18submissions,

paperwasreviewedbythreemembersoftheprogramcom-

ithtwokeynotetalks,byDavidRoosand

SridharHannenhalli,wewereabletoasmbleaveryexcit-

ingprogram.

Wewouldliketothankalltheauthors,invitedspeaker,and

attendeesforcontributingtothesuccessoftheworkshop.

Specialthanksareduetotheprogramcommitteeforhelp

inreviewingthesubmissions.

WorkshopCo-Chairs

•,RenslaerPolytechnicInstitute

•GeorgeKarypis,UniversityofMinnesota

•JiongYang,CaWesternRerveUniversity

ProgramCommittee

•TatsuyaAkutsu,KyotoUniversity•ChrisBailey-

Kellogg,DartmouthCollege•MehmetDalkilic,Indiana

University•YuanGao,HarvardUniversity•ShinIshii,

NaraInstituteofScienceandTechnology•SunKim,In-

dianaUniversity•LeiLiu,UniversityofIllinois,Urbana-

Champaign•ZoranObradovic,TempleUniversity•Gul-

tekinOzsoyoglu,CaWesternRerveUniversity•Srini-

vasanParthasarathy,OhioStateUniversity•JianPei,

SimonFrarUniversity•NarenRamakrishnan,Vir-

giniaTech•HannuToivonen,UniversityofHelsinki•

JasonWang,NewJeryInstituteofTechnology•Wei

Wang,UniversityofNorthCarolina,ChapelHill•Aidong

Zhang,SUNY,Buffalo•YingZhao,UniversityofMis-

souri,Rolla

ExternalReviewers

•Young-RaeCho,SUNY,Buffalo•WoochangHwang,

SUNY,Buffalo•MugdhaKhaladkar,NewJeryIn-

stituteofTechnology•TakeshiKawabata,NaraInsti-

tuteofScienceandTechnology•MustafaKirac,Ca

WesternRerveUniversity•ChuanLin,SUNY,Buffalo

•UrosMidic,TempleUniversity•ChandanReddy,

CornellUniversity•JuhoRousu,UniversityofHelsinki

•YangSong,NewJeryInstituteofTechnology•Ju-

nildaSpirollari,NewJeryInstituteofTechnology•

DongrongWen,NewJeryInstituteofTechnology•

HongboxXie,TempleUniversity•WangZhuo,Uni-

versityofIllinois,Urbana-Champaign

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page1

WorkshopProgram&TableofContents

8:50-9:00am:1

9:00-10:00am:ProteinInteractionsandFolding

•SignalTransductionModelBadFunctionalModuleDetectionAlgorithmforProtein-ProteinInteractionNetworks,

WoochangHwang,Young-raeCho,AidongZhangandMuraliRamanathan(StateUniversityofNewYorkatBuf-

falo).page3

•ProteinFoldingTrajectories:Summarization,EventDetectionandConnsusPartialFoldingPathwayIdentification,

HuiYang,SrinivasanParthasarathyandDuyguUcar(OhioStateUniversity).page13

10:00-10:30am:CoffeeBreak

10:30-11:30am:InvitedTalk

•DavidRoos,MerriamProfessorofBiology&Director,PennGenomicsInstitute,22

11:30-12:00pm:ClusterVisualization

•AutomaticLayoutandVisualizationofBiclusters,us,(Virginia

PolytechnicInstituteandStateUniversity).page23

12:00-1:30pm:Lunch

1:30-2:30pm:InvitedTalk

•DecipheringGeneRegulatoryNetworksbyinsilicoapproaches,SridharHannenhalli,PennCenterforBioinformatics,

DepartmentofGenetics,31

2:30-3:30pm:MotifDiscovery

•ExMotif:EfficientStructuredMotifExtraction,YongqiangZhangandMohammedZaki(RenslaerPolytechnicInsti-

tute).page32

•EfficientSearchusingHybridEMforMotifRefinement,ChandanReddy,Yao-ChungWengandHsiao-DongChiang

(CornellUniversity).page42

3:30-4:00pm:CoffeeBreak

2:30-3:30pm:Classification

•ProteinClassificationusingSummariesofProfile-BadFrequencyMatrices,KeithMarsoloandSrinivasanParthasarathy

(OhioStateUniversity).page51

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page2

SignalTransductionModelBadFunctionalModule

DetectionAlgorithmForProtein-ProteinInteraction

Network

WoochangHwang†Young-RaeCho†AidongZhang†MuraliRamanathan††

†DepartmentofComputerScienceand

Engineering,StateUniversityofNew

YorkatBuffalo,USA

††DepartmentofPharmaceutical

Sciences,StateUniversityofNewYork

atBuffalo,USA

Email:{whwang2,ycho8,azhang}@,murali@

ABSTRACT

Cellularfunctionsarecoordinatelycarriedoutbygroupsofgenes

ionofsuchfunc-

tionalmodulesfromprotein-proteininteraction(PPI)networksis

oneofthemostchallengingandimportantprobleminpostgenomic

er,thesparconnectivityofprotein-proteininterac-

tiondatatsmakesidentificationoffunctionalmodulesmorechal-

arefulobrvationsofthepropertiesoffunctional

modulesinthePPInetwork,wehavefoundthattheactualtopologi-

calshapesandproperties,includingthegraphdensityanddiameter,

ofthefunctionalmodulesinthePPInetworkhaveexpodunex-

pectedphenomena,e.g.,lowintraconnectivityandlongishshapes.

Manydifferentclusteringapproacheshavebeenpropodtoextract

themconcentrated

onlyondenlyconnectedregionstopologicallyandignoredbio-

logicalcharacteristicsofthenetworktobedealtwith,eventhough

ore,theycould

findonlytheclusterswithcertaindensity,andfailedtofindef-

fectivefunctionalmoduleswhicharebiologicallysignifi-

thermore,theyproducedmanysmallsizeclusters,whichhaveless

than5membersorevensingletons,anditresultedindiscarding

-

quertheproblemffectively,wedevelopanalgorithm,termed

STM,whichutilizesthedegreeofinfluencebetweenproteinstode-

rscanthenbeformulated

omparedtosixcompeting

approachesincludingthemaximumclique,quasi-clique,minimum

cut,betweenesscutandMarkovClustering(MCL)

clustersobtainedbyeachtechniquearecomparedforenrichmentof

fiedclustersbySTMareshowntobeen-

richedforbiologicalfunctionbetterthantheclustersidentifiedby

gicalevaluationoftheidentified

clustersbyourmethoddemonstratedthatourmethodcansuccess-

fullyidentifyarbitraryshapeclusterswithlargesizethattheother

tiontotheabove,animportantstrengthof

ourapproachisthatthepercentageofproteinsthatarediscardedto

createclustersismuchlowerthantheotherapproacheswhichhave

anaveragediscardpercentageof59%ontheyeastprotein-protein

interactionnetwork.

Keywords

SignalTransduction,Protein-proteininteractionnetwork,Functional

moduledetection

UCTION

Sincethefirstcompletegenomewasquencedin1995,morethan

300prokaryoticgenomesandmorethan20eukaryoticgenomes

havebeenquenced[17].Discoveringthefunctionalrolesofgene

productsafterthecompletionofquencingtheSaccharomyces

Cerevisiaegenomehasbeeninthespotlightofpost-genomicera.

High-throughputtechniques[5;11;12;25]forprotein-proteinin-

teractions(PPI)detectionhaveattractedrearchers’attentionsince

interactingproteinsarelikelytorvetogetherasagroupincel-

lularfunctions[10].Inrecentyears,high-throughputtechniques

inagenomicscalesuchasyeast-two-hybrid,massspectrometry,

andproteinchiptechnologieshavemultipliedthevolumeofpro-

teininteractiondatatxponentiallyandalsohaveprovidedusa

ulativePPI

datatof,forexample,siaeinDIP(DatabaofInter-

actingProteins)[2]nowlistsover4900proteinsand18,000inter-

actionsfromover22000experiments;however,nearlyhalfofthe

ivecomputationalsystemsfor

storage,management,visualizationandanalysisarenecessaryto

copewiththefastgrowingcomplexdatats.

PPIdataprovideusthegoodopportunitytosystematicallyanalyze

thestructureofalargelivingsystemandalsoallowustouit

tounderstandesntialprincipleslikeesntiality,geneticinterac-

tions,functions,functionalmodules,proteincomplexesandcellu-

arfunctionsandbiochemicaleventsarecoor-

dinatelycarriedoutbygroupsofproteinsinteractingeachotherin

functionalmodules,andthemodularstructureofcomplexnetworks

iscriticaltofunction[7;10;20].Identifyingsuchfunctionalmod-

ulesinPPInetworksisveryimportantforunderstandingthestruc-

-

fore,developinganeffectivecomputationalapproachtoidentify

functionalmodulesshouldbehighlychallengingbutindispensable.

Clusteringanalysishelpsusunderstandthetopologicalstructureof

thePPInetworksandrelationshipsamongitscomponentsbetter.

And,theprincipalfunctionofeachclustercanbeinferredfromthe

onsforunannotatedmembersofa

clustercanbepredictedbythefunctionsofotherannotatedmem-

bers[18].

PPIadjacencymatricescanbereprentedasgraphswhonodes

ster-

ingofaPPIdatatcanbetherebyreducedtographtheoretical

,thebinarynatureofthecurrentPPIdatatsim-

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page3

themaximalcliqueapproach,clusteringisreducedtoidentifying

fullyconnectedsubgraphsinthegraph[24].Toovercometherel-

ativelyhighstringencyimpodbythemaximalcliquemethod,

theQuasiClique[2],MolecularComplexDetection(MCODE)[1],

SpirinandMirny[24]algorithmsidentifydenlyconnectedsub-

graphsratherthanfullyconnectedonesbyeitheroptimizingan

-

strictedNeighborhoodSearchClusteringAlgorithm(RNSC)[16]

andHighlyConnectedSubgraphs(HCS)algorithms[9]harness

minimumcostedgecutsforclusteridentifikov

ClusterAlgorithm(MCL)algorithmfindsclustersusingiterative

roundsofexpansionandinflationthatpromotethestronglycon-

nectedregionsandweakenthesparlyconnectedregions,respec-

tively[26].Thelinegraphgenerationapproach[20]transforms

thenetworkofproteinsconnectedbyinteractionsintoanetworkof

connectedinteractionsandthenustheMCLalgorithmtocluster

haandLiang[22]employedasta-

tisticalapproachtoclusteringofproteinsbadonthepremithat

apairofproteinssharingasignificantlylargernumberofcommon

neighborswillhavehighfunctionalsimilarity.

However,currentlyudapproachencounterchallengesbecau

therelationshipbetweenproteinfunctionandPPIischaracterized

byweakconnectivityandunexpectedtopologicalphenomena,such

aslowintraconnectivityandlongishshapesofactualtopological

shapesofMIPSfunctionalcategories[19].Inourexperimental

analysis,subgraphsofeachfunctionalcategoriesinMIPSdataba

[19]areextractedfromtheYeastPPInetwork,andthedensityof

sityoftho

subgraphsisaveragedabout0.0023whichisfairlylowerthanwe

nctionalcategorieshavelowconnectivitywithin

theminthePPInetworkandthemajorityofmembersinfunctional

categoriesdonothavedirectphysicalinteractionwithothermem-

rmore,itisnot

difficulttofindsingletonsintheextractedsubgraphsoffunctional

categorieswhichmeansthatsomeproteinsdonothaveanyinter-

actionwithotherproteinsinthefunctionalcategorytheybelong

averagediameterofagraphbethelengthofthelongest

ragediam-

eterofthesubgraphsofallMIPSfunctionalcategoriesisapproxi-

mately4whichisclototheaverageshortestpathslength,5.47,

rwords,thesubgraphsofactual

MIPSfunctionalcategoriesinthePPInetworkgenerallyarenot

clolycongregatedasweexpected,

tosuchlowdensitywithinthemodules,theexistingapproaches

haveproducedanumberofclusterswithsmallsizeandsingletons

andmercilesslydiscardedmanyweaklyconnectednodessincethey

complete-

nessofclusteringisariousdrawbackoftheexistingalgorithms.

Discardingthesparlyconnectednodescouldbeahazardousloss

ofimportantinformationofthePPInetwork.

Biologicalnetworks,includingPPInetworks,illustratethebio-

chemicalrelationshipsofcomponentsinbiochemicalprocess.

Clusteringofbiologicalnetworksshouldbeabletoidentifyclusters

ofanyarbitraryshapesandanydensityifthemembersofacluster

shareimportantbiochemicalpropertiesfromthepointofviewof

withthisnecessityandovercome

thodrawbacksofexistingapproaches,wepropoanovelstrat-

egytoanalyzethedegreeofbiologicalandtopologicalinfluence

lPPI

networksasadynamicsignaltransductionsystemanddemonstrate

thesignaltransductionbehavioroftheperturbationbyeachprotein

haviorshouldalsoreflectthetopological

rallsignaltransductionbehavior

functionbetweenanytwoproteinswillbeformulatedtoevaluate

theperturbationcaudbyaproteinonotherproteinsbiologically

cessfullyidentifiedthe

clusterswithbiggersize,arbitraryshape,lowdensity,andbiologi-

callymoreenrichedthanotherexistingapproachesdid.

2.1TheSignalTransductionModel

Theproteinsandtheprotein-proteininteractionsinaPPIdatat

were,respectively,

graphreprentationwasmodeledusingapharmacodynamicsignal

fically,thesignaltransduction

behaviorofthenetworkwasmodeledusingtheErlangdistribution,

aspecialcaoftheGammadistribution.

Graphdefinitions:AnundirectedgraphG=(V,E)consistsof

atVofnodesandatEofedges,E⊆V×

e=(i,j)connectstwonodesiandj,e∈ghborsN(i)

ofnodeiaredefinedtobethetofdirectlyconnectednodesof

reed(i)ofanodeiisthenumberoftheedges

sdefinedasaquenceofnodes

(i

1

,...,i

k

)suchthatfromeachofitsnodesthereisanedgeto

gthofapathisthenumberofnodes

estpathbetweentwonodes,iandj,

tancebetweentwo

nodes,iandj,isthelengthofitsshortestpath.

Signaltransductionmodel:Wepropotomodelthedynamicre-

lationshipsbetweenproteinsinaPPInetworkusingasignaltrans-

fically,thesignaltransductionbe-

haviorofthenetworkismodeledusingtheErlangdistribution,a

angdistribution

functionis:

F(x)=1−e

x

c−1X

k=0

(x

b

)k

k!

(1)

wherec>0istheshapeparameter,b>0isthescaleparam-

eter,x≥0istheindependentvariable,-

langdistributionhasveralcharacteristics,whichareappropriate

fordescribingtheprotein-proteininteractionnetwork,includingits

positiverangeanditsimportantreproductiveproperty[14].The

Erlangdistributionwithx/b=1isudandthevalueofcist

tothenumberofedgesbetweensourceproteinnodeandthetarget

gthevalueofx/btounityasssthepertur-

bationatthetargetproteinwhentheperturbationreaches1/eofits

initialvalueatthenearestneighborofthesourceproteinnode.

Erlangdistributionmodelshavebeenudinpharmacodynamics

tomodelsignaltransductionandtransferdelaysinavarietyofsys-

temsincludingtheproductionofdruginducedmRNAandprotein

dynamics[21]andcalciumion-mediatedsignalinginneutrophils

[8].Figure1showsthecorrespondingpharmacodynamicsignal

transductionmodelwhobolusresponisanErlangdistribution

withparametersbandc,tically,theErlangdis-

tributionreprentsthetimerequiredtocarryoutaquenceof

ctaskswhodurationsareidentical,exponentialprobabilitydis-

madistributionisalsoexcellentforapproxi-

matingpopulationabundancesfluctuatingaroundequilibrium[4].

ThusinthecontextofthePPInetwork,theErlangdistributioncan

beviewedasamechanisticallymotivatedmeasureoftheperturba-

tioncaudbythesuddenintroductionofthesourceproteintoa

targetproteininthenetwork.

TheErlangdistributionneedstobefurthermodifiedtoreflectnet-

turbationinducedbythesourceprotein

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page4

bbbb

Bolus

Input

Erlang

Output

Node

1

Node

2

Node

c

Figure1:Thepharmacodynamicthetimeconstantfor

signaltransferandcisthenumberofcompartments.

A

B

C

D

E

L

H

K

I

M

J

F

G

N

NodeA

A:5.0

F:4.0

G:0.4741

H:0.0881

NodeB

A:5.0

F:0.5057

G:0.0396

H:0.0054

NodeC

A:5.0

F:0.5057

G:0.0396

H:0.0054

NodeD

A:5.0

F:0.5057

G:0.0396

H:0.0054

NodeE

A:5.0

F:0.5057

G:0.0396

H:0.0054

NodeF

A:5.0

F:4.0

G:3.0

H:0.8428

NodeL

A:0.7902

F:4.0

G:0.4741

H:0.0881

NodeG

A:0.7902

F:4.0

G:3.0

H:4.0

NodeN

A:0.7902

F:4.0

G:3.0

H:0.8428

NodeK

A:0.0084

F:0.0881

G:0.4741

H:4.0

NodeH

A:0.1101

F:0.8428

G:3.0

H:4.0

NodeJ

A:0.0084

F:0.0881

G:0.4741

H:4.0

NodeI

A:0.0084

F:0.0881

G:0.4741

H:4.0

NodeM

A:0.0009

F:0.0133

G:0.0991

H:1.2642

Figure2:xcontainsthenumericalvaluesobtainedfromEquation2fromnodesA,F,G,andHtoothertarget

sforothernodesarenotshown.

nodeshouldbeproportionaltoitsdegreeandtofollowtheshortest

transductiontothetarget

proteinnode,theperturbationshoulddissipateateachintermediate

rallsignaltransduction

behaviorfunctionfromnodevtonodew(v=w)isthus:

S(v→w)=

d(v)

Q

i∈P(v,w)

d(i)

×F(x)(2)

whered(i)isthedegreeofnodei,P(v,w)isthetoftheallnodes

visitedenroutefromnodevtonodew,excludingthesourcenode

vandthetargetdestinationnodew,andF(x)isthesignaltrans-

=wanddistance(v,w)=0,

wedefineS(v→w)=d(v).Thenumeratorofthefirstterm

intherighthandsideofEquation2reprentsthedegreeofthe

sourcenodev,andthedenominatorreprentsthedissipationon

eachvisitingnodeontheshortestpathfromsourcenodevtotarget

iceoftheshortestpathismotivatedbythefinding

thatthemajorityoffluxprefersthepathofleastresistanceinmany

,thefirstterminthe

righthandsideofEquation2reprentsthetopologicaleffectof

ondtermintherighthand

sideofEquation2reprentsthebiologicaleffectofsourcenodev

ore,

thenodesthatscorethehighestvalueontargetnodewwillbethe

mostinfluentialnodesonnodewbiologicallyandtopologically.

Figure2demonstratesthesignaltransductionbehaviorofasmall

eaofunder-

standing,onlythesignalsfromnodeA,F,G,andHareprented,

althoughsignalsshouldbepropagatedfromeachnodeinthenet-

xinFigure2containsthesignalassdbythe

Equation2fromnodesA,F,G,andHtoothertargetnodes,e.g.,

5.0,0.5057,0.0396,0.0054arethesignalsassdfromnodes

A,F,G,andH,respectively,umericalvalues

illustrateoveralleffectsofcombiningthenetworktopologywith

thesignaltransductionmodelfromsourcenodesA,F,G,andH

uently,nodeA,whichhasscoredthehighest

value,willbethemostinfluentialnodeonnodeEbiologicallyand

topologically.

2.2ClusteringModel

STMalgorithmsimulatestheperturbationfromeachnodetothe

othernodesinanetworkusingEquation2,whichreflectsthebi-

repren-

tativesarethenodesthatrecordthehighestscoresbyEquation2

oneverynodeinamodule,i.e.,theyarethemostinfluentialnodes

hesignaltrans-

ductionsimulation,eachnodelectsthemostinfluentialnodesas

ereprentatives,prelim-

inarymodulescanbeformedbyaggregatingeachnodeintoeach

y,the

preliminarymodulesaremergediftherearesubstantialintercon-

nectionsbetweenthem.

ThepudocodefortheSTMalgorithm,whichemploysthesignal

transductionfunctionofEquation2andademocraticreprenta-

tiveslectionalgorithm,orithm

involvesfourquentialprocess:

Process1:Computesignalstransducedbetweenallnodepairs.

Process2:Selectclusterreprentativesforeachnode.

Process3:Formationofpreliminaryclusters.

Process4:Mergepreliminaryclusters.

Process1propagatessignalsfromeachsourcenodeandrecordsthe

signalquantitiesoneachtargetnodeforallnodepairsaccordingto

lementationofProcess1isshownonlines

6-10oftheSTMalgorithminAlgorithm1.

InProcess2,eachnodeelectsthenodesfromwhichitreceives

thehighestsignalvalueasthereprentativesoftheclustersthat

mple,inFigure2,nodesA,B,

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page5

Algorithm1STM(G)

1:V:tofnodesingraphG

2:F(x):Transductionbehaviorfunction

3:S(v,w):arrivedsignalfromnodevtonodew

4:C:thelistoffinalclusters

5:PreClusters:thelistofpreliminaryclusters

6:foreachnodepair(v,w)v,w∈V,v=wdo

7:distance(v,w)←theshortestpathlengthfromnodevto

nodew

8:tparametercinfunctionasdistance(v,w)

9:signal(v,w)←S(v→w)

10:endfor

11:foreachnodev∈Vdo

12:entative←lectthebestscorednodewfornode

v

13:ifcluster

w==nullthen

14:makeclusterw

15:(v)

16:(clusterw)

17:el

18:cluster

(v)

19:endif

20:endfor

21:C←Merge(PreClusters)

Procedure1Merge(C)

1:C:theclusterlist

2:MaxPair:theclusterpair(c,k)withmaxinterconnections

amongallpairs

3::interconnectionsbetweenclusterpaircandk

4:MaxPair←findMaxPair(C,null)

5:≥thresholddo

6:newCluster←mergeMaxPaircandk

7:ReplaceclustercwithnewCluster

8:Removeclusterk

9:MaxPair←findMaxPair(C,newCluster)

10:endwhile

11:returnC

C,D,E,andFwillchoonodeAandnodesL,G,andNwill

choonodeF,whicharethebestscorednodesonthonodes,as

thereprentatives.

Eachpreliminaryclusterisinitializedbytakingitsreprentative

inaryclustersarethenaugmentedby

accumulatingeachnodetowardeachclusterthateachoftherep-

rentatives,whicharechonbythenode,rom

11-20inAlgorithm1containthereprentativelectionprocess

thatSTM

allowsoverlapsamongclustersbyopeningthepossibilityofmul-

tiplereprentativeswhichhavethetiescoreonanode,

example,nodeGpicksnodesFandH,whichhavethetiescoreon

nodeG,,Gwillbelongto

-

fore,overlapsoccurbetweentheclusterformedbynodeF,{F,G,

L,N},andtheclusterformedbynodeH,{G,H,I,J,K,M}.STM

identifiedthreepreliminaryclusters,{A,B,C,D,E,F},{F,G,L,

N},and{G,H,I,J,K,M},badonthechoiceofreprentatives

inFigure2.

Somepreliminaryclustersmaybemergediftheyhavesubstantial

otomeasurethedegreeofintercon-

nectivitybetweenclustersbythesimilarityoftwoclustersiandj

definedbelow:

Similarity(i,j)=

interconnectivity(i,j)

minsize(i,j)

(3)

whereinterconnectivity(i,j)isthenumberofconnectionsbe-

tweenclustersiandj,andminsize(i,j)isthesizeofthesmaller

ilarity(i,j)betweentwo

clustersiandjistheratioofthenumberoftheconnectionsbe-

intercon-

nectedclustersareiterativelymergedbadonthesimilarityofthe

rofclustersthathavethehighestsimilarityare

mergedineachiterationandthemergeprocessiteratesuntilthe

highestsimilarityofallclusterpairsislessthanagiventhreshold.

Thelectionofthethresholdformergingclustersisacriticalfac-

torforthefitically,wecanewhen

interconnectivity(i,j)≥minsize(i,j),clustersiandjhave

lusters,{A,B,C,D,E,F},

{F,G,L,N},{G,H,I,J,K,M},areobtainedaftertheProcess4

sters,{A,B,C,

D,E,F,G,L,N},{G,H,I,J,K,M},areobtainedaftertheMerge

processwhen1.0isudasthemergethreshold.

2.3ClusterAsssment

ThestructuresoftheclustersidentifiedbySTMandothercompet-

ingalternativeapproachesareassdusingveralmetrics.

Theclusteringcoefficient,C(v),ofanodevmeasurestheconnec-

tivityamongitsdirectneighbors:

C(v)=

2|

S

i,j∈N(v)

(i,j)|

d(v)(d(v)−1)

(4)

InEquation4,N(v)isthetofthedirectneighborsofnodev

andd(v)

connectednodeshavehighvaluesofclusteringcoefficient.

Degreecentralityordersnodesbythenumberoftheirdirectneigh-

bors,andbetweennesscentralitymeasuresthenodes’importance

fromtheinformationfland

betweennesscentralitycommonlyudtomeasuretheimportance

weenessCentrality,C

B

(v),isa

measureoftheglobalimportanceofanodethatasssthepro-

portionofshortestpathsbetweenallnodepairsthatpassthrough

weenessCentrality,C

B

(v)foranode

ofinterest,v,isdefinedby:

C

B

(v)=

X

s=v=t∈V

ρ

st

(v)

ρ

st

(5)

IntheEquation5,ρ

st

isthenumberofshortestpathsfromnodes

totandρ

st

(v)thenumberofshortestpathsfromstotthatpass

throughthenodev.

Theextenttowhichtheclustersareassociatedwithaspecificbi-

ologicalfunctionivaluatedusingap-valuebadonthehyper-

geometricdistribution[2].Thep-valueistheprobabilitythata

clusterwouldbeenrichedwithproteinswithaparticularfunction

-valueisgivenby:

p=1−

k−1X

i=0

C

i

«„

G−C

n−i

«

G

n

«

(6)

InEquation6,Cisthesizeoftheclustercontainingkproteins

withagivenfunction;Gisthesizeoftheuniversaltofproteins

-

cauthep-valuesarefrequentlysmallnumberswithpositiveval-

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page6

Figure3:Accumulationoflethalproteinsforvariouspercentilesof

degree(grayline),betweenesscentrality(dashedline)ortheSTM

signaltransductionmetric(solidline).Theresultsareshownfor

thetop555proteinsobtainedfromtheyeastPPInetworkandare

ordered;thehighestvaluesofthemetricsareclosttotheorigin.

uesbetween0and1,thenegativelogarithms(toba10,denoted

-logp)areud.A-logpvalueof2orgreaterindicatesstatistical

significanceatα=0.01.

ThedensityofasubgraphsinaPPInetworkismeasuredby:

D

s

=

2e

n(n−1)

(7)

InEquation7,nisthenumberofproteinsandeisthenumberof

interactionsinasubgraphsofaPPInetwork.

MENTALRESULTS

3.1ProteinInteractionData

siaewasobtainedfromtheDIPdataba

[3].Thisdatatinclude2526proteinsand5949filteredreliable

saeprovideim-

portanttestbedsforthestudyofthePPInetworkssinceitisa

well-studiedorganismforwhichmostproteomicsdataisavailable

fortheorganism,byvirtueoftheavailabilityofadefinedandrel-

ativelystableproteome,fullgenomeclonelibraries,established

molecularbiologyexperimentaltechniquesandanassortmentof

welldesignedgenomicsdatabas[3;10].

3.2BiologicalSignificanceofthePutativeMod-

uleReprentatives

OursignaltransductionmodelofEquation2providesavehicle

toquantitativelymeasurethedegreeofbiologicalandtopologi-

calinfluenceofeachproteinonotherproteinsinthePPInetwork.

Themostinfluentialproteins,thatis,thehighestscorednodes,are

uatethebiologicalsignificance

ofthemostinfluentialproteins,weannotatedthelethalityofeach

proteinintheyeastPPInetworkaccordingtotheMIPSlethality

ityisacrucialfactortocharacterizethebiologicales-

terminedbyexaminingwhethera

moduleisfunctionallydisruptedwhentheproteinisknockedout.

WeobtainedtheproteinlethalityinformationfromMIPSdataba

[19],d

that233proteinsoutofthetopscored555proteinsarelethal.

-

berofproteinnodesincludedforincreasingpercentilesofthede-

gree,a

areshownfor555genes,obtainedfromtheyeastPPInetwork,with

ca,theresults

-

ure3showsthattheperformanceoftheSTMmetricinpredicting

lethalityiscomparabletothatofdegreeandbetweenessapproaches

forupto150nodes.

3.3ClusteringPerformanceAnalysis

Experimentally,weperformedSTMalgorithmontheyeastPPI

datatusingvariousmergethresholdvaluestofindthebestthresh-

mentsusing0.5,1.0,1.5,2.0,

2.5,and3.0asthemergethresholdwereperformedoneachdata

ultsshowthatwhenthemergethresholdislessthan

1.0,clustersthatdonothavesubstantialsimilarityaremerged;and

whenthemergethresholdisgreaterthat1.5,mergingldomoc-

snomuchperformancedifferencewhenthevalues

erimentwhen1.0isudas

themergethresholdshowedthebestperformance.

3.3.1ClusterAnalysis

555preliminaryclustersareobtainedfromtheyeastPPInetwork

e1,all60

clustersthathavemorethan4proteinsarelisted,anditalsoshows

theirtopologicalcharacteristicsandtheirassignedmolecularfunc-

litatecriticalasss-

ments,thepercentageofproteinsthatareinconcordancewiththe

majorassignedfunction(hits),thediscordantproteins(miss)and

he60clusters,thelargest

onecontains210proteinsandthesmallestonecontains5inthem.

Onaverage,wehave40.1proteinsinacluster,andtheaverageden-

sityofthesubgraphsoftheclusterxtractedfromthePPInetwork

-logpvaluesofthemajorfunctionidentifiedineach

clusterisalsoshownandthevaluesprovideameasureoftherela-

tiveenrichmentofaclusterforagivenfunctionalcategory:higher

ultsdemon-

stratethattheSTMmethodcandetectlargebutsparlyconnected

hval-

uesof-logp(valuesgreaterthan2.0indicatestatisticalsignificance

atα<0.01)indicatethatclustersaresignificantlyenrichedforbi-

ologicalfunctionandcanbeconsideredtobefunctionalmodules.

Asaresult,ourmethodcanclearlyidentifylargermodulesthat

havelowdensitybutstillbiologicallyenrichedaswecanefrom

thesize,thedensity,andtheP-valueoftheclustersinTable1.

Figure4exhibitsthedistributionofthehit,miss,andunknown

percentageofmemberproteinswiththeassignedfunctionforeach

dthat

mostoftheproteinsinaclusterhavethesamefunctionsthatare

assignedasamainfunctionfortheclusterasshowninFigure4.

3.3.2ComparativeAnalysis

TheresultsinTable2and3fortheyeastPPIdatatshowthat

STMgenerateslargerclusters;theclustersidentifiedhadp-values

thatare2.2ordersofmagnitudeorapproximately125-foldlower

thanQuasiclique,thebestperformingalternativeclusteringmethod,

-valuesforthecellularlocalizationare

earthatthe

clustersidentifiedbySTMdespitebeinglargerhavelowp-values.

Althoughp-valuesgenerallydecreawithincreasingclustersize,

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page7

Distribution

ClusterSizeDensityHDU-LogpFunction

12140.01924.769.65.643.9Nucleartransport

21880.01569.125.05.836.4CellcycleandDNAprocessing

31810.02222.072.35.517.2Cytoplasmicandnuclearproteindegradation

4

1700.02846.442.910.531.6Transportedcompounds(substrates)

5

1310.02837.455.76.828.6Vesiculartransport(Golginetwork,etc.)

61250.03060.833.65.632.2tRNAsynthesis

71130.02719.471.68.811.8Actincytoskeleton

8790.04517.773.48.812.3Homeostasisofprotons

9780.03326.962.810.212.5Ribosomebiogenesis

10760.04138.159.22.620.2rRNAprocessing

11720.0305.684.79.76.2Calciumbinding

12

680.06466.125.08.844.5mRNAprocessing

13

610.04140.952.46.511.5Cytoskeleton

14580.06472.427.60.037.4Generaltranscriptionactivities

15530.04815.071.613.27.9MAPKKKcascade

16500.06466.032.02.033.5rRNAprocessing

17450.05524.473.32.211.1Metabolismofenergyrerves

18

440.05859.036.34.55.1Metabolism

19

390.07210.289.70.07.3Cell-celladhesion

20

360.12558.336.15.516.9Vesiculartransport

21

290.09155.144.80.08.3Phosphatemetabolism

22280.07414.278.57.14.5Lysosomalandvacuolarproteindegradation

23270.11929.666.63.77.3Cytokinesis(celldivision)/ptumformation

24260.15353.846.10.028.6Peroxisomaltransport

25250.09028.068.04.04.6RegulationofC-compoundandcarbohydrateutilization

26

250.11668.0284.012.9Cellfate

27

220.15159.036.34.511.4DNAconformationmodification

28

210.14776.119.04.723.9Mitochondrialtransport

29200.20075.020.05.024.0rRNAsynthesis

30190.22878.915.75.217.9Splicing

31170.22070.529.40.019.7Microtubulecytoskeleton

32170.18323.576.40.08.2Regulationofnitrogenutilization

33

150.30486.613.30.031.3Energygeneration

34

140.14250.042.87.19.0SmallGTPamediatedsignaltransduction

35

130.56476.923.00.015.9Mitosis

36

130.35884.615.40.012.4DNAconformationmodification

37130.41069.223.07.617.63’-endprocessing

38130.17961.530.77.66.7DNArecombinationandDNArepair

39120.19616.675.08.33.9Unspecifiedsignaltransduction

40

120.36358.341.60.014.7Posttranslationalmodificationofaminoacids

41

120.16616.675.08.32.4Autoproteolyticprocessing

42

110.21854.545.40.02.9Transcriptionalcontrol

43

110.20072.727.20.08.2Enzymaticactivityregulation/enzymeregulator

44100.46680.020.00.014.8Translationinitiation

4590.36177.722.20.012.8Translationinitiation

4680.32150.037.512.55.6Metabolismofenergyrerves

4780.32175.025.00.09.0Modificationbyubiquitination,deubiquitination

48

80.32137.562.50.03.7Mitosis

49

70.33342.857.10.03.5DNAdamagerespon

5070.33357.128.514.24.1Vacuolartransport

51

70.28528.571.40.04.4Biosynthesisofrine

5260.33350.033.316.62.38Modificationbyphosphorylation,dephosphorylation,etc.

5350.4001000.00.07.0Meiosis

5450.6001000.00.07.0Vacuolartransport

5550.4001000.00.08.5ERtoGolgitransport

56

50.40020.040.040.01.8cAMPmediatedsignaltransduction

5750.50040.040.020.03.1Oxidativestressrespon

58

50.50080.020.00.04.4Intracellularsignalling

5950.60040.060.00.04.2Tetracyclicandpentacyclictriterpenes

6050.40060.040.00.04.1Mitochondrialtransport

Table1:firstcolumnisaclusteridentifier;theSizecolumnindicatesthenumberofproteins

ineachcluster;theDensityindicatesthepercentageofpossibleproteininteractionsthatareprent;theHcolumnindicatesthepercentageofproteins

concordantwiththemajorfunctionindicatedinthelastcolumn;theDcolumnindicatesthepercentageofproteinsdiscordantwiththemajorfunctionandU

-logpvaluesforbiologicalfunctionareshown.

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page8

Figure4:Distributionofthethreeclassof60clusters:thehitpercentagewiththeassignedfunction,discordantpercentagefromthe

assignedfunction,andunknownpercentage

thedecreasinp-valuescanoccuronlywhenthenullhypothe-

-valuesreflecttheconfidencethatthedifferences,

ifprent,fidenceinanygiven

resultincreaswhentheareobtainedinalargersampleandin

,thedependenceofp-valuesonsamplesizeisintu-

-valuexpressthestrengthofevidenceagainstthenull

hypothesistoaccountforboththesamplesize,theamountofnoi

ore,theSTMclustershavelowp-values

becautheyareenrichedforfunctionandnotsimplybecauthey

arelarger.

Tables2and3demonstratethatSTMoutperformstheotherexist-

acomparisonwith6otherexistingap-

proaches,Maximalcliques[24],Quasicliques[2],Samantha[23],

Minimumcut[14],Betweennesscut[6],andMCL[26].Thecom-

parisonontheclustersizemorethan4isinTable2andonthe

blesshowthatoursig-

naltransductionmodelbadmethodgeneratesconsiderablylarger

clusters,andtheidentifiedclustersbyourmethodhaveatleast2or-

dersofmagnitudehigherP-valuethantheothersonbothfunction

andlocalizationcategories.

QuasicliqueandMaximalcliquediscarded80.8%and98.4%nodes

duringclusteringprocess,eventhoughtheyidentifiedtheclusters

liqueandSaman-

thadiscarded86.7%and93.3%nodes,eventhoughtheyidentified

theclusterswithrelativelyhighp-valuesintheclusterswithsize

rimportantstrengthofSTMis

thatthepercentageofproteinsthatarediscardedtocreateclusters

is7.8%,whichismuchlowerthantheotherapproaches,which

haveanaveragediscardpercentageof59%.TheyeastPPIdatat

isrelativelymodularandthebottom-upapproaches(e.g.,maximal

cliqueandquasicliquemethods)generallyoutperformedthetop-

downapproaches(exemplifiedbytheminimumcutandbetwee-

nesscutmethods)onfunctionalenrichmentasassdby-log

rbecaubottom-upapproachesarebadonconnec-

tivityofdenregions,thepercentagesofdiscardednodesforthe

bottom-upmethodsarealsohigherthanSTMandthetop-downap-

,wealreadyhaveshownthatthefunctionalmodules

havefairlylowdensityandarbitraryshapeswithlongdiameter.

So,discardingthosparlyconnectedproteinscouldbeafatal

decisionwhichmightresultedintheimportantbiologicalinforma-

uently,STMisversatileanditsperformanceon

biologicalfunctionandlocalizationenrichment,clustersize,and

discardrateissuperiortothebestoftheothersixmethodsonboth

datats.

3.3.3TopologicalAnalysis

Toanalyzetheclusteringresultsvisually,weobrvedthetopo-

logicalshapesofthedetectedclustersbySTMandtheirassociated

functionalcategoriesthatwereassignedfromMIPSdataba.4ex-

ampleclusterswhichhavevisuallyrecognizablesizearelected

graphsofthechon

clusterswereextractedfromtheyeastPPInetworkanddisplayed

Sfunction

categoriesthatwereassignedtoeachoftheclusterswerealso

extractedfromtheyeastPPInetworkanddisplayedontheright

thedetectedclustershave

-

ticethatwecaneasilyfindoutthatthediameterandthedensityof

thoexampleclustersandtheirassignedfunctionsarefairlylong

ticethatitisnotdifficulttofindsingletons

ineachfunctionalcategorysubgraphs,whichmeansthatthereare

memberswhichdonothaveadirectphysicalinteractionwithinthe

functionalcategorythattheybelongto.

3.4ComputationalComplexityAnalysis

STMisfundamentallyestablishedonallpairsshortestpatharch-

problemcanbesolvedinO(V2logV+VE)timeifitisimple-

mentedusingJohnson’salgorithm[13],whereVisthenumberof

easuringthe

distancebetweenallnodepairs,formationofpreliminaryclusters

takesO(V)untoftimerequiredtofindthebestclus-

terpairthathasthemostinterconnectionsisO(k2logk)byusing

heap-badpriorityqueue,wherekisthenumberofpreliminary

clusters[15].TheMergingprocessneedstofindtheclusterpair

whichhasthemostinterconnections,andittakesO(k2logk)time

econditeration,findingthe

bestclusterpairtakesO(klogk)timesincetheclusterpaircom-

parisonsareneededonlybetweenthenewlymergedclusterand

maximumk,thenumberofprelimi-

naryclusters,isatmostO(V)inthecaofthefullyconnected

graph,thereforetheMergingprocesstakesO(V2logV)

kismuchsmallerthanVinsparnetworksliketheYeastPPI

otaltimecomplexityofouralgorithmisbounded

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page9

MethodNumberSizeDiscard(%)FunctionLocation

STM6040.17.813.77.42

Maximalclique1205.6598.410.67.93

Quasiclique10311.280.811.56.58

Samantha647.979.99.164.89

Minimumcut11413.535.08.364.75

Bwtweennesscut18010.2621.08.194.18

MCL1639.7936.78.183.97

Table2:ComparisonofSTMtocompetingclusteringmethodsfortheyeastprotein-proteininteractiondatatforclusterswith5ormore

bercolumnindicatesthenumberofclustersidentifiedbyeachmethod,theSizecolumnindicatestheaveragenumber

ofproteinsineachcluster;theDiscard%-logpvaluesforbiological

functionandcellularlocationareshown.

MethodNumberSizeDiscard(%)FunctionLocation

STM4552.411.516.89.01

MaximalcliqueN/AN/AN/AN/AN/A

Quasiclique4616.786.715.39.34

Samantha1712.393.315.97.65

Minimumcut4424.355.014.88.78

Bwtweennesscut7814.450.511.36.05

MCL5516.769.411.55.42

Table3:ComparisonofSTMtocompetingclusteringmethodsfortheyeastprotein-proteininteractiondatatforclusterswith9ormore

tnoteisthesametoTable2.

bythetimeconsumedincomputingthedistancebetweenallnode

pairs,whichisO(V2logV+VE).

SION

Wehavestudiedthatthetopologicalshapesofthesubgraphsof

MIPSfunctionalcategoriextractedfromthePPInetworkare

wounex-

pectedpropertiesoffunctionalcategoriesprohibitedotherexist-

ingapproachesfromdetectingfunctionalmodulesfromPPInet-

iveexcessofemphasisondensityand

interconnectivityintheexistingmethodscanbepreferentialfor

detectingclusterswithrelativelybalancedroundshapesandlimit

ompletenessofclusteringisanotherdistinct

drawbackofexistingalgorithms,whichproducemanyclusterswith

ferenceforstronglyconnected

nodesresultsinmanyweaklyconnectednodesbeingdiscarded.

Moreover,consideringonlythetopologicalpropertiesandignoring

thebiologicalcharacteristicsofthenetworkalsocandamagethe

effectivenessofclustering.

Inthispaperwehavepropodanovelclusteringmethodbad

head-to-headcomparisons,theSTMoutperformedcompetingap-

proachesandiscapableofeffectivelydetectingbothdenand

sparlyconnected,biologicallyrelevantfunctionalmoduleswith

nowledge,thisisthefirstdescriptionofthe

uofsignaltransductionbadapproachforthisapplication.

Overwhelmingperformanceofourapproachhasbeendemonstrated

eratedbig-

gersizeclusterswitharbitraryshape,andthoidentifiedclusters

aremorebiologicallyenriched,i.e.,higherP-value,eventhough

remorethan5%ofunannotated

proteinsintheidentifictionofthounanno-

tatedproteinscanbepredictedaccordingtotheirassignedmain

tenessofourclusteringmethod

methoddiscardedonlyabout7.8%ofproteinswhichistremen-

douslylowerthantheotherapproachesdid,59%

conclusion,STMhasstrongpharmacodynamics-badunderpin-

ningsandisaneffective,versatileapproachforanalyzingprotein-

approachcontainsaframeworkfor

rationallyincorporatingreactionrates,proteinconcentrationsand

d

thereforehavepotentialapplicationsinthedrugdiscoveryandde-

velopment.

NCES

[1]Bader,matedmethodfor

findingmolecularcomplexesinlargeproteininteractionnet-

informatics,4:2,2003.

[2]Bu,gicalstructureanalysisoftheprotein-

cAcids

Res.,31:2443–2450,2003.

[3]Deane,C.M.,Salwinski,L.,Xenarios,enberg,D..

Proteininteractions:twomethodsforasssmentoftherelia-

oteomics,

1:349–356,2002.

[4]Dennis,il,G.P..Thegammadistributionand

weightedmultimodalgammadistributionsasmodelsofpop-

ences,68:187–212,1984.

[5]Gavin,..Functionalorganizationoftheyeastpro-

,

415:141–147,2002.

[6]Girvan,man,itystructurein

,99:7821–7826,2002.

[7]onalcartographyof

,433:895–900,2005.

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page10

YBL099W

YNL315C

YJL180C

YJR121W

YDR322C-A

YLR295C

YPL271W

YDL004W

YDR377W

YML081C-A

YPR020W

YPL078C

YDR298C

YKL016C

YDL181W

(a)

YBL099W

YBR039W

YDL004W

YDL181W

YDR298C

YDR322C-A

YDR377W

YJR121W

YKL016C

YLR295C

YML081C-A

YPL078C

YPL271W

YPR020W

(b)

YNR017W

YIL022W

YHR005C-A

YJL143W

YEL020W-A

YML054C

YGR181W

YJR135W-A

YPL063W

YBR091C

YDL217C

YJL054W

YOL034W

YOR297C

YJR045C

YFL016C

YGR254W

YJL064W

YMR203W

YNL131W

YNL121C

(c)

YBL030C

YBL099W

YBR039W

YBR085W

YBR091C

YBR291C

YDL004W

YDL198C

YDL217C

YDR298C

YDR322C-A

YEL020W-A

YEL030W

YER017C

YER154W

YGR082W

YGR101W

YGR181W

YHR002W

YHR005C-A

YIL022W

YIL114C

YJL054W

YJL133W

YJL143W

YJR045C

YJR077C

YJR095W

YJR121W

YJR135W-A

YKL016C

YKL120W

YKR065C

YLL024C

YLR008C

YLR259C

YLR295C

YML042W

YML062C

YMR089C

YMR203W

YNL055C

YNL064C

YNL121C

YNL131W

YNR017W

YOR232W

YOR297C

YPL063W

YPL078C

YPL271W

YPR021C

(d)

YHR166C

YKL022C

YFR036W

YBL084C

YLR127C

YOR249C

YGL240W

YLR102C

YDL008W

YDR118W

YER161C

YER047C

YGR225W

(e)

YAL047C

YBL084C

YBR109C

YCL029C

YDL008W

YDL064W

YDR118W

YDR159W

YDR457W

YEL061C

YFL008W

YFL037W

YFR031C

YFR036W

YGL003C

YGL075C

YGL116W

YGL240W

YHR129C

YHR166C

YKL022C

YKL049C

YKR048C

YKR054C

YLR086W

YLR102C

YLR127C

YLR226W

YML064C

YML085C

YML124C

YMR190C

YMR198W

YMR294W

YNL172W

YNL307C

YOR058C

YOR249C

YPL155C

YPL174C

YPL204W

YPR141C

(f)

YKR002W

YGR048W

YDR390C

YNL317W

YJR093C

YPR107C

YAL043C

YLR277C

YKL059C

YGR156W

YLR115W

YNL222W

YDL094C

(g)

YAL043C

YDR228C

YDR301W

YER032W

YGL044C

YGL122C

YGR156W

YGR178C

YJR093C

YKL059C

YKR002W

YLR115W

YLR277C

YMR061W

YNL317W

YOL123W

YOR250C

YPR107C

(h)

Figure5:Thesubgraphsof4exampleclustersidentifigraphsof4example

clustersdetectedbySTMareegraphsoftheir

associatedfunctionalcategories,whichareassignedfromMIPSdatabaforeachoftheclusters,arealsoextractedfromyeastPPI

networkanddisplayedontherightcolumnoftheirassociatedclusters.

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page11

[8]Hallett,tit,E.J..StochasticeventsunderlieCa2+

.,186:1–6,1997.

[9]Hartuv,E.,Shamir,eringAlgorithmbadGraph

ationProcessingLetters,76:175–181,

2000.

[10]Hishigaki,H.,Nakai,K.,Ono,T.,Tanigami,agi,

mentofpredictionaccuracyofproteinfunctionfrom

protein–,18:523–531,2001.

[11]Ho,aticidentificationofproteincomplexes

,

415:180–183,2002.

[12]Ito,ehensivetwo-hybridanalysistoexplore

,98:4569–4574,2001.

[13]Johnson,ficientalgorithmsforshortestpathsinspar

CM,24:1–13,1977.

[14]Johnson,N.L.,Kotz,akrishnan,uousuni-

ley&sons,NewYork,NY,

1994.

[15]Karypis,G.,Han,E.-ar,eon:Hier-

Com-

puter:Specialissueondataanalysisandmining,volume32,

pages68–75,1999.

[16]King,A.D.,Przulj,isica,ncomplexpredic-

ormatics,20:3013–3020,

2004.

[17]Letovsky,if,tingproteinfunctionfrom

protein/proteininteractiondata:aprobabilisticapproach.

Bioinformatics,19:i197–i204,2003.

[18]Lin,dgeDiscoveryinBioinformatics:Tech-

niques,Methods,andApplications,chapterClusteringmeth-

odsinprotein-proteininteractionnetwork,ley

&SonsInc.,2006.

[19]Mewes,H.W..MIPS:analysisandannotationofpro-

cAcidRearch,

34:D169–D172,2006.

[20]Pereira-Leal,J.B.,Enright,ounis,-

tionoffunctionalmodulesfromproteininteractionnetworks.

Proteins,54:49–57,2004.

[21]Ramanathan,rsionmodelforcellularsignaltrans-

.,19:1544–1548,2002.

[22]Samanta,ng,S..Predictingproteinfunctions

fromredundanciesinlarge-scaleproteininteractionnetworks.

PNAS,100:12579–12583,2003.

[23]Samanta,ng,anciesinlarge-scalepro-

.,100:12579–

12583,2003.

[24]Spirin,ny,L.A..Proteincomplexesandfunctional

,100:12123–12128,

2003.

[25]Uetz,ehensiveanalysisofprotein-protein

,403:623–

627,2000.

[26]vanDongen,S..TechnicalReportINS-R0010:Aclusteral-

alRearchInstituteforMathe-

maticsandComputerScience,2000.

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page12

ProteinFoldingTrajectoriesAnalysis:Summarization,

EventDetectionandConnsusPartialFoldingPathway

Identification

HuiYang

.&Eng.

TheOhioStateUniversity

2015NeilAvenue

Columbus,OH43210,USA

yanghu@-

Srinivasan

Parthasarathy

.&Eng.

TheOhioStateUniversity

2015NeilAvenue

Columbus,OH43210,USA

srini@

DuyguUcar

.&Eng.

TheOhioStateUniversity

2015NeilAvenue

Columbus,OH43210,USA

ucar@

ABSTRACT

Moleculardynamicssimulationsareemployedtoexplainthefor-

mationof3Dproteinstructures,afundamentalyetunsolvedprob-

ivecomparisonand

reprentationofthesimulationdataisamajorstepinunderstand-

paper,weex-

ploretheuofspatio-temporalassociationpatternstodiscover

-

poanapproachthatemploysthesimplicityofcontactmapsto

effectivelyanalyzeandutilizeinformationin3Dfoldingsimula-

hodalsoallowsonetoperformcrosscomparison

ofmultipletrajectoriestoidentifycriticaleventsorpartialfolding

iricalresultsonthe

foldingtrajectoriesoftheproteinBBA5demonstratetheefficacy

ofthepropodapproach.

UCTION

Thethreedimensional(3D)nativestructuresofproteinshaveim-

tandingthestructureof

aproteinenablesustoexplorethefunctionoftheprotein,explain

substrateandligandbinding,performrealisticdrugdesignandpo-

teinfolding

problemisthereforeoneofthemostfundamentalyetunsolved

orchal-

lengeinsimulatingtheproteinfoldingprocessisitscomplexity.

Snowetal.[15]statethatperformingaMolecularDynamics(MD)

simulationonamini-proteinforjust10μswouldrequiredecades

ding@homedis-

tributedcomputingproject[14]recentlypropodusingworldwide

distributedcomputingtotackleproteinfoldingsimulations.

Withtheincreasingnumberoftrajectoriesproducedbydistributed

computing,thereisaneedtoanalyze,understand,andmanagethe

usly,rearchershaveexaminedveralsum-

marystatistics(ofgyration,rootmeansquaredeviation

(RMSD))ghsummarystatisticsarecom-

monlyudforcomparison,theycanonlycaptureabiadand

ly,Rusletal.

[13]suggestedusinggeometricspannersformappingasimulation

nsiderus-

Contactauthor.

inggeometricspannerstodiscovertheproximitybetweendifferent

gmentsofaproteinacrossarangeofscales,andtrackthechanges

ofsuchproximityovertime.

Toovercomethedifficultiesinmanagingandanalyzingthelarge

amountofsimulationdata,Berraretal.[2]propoddesigning

bedtheirwarehouinagrid

environmenttoenablethesharingoftheactualsimulationdata.

Theyalsopropoimplementingatofdataminingalgorithmsto

facilitatecommonlyneededdataanalysistasks.

Inthispaper,wepropoamethodtoanalyzefoldingtrajecto-

riesoftheminiproteinBBA5producedbytheFolding@home

izethespatio-temporaldataminingframeworkthat

wehavedevelopedanddescribedearlierforthepurpoofman-

agingandanalyzingsuchdata[21].Asmentionedin[21],this

frameworkisdesignedtoanalyzespatio-temporaldataproduced

inveralscientifireviouswork,wehaveap-

pliedthisframeworkto8732proteinstakenfromtheProteinData

Banktoidentifystructuralfingerprintsfordifferentclassofpro-

teins[19].Eachproteinisassociatedwithatofobjectsthatare

ctivelycapturespatial

relationshipsamongobjects,wedefineSpatialObjectAssociation

Patterns(SOAP).Furthermore,byassociatingSOAPswithproteins

indifferentproteinclass,weestablishtheconnectionsbetween

differenttypesofSOAPsandproteinclass.

Itisapparentthatproteinfoldingtrajectorydatahavebothspa-

oteininaMDsimulation

consistsofanumberofresiduesspatiallylocatedinthe3Dspace

ameofthetrajectorycanberepre-

ntedasacontactmapin2D,capturingthepair-wi3Ddistance

rtoourpreviouswork[19],weextractnon-local

uanentropy-bad

bit-patternsarefurtherassociatedtoformspatialobjectassocia-

tionpatterns(SOAPs).BytheuofSOAPs,weeffectivelyrepre-

ntandanalyzefoldingtrajectoriesproducedbyMDsimulations.

Amajoradvantageofthisreprentationisitsappropriatenessfor

efitsofour

frameworkinclude:

•Effective,informativeandscalablereprentationoffold-

ingsimulations:WereprenteachframebyatofSOAPs,

whereeachSOAPinturncharacterizesthespatialrelation-

ship(orinteractionsinthefoldingca)amongmultiplebit-

renotonlyeasilyobtainablebutalso,as

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page13

wewillshow,areabletocapturelandmarksalongafolding

trajectory.

•Cross-analysisoftrajectoriestorevealaconnsuspar-

tialpathway:ByreprentingeachframeasatofSOAPs,

analysisincludesdetectingcriticaleventsandidentifyingcon-

nsuspartialfoldingpathwaysacrosstrajectories.

ISOFPROTEINFOLDINGTRA-

JECTORIES

2.1ProteinFoldingTrajectories

Advancesinhigh-performancecomputingtechnologiesandmolec-

ulardynamicshaveledtosuccessfulsimulationsoffoldingdynam-

icsfor(small)proteinsatatomisticlevel[12].Suchsimulations

resultinalargenumberoffoldingtrajectories,eachofwhichcon-

sistsofariesof3Dconformationsoftheproteinundersimu-

onformationsareusuallysampledregularly(e.g.,

every200fs)paper,wealsoreferto

r-

more,toreprentaproteinconformation,weadoptoneofthe

commonlyadoptedreprentationschemes,whereaconformation

isreprentedasaquenceofα-carbons(C

α

)locatedin3Dspace.

Asmentionedearlier,weobtainedtwofoldingtrajectoriesofthe

designedmini-proteinBBA5(ProteinDataBankID)fromthe

Folding@5is

ive

structure(orfold)ofBBA5showsaβ-hairpininvolvingresidues1-

includesanα-helixin-

ention,residuesare

numberedincreasinglyfromtheN-terminaltoC-terminalofapro-

1(a)

twofoldingtrajectories,referredtoasT

23

andT

24

respectively,

areofdifferentlength.T

23

consistsofariesof192conforma-

tions(orframes),whileT

24

nformationis

describedatatomisticlevelinPDBformatadoptedbytheProtein

DataBankprograms.

2.2ComparingConformationsofBBA5Across

Trajectories

Althoughbothtrajectoriesstartfromthesameextendedconfor-

mationasshowninFigure1(b),whenweexaminethevisualized

frames,theyemtoidentifytwoverydifferentfoldingprocess.

Figures1(c)and(d)illustratethelastframeinT

23

andT

24

respec-

emingdifferencemightbeattributedtothestochas-

ticnatureofthefoldingsimulationprocess[12;16].However,it

isalsodesirabletocharacterizethesimilarities(ordissimilarities)

acrossmultipletrajectories.

Tocomparetwotrajectories,akeyissuethatmustbeaddresd

is:howcanwecomparetwoproteinconformations?Severalmea-

sureshavebeencommonlyudtodosuchcomparison,includ-

ingRMSD(rootmeansquareddistance)[22],contactorder[9],

andnativecontacts[4].However,allthemeasuresaredesigned

rmore,

badonourempiricalanalysisofthemeasures,wenoticethat

theyaregenerallytoocoarandthuscanoftenbemisleading.

Evenmoreimportant,suchmeasuresfailtoidentifysimilarlocal

structures(ormotifs)especially

1/

2Pleareferto[12;16]fordetailsonthesimulationmodelem-

ployedtoproducesuchtrajectories.

nstratedinbothex-

perimentalandtheoreticalstudies,smallproteinsoftenfoldhierar-

chicallyandbeginlocally[1].Forinstance,ithasbeenshownthat

BBA5tendstofirstformcondarystructuressuchasβ-turnsand

α-helix,thenconformtoitsglobaltopology[16].Finally,assug-

gestedbyPande[12],bothsterics(localmotifs)andglobaltopol-

ore,

tocompareconformationsof(small)proteins,amorereasonable

-

over,itshouldalsotakethenativetopologyoftheproteinunder

studyintoaccount.

Tomeettherequirements,wepropothefollowingapproachto

,weloolypartitionthe

23residuesofBBA5intofourfragments:(i)F

1

:N-terminal1-10

β-hairpin;(ii)F

2

:C-terminal11-23α-helixfragment;(iii)F

3

:the

firsthalfofF

1

andthecondhalfofF

2

;and(iv)F

4

:thec-

ondhalfofF

1

andthefirsthalfofF

2

,i.e.,themiddlectionin

,werecognizethecondarystruc-

formationsaresaidtobe

similariftheydemonstratethesamecondarystructurepropen-

tance,theconformationpairin

Figure6(a)aresimilarasresiduesinF

1

,F

2

andF

4

frombothcon-

formationsindicateaβ-notethatthe

-

stance,inFigure6(d),wesaythetwoconformationshaveasimilar

structureinF

1

fragment,eventhoughtheβ-turnmotifshavedif-

ferentorientations.

Torealizethecomparisonofconformations,twomoreissuesmust

,howcanweeffectivelycaptureandrepre-

ntlocalmotifs?Second,howcanwereprenttheglobaltopol-

ogyofaconformationintermsoflocalmotifs?Toaddressthefirst

issue,weleveragethenon-localpatternsinproteincontactmaps.

Forthecond,wecharacterizethespatialarrangementamong

eSection3fordetails.

2.3FoldingTrajectoryAnalysis:Objectives

Therearetwogoalswewouldliketoachieveinanalyzingthe

,wewouldliketoaddressthefollow-

ingfoldingissuesforagiventrajectory:(1)todetect(oreven

predict)significantfoldingevents,includingtheformationofβ-

turns,α-helices,andnative-likeconformations;and(2)torecog-

nizethetemporalorderingofimportantfoldingeventsinthetra-

tance,betweenthetwocondarystructuresα-helix

andβ-hairpininBBA5,whichformarlier?Whatisorderingof

thetwoeventsprecedingaβ-hairpinformation:formationoftwo

extendedstrandsorformationoftheturn?

Ifthefirstgoalconcernsindividualtrajectories,thecondgoal

fically,wewouldliketoiden-

tifyasub-quenceofsimilarconformationsinbothtrajectories.

Thissub-quenceofconformationsisreferredtoastheconnsus

analogoustotheLongestCom-

monSub-quence(LCS)problem[5],butmuchmorechalleng-

,wearedealingwithtimeries

,wearenotlookingforanex-

-

stead,wearelookingforsimilarconformationsacrosstrajectories.

Wewouldliketopointoutthatthisworkisclolyrelatedtoour

previousworkonproteinstructuralanalysis[18]andourworkon

miningspatio-temporaldata[21].

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page14

40

45

50

55

60

65

70

-48

-46

-44

-42

-40

-38

-36

-34

-8

-6

-4

-2

0

2

4

6

8

10

12

N-terminal

C-terminal

0

5

10

15

20

25

30

35

40

15

20

25

30

35

40

45

10

15

20

25

30

35

40

45

N-terminal

C-terminal

14

16

18

20

22

24

26

28

30

32

34

20

22

24

26

28

30

32

34

16

18

20

22

24

26

28

30

32

34

36

N-terminal

C-terminal

16

18

20

22

24

26

28

30

32

34

18

20

22

24

26

28

30

32

34

15

20

25

30

35

40

N-terminal

C-terminal

(a)(b)(c)(d)

Figure1:DifferentconformationsofBBA5,whereeachpointcorrespondstoanα-carbon.(a)ThenativeNMRstructureofBBA5badon

datafromtheSCOPwebsite.(b)Theinitialconformationofbothfoldingtrajectories.(c)Thelastconformationinthefirsttrajectory.(d)The

lastconformationinthecondtrajectory.

THM

Inthisction,wedescribeindetailourapproachforanalyzing

ninFigure2,thisanalysis

consistsofthreemainphas:(I)Datapreprocessing,(II)Spatio-

temporalobjectassociationpatternmining,and(III)Trajectoryanal-

discusachphainfurtherdetails.

3.1DataPreprocessing

Asinourpreviousstudiesonproteinstructuralanalysis[18;19],

,in

thiswork,weusimilarmeasurestocreateandprocesssuchmaps.

Inorderforthisalgorithmtobelf-contained,webrieflygoover

rtoreadersto[18]formore

tion,wewillalsoexplaintherationalebehindveral

keystepsinthecontextofproteinfolding.

ContactMapGeneration

Whengeneratingcontactmaps,weconsidertheEuclideandis-

tancesbetweenα-carbons(C

α

)α-carbons

areconsideredtobeincontactiftheirdistanceiswithin8.5

˚

A.

Thus,foraproteinofNresidues,itscontactmapisanN×N

binarymatrix,wherethecellat(i,j)is1iftheithandjthα-

carbonsareincontact,ontactmapsaresym-

metricacrossthediagonal,weonlyconsiderthebitsbelowthedi-

rmore,wealsoignorethepairsofC

α

atomswho

distanceintheprimaryquenceis≤2,astheyaresuretobein

epesntiallytransformsthetwoBBA5trajectories

intotworiesofcontactmaps,witheachmapofsize23×23.

IdentifyingMaximallyConnectedBit-patterns

dge

position,

contactmap,aconnectedbit-patternisacollectionofbit-1posi-

tions,whereforeach1,-

respondingly,wedefineamaximally-connectedbit-pattern(also

referredtoasabit-patterninthisarticle)tobeaconnectedpat-

yasim-

pleregiongrowthalgorithmtoidentifyallmaximally-connected

of

352maximally-connectedbit-patternswereextractedfromthetwo

foldingtrajectoriesofBBA5.

Asdiscusdinourpreviousworkandelwhere[7;8;10;17],

suchbit-patternscaneffectivelycapturethecondarystructures

ontextofproteinfolding,wehaveobrved

thattheyarepowerfulenoughtoreprentawiderangeoflocal

venmeasureapproximatelythestrength

ofcondarystructurepropensityinaconformationbadonthe

tance,wehaveidentifiedbit-patternsthatcor-

respondto“premature”α-helicesandnative-likeα-helicesrespec-

orth,werefertothe3Dstructureformedbyallthe

participatingresiduesofabit-patternasthe3Dmotifofthebit-

pattern.

ClusteringBit-patternsintoApproximatelyEquivalent

Groups

Weapplyanentropy-badclusteringalgorithmtogroupthebit-

patternsintolclusters,wherethebit-patternsinaclustershow

similargeometricproperties(e.g.,shapeandsize).Thevalueof

lisdeterminedusinganentropy-badmeasurethatquantitatively

indicatesthequalityofaclusteringresult.(Pleareferto[18]for

details.)FortheBBA5foldingdata,theclusteringstepgroupsthe

352bit-patternsinto10clusters(ortypes).

Intuitively,the3Dmotifsofthebit-patternsinaclusterwillalso

verifiedbadonour

3illustratestherepre-

type0,asbit-patternsofthistype,unliketheothers,correspondto

awidevarietyof3Dmotifs.

Thisdemonstrates,toacertainextent,theadvantageofusing2D

tedly,

usingcontactmapsgreatlyreducesthecomputationalcomplexity

ofouralgorithm,thoughatthecostoflossinstructuralinformation.

Moreimportantly,byexploitingdifferentfeaturesincontactmaps

(bit-patternsinthiswork),weareabletoconnect2Dfeatureswith

a,byidentifying10typesofbit-

patternsincontactmaps,weindirectlyrecognize10different3D

structuralmotifsinthefoldingconformations.

Re-labelingBit-patternswithTheCorrespondingClus-

terLabel

Inthisstep,were-labelallthepreviouslyidentifiedbit-patterns

alabeledbit-pattern.

Itcanbereprentedasfollows:p=(trajID,frameID,listC

α

,

label).Here,trajIDidentifiesafoldingtrajectory,andframeID

indicatestheframewherepoccurs,listC

α

consistsofallpar-

ticipatingα-carbonsofp,identifiedbytheirpositioninthepri-

y,5,

label∈{g

0

,g

1

,···,g

9},correspondingtothe10approximately

equivalentgroups(ortypes).

BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page15

I:Datapreprocessing

1.1Generatecontactmapsforeveryconformationinthetwofoldingtrajectories

1.2Identifymaximallyconnectedbit-patternsinallcontactmaps

1.3Clusterbit-patternsintoapproximatelyequivalentgroupsbadongeometricproperties

1.4Re-labeleachbit-patternwithitscorrespondingclusterlabel

II:Discoveringfrequentspatio-temporalobjectassociationpatterns(SOAPs)

2.1Discoverfrequent(minLink=1)-SOAPsofbit-patternsineitherfoldingtrajectory

III:Foldingtrajectoryanalysis

3.1SummarizeeachfoldingtrajectorybadonfrequentSOAPs

3.2Detectfoldingeventsandrecognizetheorderingoffoldingeventsinatrajectory

3.3Identifytheconnsuspartialfoldingpathwayacrosstrajectories

Figure2:Mainstepstoanalyzeproteinfoldingtrajectories

3DMotifNotationbit-patternvs.3Dmotif

α-helixlikeα

#4#7#9

-111——

11111—–

—-11—-

—–11—

——11–

——-11-

——–11

——–1-

——–1-

16

18

20

22

24

26

28

30

32

20

22

24

26

28

30

32

34

24

26

28

30

32

34

36

38

-1——-

-11——

–11—–

—11—-

—111—

—1111–

—-1111-

—–1–1

12

14

16

18

20

22

24

26

22

24

26

28

30

32

34

26

27

28

29

30

31

32

-11—–

-111—-

—11—

—111–

—-111-

—-11–

12

14

16

18

20

22

24

26

28

30

32

22

24

26

28

30

32

34

36

28

28.5

29

29.5

30

30.5

31

31.5

32

32.5

β-turnlikeβ

#1#6#8

–1—

–11–

-1111-

-11—

-1—-

24

26

28

30

32

34

36

38

23

24

25

26

27

28

29

20

21

22

23

24

25

26

27

28

29

—-1–

-11111-

-1111–

26

27

28

29

30

31

32

33

34

35

36

21

22

23

24

25

26

27

28

29

30

17

18

19

20

21

22

23

24

25

26

—-1—

—-11–

-111111-

—111–

—11—

—11—

—11—

29

30

31

32

33

34

35

36

37

38

19

20

21

22

23

24

25

14

16

18

20

22

24

26

28

smallα-helixlikea

#3(α)#2(β)

-11—

-111–

—11-

—–1

14

16

18

20

22

24

26

28

20

21

22

23

24

25

26

27

28

29

25

25.5

26

26.5

27

27.5

28

28.5

29

29.5

30

30.5

-1—

-11–

-111-

-1—

26

27

28

29

30

31

32

33

34

24

25

26

27

28

29

30

31

32

33

19

20

21

22

23

24

25

26

27

smallβ-turnlikeb

parallelstrands

本文发布于:2022-11-24 21:43:32,感谢您对本站的认可!

本文链接:http://www.wtabcd.cn/fanwen/fan/90/14426.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

下一篇:烟台培训
标签:remarks
相关文章
留言与评论(共有 0 条评论)
   
验证码:
Copyright ©2019-2022 Comsenz Inc.Powered by © 专利检索| 网站地图