6thInternational
WorkshoponData
MininginBioinformatics
(BIOKDD06)
WorkshopChairs
GeorgeKarypis
JiongYang
HeldinconjunctionwiththeKDDconference,August20,2006
BIOKDD06:WorkshoponDataMininginBioinformatics
August20th,2006
Philadelphia,PA,USA
inconjunctionwith
12thACMSIGKDDInternationalConferenceonKnowledgeDiscoveryandDataMining
ComputerScience
Department
RenslaerPolytechnic
Institute
Troy,NY12180,USA
zaki@
GeorgeKarypis
DepartmentofComputer
Science
UniversityofMinnesota
Minneapolis,MN55455,USA
karypis@
JiongYang
ElectricalEngineeringand
ComputerScience
Department
CaWesternRerve
University
Cleveland,OH44106USA
jiong@
OpeningRemarks
DataMiningistheprocessofautomaticdiscoveryofnovel
andunderstandablemodelsandpatternsfromlargeamounts
ormaticsisthescienceofstoring,analyzing,
andutilizinginformationfrombiologicaldatasuchas-
quences,molecules,geneexpressions,-
opmentofnoveldataminingmethodswillplayafundamen-
talroleinunderstandingtherapidlyexpandingsourcesof
biologicaldata.
ThegoalofthisworkshopistoencourageKDDrearchers
totakeonthenumerouschallengesthatBioinformaticsof-
kshopwillfeatureinvitedtalksfromnoted
expertsinthefield,andthelatestdataminingrearchin
uragepapersthatproponovel
dataminingtechniquesfortaskslike:
•Geneexpressionanalysis
•Protein/RNAstructureprediction
•Phylogenetics
•Sequenceandstructuralmotifs
•GenomicsandProteomics
•Genefinding
•Drugdesign
•RNAiandmicroRNAAnalysis
•Textmininginbioinformatics
•Modelingofbiochemicalpathways
Theproceedingscontain6papers,outof18submissions,
paperwasreviewedbythreemembersoftheprogramcom-
ithtwokeynotetalks,byDavidRoosand
SridharHannenhalli,wewereabletoasmbleaveryexcit-
ingprogram.
Wewouldliketothankalltheauthors,invitedspeaker,and
attendeesforcontributingtothesuccessoftheworkshop.
Specialthanksareduetotheprogramcommitteeforhelp
inreviewingthesubmissions.
WorkshopCo-Chairs
•,RenslaerPolytechnicInstitute
•GeorgeKarypis,UniversityofMinnesota
•JiongYang,CaWesternRerveUniversity
ProgramCommittee
•TatsuyaAkutsu,KyotoUniversity•ChrisBailey-
Kellogg,DartmouthCollege•MehmetDalkilic,Indiana
University•YuanGao,HarvardUniversity•ShinIshii,
NaraInstituteofScienceandTechnology•SunKim,In-
dianaUniversity•LeiLiu,UniversityofIllinois,Urbana-
Champaign•ZoranObradovic,TempleUniversity•Gul-
tekinOzsoyoglu,CaWesternRerveUniversity•Srini-
vasanParthasarathy,OhioStateUniversity•JianPei,
SimonFrarUniversity•NarenRamakrishnan,Vir-
giniaTech•HannuToivonen,UniversityofHelsinki•
JasonWang,NewJeryInstituteofTechnology•Wei
Wang,UniversityofNorthCarolina,ChapelHill•Aidong
Zhang,SUNY,Buffalo•YingZhao,UniversityofMis-
souri,Rolla
ExternalReviewers
•Young-RaeCho,SUNY,Buffalo•WoochangHwang,
SUNY,Buffalo•MugdhaKhaladkar,NewJeryIn-
stituteofTechnology•TakeshiKawabata,NaraInsti-
tuteofScienceandTechnology•MustafaKirac,Ca
WesternRerveUniversity•ChuanLin,SUNY,Buffalo
•UrosMidic,TempleUniversity•ChandanReddy,
CornellUniversity•JuhoRousu,UniversityofHelsinki
•YangSong,NewJeryInstituteofTechnology•Ju-
nildaSpirollari,NewJeryInstituteofTechnology•
DongrongWen,NewJeryInstituteofTechnology•
HongboxXie,TempleUniversity•WangZhuo,Uni-
versityofIllinois,Urbana-Champaign
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page1
WorkshopProgram&TableofContents
8:50-9:00am:1
9:00-10:00am:ProteinInteractionsandFolding
•SignalTransductionModelBadFunctionalModuleDetectionAlgorithmforProtein-ProteinInteractionNetworks,
WoochangHwang,Young-raeCho,AidongZhangandMuraliRamanathan(StateUniversityofNewYorkatBuf-
falo).page3
•ProteinFoldingTrajectories:Summarization,EventDetectionandConnsusPartialFoldingPathwayIdentification,
HuiYang,SrinivasanParthasarathyandDuyguUcar(OhioStateUniversity).page13
10:00-10:30am:CoffeeBreak
10:30-11:30am:InvitedTalk
•DavidRoos,MerriamProfessorofBiology&Director,PennGenomicsInstitute,22
11:30-12:00pm:ClusterVisualization
•AutomaticLayoutandVisualizationofBiclusters,us,(Virginia
PolytechnicInstituteandStateUniversity).page23
12:00-1:30pm:Lunch
1:30-2:30pm:InvitedTalk
•DecipheringGeneRegulatoryNetworksbyinsilicoapproaches,SridharHannenhalli,PennCenterforBioinformatics,
DepartmentofGenetics,31
2:30-3:30pm:MotifDiscovery
•ExMotif:EfficientStructuredMotifExtraction,YongqiangZhangandMohammedZaki(RenslaerPolytechnicInsti-
tute).page32
•EfficientSearchusingHybridEMforMotifRefinement,ChandanReddy,Yao-ChungWengandHsiao-DongChiang
(CornellUniversity).page42
3:30-4:00pm:CoffeeBreak
2:30-3:30pm:Classification
•ProteinClassificationusingSummariesofProfile-BadFrequencyMatrices,KeithMarsoloandSrinivasanParthasarathy
(OhioStateUniversity).page51
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page2
SignalTransductionModelBadFunctionalModule
DetectionAlgorithmForProtein-ProteinInteraction
Network
WoochangHwang†Young-RaeCho†AidongZhang†MuraliRamanathan††
†DepartmentofComputerScienceand
Engineering,StateUniversityofNew
YorkatBuffalo,USA
††DepartmentofPharmaceutical
Sciences,StateUniversityofNewYork
atBuffalo,USA
Email:{whwang2,ycho8,azhang}@,murali@
ABSTRACT
Cellularfunctionsarecoordinatelycarriedoutbygroupsofgenes
ionofsuchfunc-
tionalmodulesfromprotein-proteininteraction(PPI)networksis
oneofthemostchallengingandimportantprobleminpostgenomic
er,thesparconnectivityofprotein-proteininterac-
tiondatatsmakesidentificationoffunctionalmodulesmorechal-
arefulobrvationsofthepropertiesoffunctional
modulesinthePPInetwork,wehavefoundthattheactualtopologi-
calshapesandproperties,includingthegraphdensityanddiameter,
ofthefunctionalmodulesinthePPInetworkhaveexpodunex-
pectedphenomena,e.g.,lowintraconnectivityandlongishshapes.
Manydifferentclusteringapproacheshavebeenpropodtoextract
themconcentrated
onlyondenlyconnectedregionstopologicallyandignoredbio-
logicalcharacteristicsofthenetworktobedealtwith,eventhough
ore,theycould
findonlytheclusterswithcertaindensity,andfailedtofindef-
fectivefunctionalmoduleswhicharebiologicallysignifi-
thermore,theyproducedmanysmallsizeclusters,whichhaveless
than5membersorevensingletons,anditresultedindiscarding
-
quertheproblemffectively,wedevelopanalgorithm,termed
STM,whichutilizesthedegreeofinfluencebetweenproteinstode-
rscanthenbeformulated
omparedtosixcompeting
approachesincludingthemaximumclique,quasi-clique,minimum
cut,betweenesscutandMarkovClustering(MCL)
clustersobtainedbyeachtechniquearecomparedforenrichmentof
fiedclustersbySTMareshowntobeen-
richedforbiologicalfunctionbetterthantheclustersidentifiedby
gicalevaluationoftheidentified
clustersbyourmethoddemonstratedthatourmethodcansuccess-
fullyidentifyarbitraryshapeclusterswithlargesizethattheother
tiontotheabove,animportantstrengthof
ourapproachisthatthepercentageofproteinsthatarediscardedto
createclustersismuchlowerthantheotherapproacheswhichhave
anaveragediscardpercentageof59%ontheyeastprotein-protein
interactionnetwork.
Keywords
SignalTransduction,Protein-proteininteractionnetwork,Functional
moduledetection
UCTION
Sincethefirstcompletegenomewasquencedin1995,morethan
300prokaryoticgenomesandmorethan20eukaryoticgenomes
havebeenquenced[17].Discoveringthefunctionalrolesofgene
productsafterthecompletionofquencingtheSaccharomyces
Cerevisiaegenomehasbeeninthespotlightofpost-genomicera.
High-throughputtechniques[5;11;12;25]forprotein-proteinin-
teractions(PPI)detectionhaveattractedrearchers’attentionsince
interactingproteinsarelikelytorvetogetherasagroupincel-
lularfunctions[10].Inrecentyears,high-throughputtechniques
inagenomicscalesuchasyeast-two-hybrid,massspectrometry,
andproteinchiptechnologieshavemultipliedthevolumeofpro-
teininteractiondatatxponentiallyandalsohaveprovidedusa
ulativePPI
datatof,forexample,siaeinDIP(DatabaofInter-
actingProteins)[2]nowlistsover4900proteinsand18,000inter-
actionsfromover22000experiments;however,nearlyhalfofthe
ivecomputationalsystemsfor
storage,management,visualizationandanalysisarenecessaryto
copewiththefastgrowingcomplexdatats.
PPIdataprovideusthegoodopportunitytosystematicallyanalyze
thestructureofalargelivingsystemandalsoallowustouit
tounderstandesntialprincipleslikeesntiality,geneticinterac-
tions,functions,functionalmodules,proteincomplexesandcellu-
arfunctionsandbiochemicaleventsarecoor-
dinatelycarriedoutbygroupsofproteinsinteractingeachotherin
functionalmodules,andthemodularstructureofcomplexnetworks
iscriticaltofunction[7;10;20].Identifyingsuchfunctionalmod-
ulesinPPInetworksisveryimportantforunderstandingthestruc-
-
fore,developinganeffectivecomputationalapproachtoidentify
functionalmodulesshouldbehighlychallengingbutindispensable.
Clusteringanalysishelpsusunderstandthetopologicalstructureof
thePPInetworksandrelationshipsamongitscomponentsbetter.
And,theprincipalfunctionofeachclustercanbeinferredfromthe
onsforunannotatedmembersofa
clustercanbepredictedbythefunctionsofotherannotatedmem-
bers[18].
PPIadjacencymatricescanbereprentedasgraphswhonodes
ster-
ingofaPPIdatatcanbetherebyreducedtographtheoretical
,thebinarynatureofthecurrentPPIdatatsim-
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page3
themaximalcliqueapproach,clusteringisreducedtoidentifying
fullyconnectedsubgraphsinthegraph[24].Toovercometherel-
ativelyhighstringencyimpodbythemaximalcliquemethod,
theQuasiClique[2],MolecularComplexDetection(MCODE)[1],
SpirinandMirny[24]algorithmsidentifydenlyconnectedsub-
graphsratherthanfullyconnectedonesbyeitheroptimizingan
-
strictedNeighborhoodSearchClusteringAlgorithm(RNSC)[16]
andHighlyConnectedSubgraphs(HCS)algorithms[9]harness
minimumcostedgecutsforclusteridentifikov
ClusterAlgorithm(MCL)algorithmfindsclustersusingiterative
roundsofexpansionandinflationthatpromotethestronglycon-
nectedregionsandweakenthesparlyconnectedregions,respec-
tively[26].Thelinegraphgenerationapproach[20]transforms
thenetworkofproteinsconnectedbyinteractionsintoanetworkof
connectedinteractionsandthenustheMCLalgorithmtocluster
haandLiang[22]employedasta-
tisticalapproachtoclusteringofproteinsbadonthepremithat
apairofproteinssharingasignificantlylargernumberofcommon
neighborswillhavehighfunctionalsimilarity.
However,currentlyudapproachencounterchallengesbecau
therelationshipbetweenproteinfunctionandPPIischaracterized
byweakconnectivityandunexpectedtopologicalphenomena,such
aslowintraconnectivityandlongishshapesofactualtopological
shapesofMIPSfunctionalcategories[19].Inourexperimental
analysis,subgraphsofeachfunctionalcategoriesinMIPSdataba
[19]areextractedfromtheYeastPPInetwork,andthedensityof
sityoftho
subgraphsisaveragedabout0.0023whichisfairlylowerthanwe
nctionalcategorieshavelowconnectivitywithin
theminthePPInetworkandthemajorityofmembersinfunctional
categoriesdonothavedirectphysicalinteractionwithothermem-
rmore,itisnot
difficulttofindsingletonsintheextractedsubgraphsoffunctional
categorieswhichmeansthatsomeproteinsdonothaveanyinter-
actionwithotherproteinsinthefunctionalcategorytheybelong
averagediameterofagraphbethelengthofthelongest
ragediam-
eterofthesubgraphsofallMIPSfunctionalcategoriesisapproxi-
mately4whichisclototheaverageshortestpathslength,5.47,
rwords,thesubgraphsofactual
MIPSfunctionalcategoriesinthePPInetworkgenerallyarenot
clolycongregatedasweexpected,
tosuchlowdensitywithinthemodules,theexistingapproaches
haveproducedanumberofclusterswithsmallsizeandsingletons
andmercilesslydiscardedmanyweaklyconnectednodessincethey
complete-
nessofclusteringisariousdrawbackoftheexistingalgorithms.
Discardingthesparlyconnectednodescouldbeahazardousloss
ofimportantinformationofthePPInetwork.
Biologicalnetworks,includingPPInetworks,illustratethebio-
chemicalrelationshipsofcomponentsinbiochemicalprocess.
Clusteringofbiologicalnetworksshouldbeabletoidentifyclusters
ofanyarbitraryshapesandanydensityifthemembersofacluster
shareimportantbiochemicalpropertiesfromthepointofviewof
withthisnecessityandovercome
thodrawbacksofexistingapproaches,wepropoanovelstrat-
egytoanalyzethedegreeofbiologicalandtopologicalinfluence
lPPI
networksasadynamicsignaltransductionsystemanddemonstrate
thesignaltransductionbehavioroftheperturbationbyeachprotein
haviorshouldalsoreflectthetopological
rallsignaltransductionbehavior
functionbetweenanytwoproteinswillbeformulatedtoevaluate
theperturbationcaudbyaproteinonotherproteinsbiologically
cessfullyidentifiedthe
clusterswithbiggersize,arbitraryshape,lowdensity,andbiologi-
callymoreenrichedthanotherexistingapproachesdid.
2.1TheSignalTransductionModel
Theproteinsandtheprotein-proteininteractionsinaPPIdatat
were,respectively,
graphreprentationwasmodeledusingapharmacodynamicsignal
fically,thesignaltransduction
behaviorofthenetworkwasmodeledusingtheErlangdistribution,
aspecialcaoftheGammadistribution.
Graphdefinitions:AnundirectedgraphG=(V,E)consistsof
atVofnodesandatEofedges,E⊆V×
e=(i,j)connectstwonodesiandj,e∈ghborsN(i)
ofnodeiaredefinedtobethetofdirectlyconnectednodesof
reed(i)ofanodeiisthenumberoftheedges
sdefinedasaquenceofnodes
(i
1
,...,i
k
)suchthatfromeachofitsnodesthereisanedgeto
gthofapathisthenumberofnodes
estpathbetweentwonodes,iandj,
tancebetweentwo
nodes,iandj,isthelengthofitsshortestpath.
Signaltransductionmodel:Wepropotomodelthedynamicre-
lationshipsbetweenproteinsinaPPInetworkusingasignaltrans-
fically,thesignaltransductionbe-
haviorofthenetworkismodeledusingtheErlangdistribution,a
angdistribution
functionis:
F(x)=1−e
−
x
b×
c−1X
k=0
(x
b
)k
k!
(1)
wherec>0istheshapeparameter,b>0isthescaleparam-
eter,x≥0istheindependentvariable,-
langdistributionhasveralcharacteristics,whichareappropriate
fordescribingtheprotein-proteininteractionnetwork,includingits
positiverangeanditsimportantreproductiveproperty[14].The
Erlangdistributionwithx/b=1isudandthevalueofcist
tothenumberofedgesbetweensourceproteinnodeandthetarget
gthevalueofx/btounityasssthepertur-
bationatthetargetproteinwhentheperturbationreaches1/eofits
initialvalueatthenearestneighborofthesourceproteinnode.
Erlangdistributionmodelshavebeenudinpharmacodynamics
tomodelsignaltransductionandtransferdelaysinavarietyofsys-
temsincludingtheproductionofdruginducedmRNAandprotein
dynamics[21]andcalciumion-mediatedsignalinginneutrophils
[8].Figure1showsthecorrespondingpharmacodynamicsignal
transductionmodelwhobolusresponisanErlangdistribution
withparametersbandc,tically,theErlangdis-
tributionreprentsthetimerequiredtocarryoutaquenceof
ctaskswhodurationsareidentical,exponentialprobabilitydis-
madistributionisalsoexcellentforapproxi-
matingpopulationabundancesfluctuatingaroundequilibrium[4].
ThusinthecontextofthePPInetwork,theErlangdistributioncan
beviewedasamechanisticallymotivatedmeasureoftheperturba-
tioncaudbythesuddenintroductionofthesourceproteintoa
targetproteininthenetwork.
TheErlangdistributionneedstobefurthermodifiedtoreflectnet-
turbationinducedbythesourceprotein
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page4
bbbb
Bolus
Input
Erlang
Output
Node
1
Node
2
Node
c
Figure1:Thepharmacodynamicthetimeconstantfor
signaltransferandcisthenumberofcompartments.
A
B
C
D
E
L
H
K
I
M
J
F
G
N
NodeA
A:5.0
F:4.0
G:0.4741
H:0.0881
NodeB
A:5.0
F:0.5057
G:0.0396
H:0.0054
NodeC
A:5.0
F:0.5057
G:0.0396
H:0.0054
NodeD
A:5.0
F:0.5057
G:0.0396
H:0.0054
NodeE
A:5.0
F:0.5057
G:0.0396
H:0.0054
NodeF
A:5.0
F:4.0
G:3.0
H:0.8428
NodeL
A:0.7902
F:4.0
G:0.4741
H:0.0881
NodeG
A:0.7902
F:4.0
G:3.0
H:4.0
NodeN
A:0.7902
F:4.0
G:3.0
H:0.8428
NodeK
A:0.0084
F:0.0881
G:0.4741
H:4.0
NodeH
A:0.1101
F:0.8428
G:3.0
H:4.0
NodeJ
A:0.0084
F:0.0881
G:0.4741
H:4.0
NodeI
A:0.0084
F:0.0881
G:0.4741
H:4.0
NodeM
A:0.0009
F:0.0133
G:0.0991
H:1.2642
Figure2:xcontainsthenumericalvaluesobtainedfromEquation2fromnodesA,F,G,andHtoothertarget
sforothernodesarenotshown.
nodeshouldbeproportionaltoitsdegreeandtofollowtheshortest
transductiontothetarget
proteinnode,theperturbationshoulddissipateateachintermediate
rallsignaltransduction
behaviorfunctionfromnodevtonodew(v=w)isthus:
S(v→w)=
d(v)
Q
i∈P(v,w)
d(i)
×F(x)(2)
whered(i)isthedegreeofnodei,P(v,w)isthetoftheallnodes
visitedenroutefromnodevtonodew,excludingthesourcenode
vandthetargetdestinationnodew,andF(x)isthesignaltrans-
=wanddistance(v,w)=0,
wedefineS(v→w)=d(v).Thenumeratorofthefirstterm
intherighthandsideofEquation2reprentsthedegreeofthe
sourcenodev,andthedenominatorreprentsthedissipationon
eachvisitingnodeontheshortestpathfromsourcenodevtotarget
iceoftheshortestpathismotivatedbythefinding
thatthemajorityoffluxprefersthepathofleastresistanceinmany
,thefirstterminthe
righthandsideofEquation2reprentsthetopologicaleffectof
ondtermintherighthand
sideofEquation2reprentsthebiologicaleffectofsourcenodev
ore,
thenodesthatscorethehighestvalueontargetnodewwillbethe
mostinfluentialnodesonnodewbiologicallyandtopologically.
Figure2demonstratesthesignaltransductionbehaviorofasmall
eaofunder-
standing,onlythesignalsfromnodeA,F,G,andHareprented,
althoughsignalsshouldbepropagatedfromeachnodeinthenet-
xinFigure2containsthesignalassdbythe
Equation2fromnodesA,F,G,andHtoothertargetnodes,e.g.,
5.0,0.5057,0.0396,0.0054arethesignalsassdfromnodes
A,F,G,andH,respectively,umericalvalues
illustrateoveralleffectsofcombiningthenetworktopologywith
thesignaltransductionmodelfromsourcenodesA,F,G,andH
uently,nodeA,whichhasscoredthehighest
value,willbethemostinfluentialnodeonnodeEbiologicallyand
topologically.
2.2ClusteringModel
STMalgorithmsimulatestheperturbationfromeachnodetothe
othernodesinanetworkusingEquation2,whichreflectsthebi-
repren-
tativesarethenodesthatrecordthehighestscoresbyEquation2
oneverynodeinamodule,i.e.,theyarethemostinfluentialnodes
hesignaltrans-
ductionsimulation,eachnodelectsthemostinfluentialnodesas
ereprentatives,prelim-
inarymodulescanbeformedbyaggregatingeachnodeintoeach
y,the
preliminarymodulesaremergediftherearesubstantialintercon-
nectionsbetweenthem.
ThepudocodefortheSTMalgorithm,whichemploysthesignal
transductionfunctionofEquation2andademocraticreprenta-
tiveslectionalgorithm,orithm
involvesfourquentialprocess:
Process1:Computesignalstransducedbetweenallnodepairs.
Process2:Selectclusterreprentativesforeachnode.
Process3:Formationofpreliminaryclusters.
Process4:Mergepreliminaryclusters.
Process1propagatessignalsfromeachsourcenodeandrecordsthe
signalquantitiesoneachtargetnodeforallnodepairsaccordingto
lementationofProcess1isshownonlines
6-10oftheSTMalgorithminAlgorithm1.
InProcess2,eachnodeelectsthenodesfromwhichitreceives
thehighestsignalvalueasthereprentativesoftheclustersthat
mple,inFigure2,nodesA,B,
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page5
Algorithm1STM(G)
1:V:tofnodesingraphG
2:F(x):Transductionbehaviorfunction
3:S(v,w):arrivedsignalfromnodevtonodew
4:C:thelistoffinalclusters
5:PreClusters:thelistofpreliminaryclusters
6:foreachnodepair(v,w)v,w∈V,v=wdo
7:distance(v,w)←theshortestpathlengthfromnodevto
nodew
8:tparametercinfunctionasdistance(v,w)
9:signal(v,w)←S(v→w)
10:endfor
11:foreachnodev∈Vdo
12:entative←lectthebestscorednodewfornode
v
13:ifcluster
w==nullthen
14:makeclusterw
15:(v)
16:(clusterw)
17:el
18:cluster
(v)
19:endif
20:endfor
21:C←Merge(PreClusters)
Procedure1Merge(C)
1:C:theclusterlist
2:MaxPair:theclusterpair(c,k)withmaxinterconnections
amongallpairs
3::interconnectionsbetweenclusterpaircandk
4:MaxPair←findMaxPair(C,null)
5:≥thresholddo
6:newCluster←mergeMaxPaircandk
7:ReplaceclustercwithnewCluster
8:Removeclusterk
9:MaxPair←findMaxPair(C,newCluster)
10:endwhile
11:returnC
C,D,E,andFwillchoonodeAandnodesL,G,andNwill
choonodeF,whicharethebestscorednodesonthonodes,as
thereprentatives.
Eachpreliminaryclusterisinitializedbytakingitsreprentative
inaryclustersarethenaugmentedby
accumulatingeachnodetowardeachclusterthateachoftherep-
rentatives,whicharechonbythenode,rom
11-20inAlgorithm1containthereprentativelectionprocess
thatSTM
allowsoverlapsamongclustersbyopeningthepossibilityofmul-
tiplereprentativeswhichhavethetiescoreonanode,
example,nodeGpicksnodesFandH,whichhavethetiescoreon
nodeG,,Gwillbelongto
-
fore,overlapsoccurbetweentheclusterformedbynodeF,{F,G,
L,N},andtheclusterformedbynodeH,{G,H,I,J,K,M}.STM
identifiedthreepreliminaryclusters,{A,B,C,D,E,F},{F,G,L,
N},and{G,H,I,J,K,M},badonthechoiceofreprentatives
inFigure2.
Somepreliminaryclustersmaybemergediftheyhavesubstantial
otomeasurethedegreeofintercon-
nectivitybetweenclustersbythesimilarityoftwoclustersiandj
definedbelow:
Similarity(i,j)=
interconnectivity(i,j)
minsize(i,j)
(3)
whereinterconnectivity(i,j)isthenumberofconnectionsbe-
tweenclustersiandj,andminsize(i,j)isthesizeofthesmaller
ilarity(i,j)betweentwo
clustersiandjistheratioofthenumberoftheconnectionsbe-
intercon-
nectedclustersareiterativelymergedbadonthesimilarityofthe
rofclustersthathavethehighestsimilarityare
mergedineachiterationandthemergeprocessiteratesuntilthe
highestsimilarityofallclusterpairsislessthanagiventhreshold.
Thelectionofthethresholdformergingclustersisacriticalfac-
torforthefitically,wecanewhen
interconnectivity(i,j)≥minsize(i,j),clustersiandjhave
lusters,{A,B,C,D,E,F},
{F,G,L,N},{G,H,I,J,K,M},areobtainedaftertheProcess4
sters,{A,B,C,
D,E,F,G,L,N},{G,H,I,J,K,M},areobtainedaftertheMerge
processwhen1.0isudasthemergethreshold.
2.3ClusterAsssment
ThestructuresoftheclustersidentifiedbySTMandothercompet-
ingalternativeapproachesareassdusingveralmetrics.
Theclusteringcoefficient,C(v),ofanodevmeasurestheconnec-
tivityamongitsdirectneighbors:
C(v)=
2|
S
i,j∈N(v)
(i,j)|
d(v)(d(v)−1)
(4)
InEquation4,N(v)isthetofthedirectneighborsofnodev
andd(v)
connectednodeshavehighvaluesofclusteringcoefficient.
Degreecentralityordersnodesbythenumberoftheirdirectneigh-
bors,andbetweennesscentralitymeasuresthenodes’importance
fromtheinformationfland
betweennesscentralitycommonlyudtomeasuretheimportance
weenessCentrality,C
B
(v),isa
measureoftheglobalimportanceofanodethatasssthepro-
portionofshortestpathsbetweenallnodepairsthatpassthrough
weenessCentrality,C
B
(v)foranode
ofinterest,v,isdefinedby:
C
B
(v)=
X
s=v=t∈V
ρ
st
(v)
ρ
st
(5)
IntheEquation5,ρ
st
isthenumberofshortestpathsfromnodes
totandρ
st
(v)thenumberofshortestpathsfromstotthatpass
throughthenodev.
Theextenttowhichtheclustersareassociatedwithaspecificbi-
ologicalfunctionivaluatedusingap-valuebadonthehyper-
geometricdistribution[2].Thep-valueistheprobabilitythata
clusterwouldbeenrichedwithproteinswithaparticularfunction
-valueisgivenby:
p=1−
k−1X
i=0
„
C
i
«„
G−C
n−i
«
„
G
n
«
(6)
InEquation6,Cisthesizeoftheclustercontainingkproteins
withagivenfunction;Gisthesizeoftheuniversaltofproteins
-
cauthep-valuesarefrequentlysmallnumberswithpositiveval-
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page6
Figure3:Accumulationoflethalproteinsforvariouspercentilesof
degree(grayline),betweenesscentrality(dashedline)ortheSTM
signaltransductionmetric(solidline).Theresultsareshownfor
thetop555proteinsobtainedfromtheyeastPPInetworkandare
ordered;thehighestvaluesofthemetricsareclosttotheorigin.
uesbetween0and1,thenegativelogarithms(toba10,denoted
-logp)areud.A-logpvalueof2orgreaterindicatesstatistical
significanceatα=0.01.
ThedensityofasubgraphsinaPPInetworkismeasuredby:
D
s
=
2e
n(n−1)
(7)
InEquation7,nisthenumberofproteinsandeisthenumberof
interactionsinasubgraphsofaPPInetwork.
MENTALRESULTS
3.1ProteinInteractionData
siaewasobtainedfromtheDIPdataba
[3].Thisdatatinclude2526proteinsand5949filteredreliable
saeprovideim-
portanttestbedsforthestudyofthePPInetworkssinceitisa
well-studiedorganismforwhichmostproteomicsdataisavailable
fortheorganism,byvirtueoftheavailabilityofadefinedandrel-
ativelystableproteome,fullgenomeclonelibraries,established
molecularbiologyexperimentaltechniquesandanassortmentof
welldesignedgenomicsdatabas[3;10].
3.2BiologicalSignificanceofthePutativeMod-
uleReprentatives
OursignaltransductionmodelofEquation2providesavehicle
toquantitativelymeasurethedegreeofbiologicalandtopologi-
calinfluenceofeachproteinonotherproteinsinthePPInetwork.
Themostinfluentialproteins,thatis,thehighestscorednodes,are
uatethebiologicalsignificance
ofthemostinfluentialproteins,weannotatedthelethalityofeach
proteinintheyeastPPInetworkaccordingtotheMIPSlethality
ityisacrucialfactortocharacterizethebiologicales-
terminedbyexaminingwhethera
moduleisfunctionallydisruptedwhentheproteinisknockedout.
WeobtainedtheproteinlethalityinformationfromMIPSdataba
[19],d
that233proteinsoutofthetopscored555proteinsarelethal.
-
berofproteinnodesincludedforincreasingpercentilesofthede-
gree,a
areshownfor555genes,obtainedfromtheyeastPPInetwork,with
ca,theresults
-
ure3showsthattheperformanceoftheSTMmetricinpredicting
lethalityiscomparabletothatofdegreeandbetweenessapproaches
forupto150nodes.
3.3ClusteringPerformanceAnalysis
Experimentally,weperformedSTMalgorithmontheyeastPPI
datatusingvariousmergethresholdvaluestofindthebestthresh-
mentsusing0.5,1.0,1.5,2.0,
2.5,and3.0asthemergethresholdwereperformedoneachdata
ultsshowthatwhenthemergethresholdislessthan
1.0,clustersthatdonothavesubstantialsimilarityaremerged;and
whenthemergethresholdisgreaterthat1.5,mergingldomoc-
snomuchperformancedifferencewhenthevalues
erimentwhen1.0isudas
themergethresholdshowedthebestperformance.
3.3.1ClusterAnalysis
555preliminaryclustersareobtainedfromtheyeastPPInetwork
e1,all60
clustersthathavemorethan4proteinsarelisted,anditalsoshows
theirtopologicalcharacteristicsandtheirassignedmolecularfunc-
litatecriticalasss-
ments,thepercentageofproteinsthatareinconcordancewiththe
majorassignedfunction(hits),thediscordantproteins(miss)and
he60clusters,thelargest
onecontains210proteinsandthesmallestonecontains5inthem.
Onaverage,wehave40.1proteinsinacluster,andtheaverageden-
sityofthesubgraphsoftheclusterxtractedfromthePPInetwork
-logpvaluesofthemajorfunctionidentifiedineach
clusterisalsoshownandthevaluesprovideameasureoftherela-
tiveenrichmentofaclusterforagivenfunctionalcategory:higher
ultsdemon-
stratethattheSTMmethodcandetectlargebutsparlyconnected
hval-
uesof-logp(valuesgreaterthan2.0indicatestatisticalsignificance
atα<0.01)indicatethatclustersaresignificantlyenrichedforbi-
ologicalfunctionandcanbeconsideredtobefunctionalmodules.
Asaresult,ourmethodcanclearlyidentifylargermodulesthat
havelowdensitybutstillbiologicallyenrichedaswecanefrom
thesize,thedensity,andtheP-valueoftheclustersinTable1.
Figure4exhibitsthedistributionofthehit,miss,andunknown
percentageofmemberproteinswiththeassignedfunctionforeach
dthat
mostoftheproteinsinaclusterhavethesamefunctionsthatare
assignedasamainfunctionfortheclusterasshowninFigure4.
3.3.2ComparativeAnalysis
TheresultsinTable2and3fortheyeastPPIdatatshowthat
STMgenerateslargerclusters;theclustersidentifiedhadp-values
thatare2.2ordersofmagnitudeorapproximately125-foldlower
thanQuasiclique,thebestperformingalternativeclusteringmethod,
-valuesforthecellularlocalizationare
earthatthe
clustersidentifiedbySTMdespitebeinglargerhavelowp-values.
Althoughp-valuesgenerallydecreawithincreasingclustersize,
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page7
Distribution
ClusterSizeDensityHDU-LogpFunction
12140.01924.769.65.643.9Nucleartransport
21880.01569.125.05.836.4CellcycleandDNAprocessing
31810.02222.072.35.517.2Cytoplasmicandnuclearproteindegradation
4
1700.02846.442.910.531.6Transportedcompounds(substrates)
5
1310.02837.455.76.828.6Vesiculartransport(Golginetwork,etc.)
61250.03060.833.65.632.2tRNAsynthesis
71130.02719.471.68.811.8Actincytoskeleton
8790.04517.773.48.812.3Homeostasisofprotons
9780.03326.962.810.212.5Ribosomebiogenesis
10760.04138.159.22.620.2rRNAprocessing
11720.0305.684.79.76.2Calciumbinding
12
680.06466.125.08.844.5mRNAprocessing
13
610.04140.952.46.511.5Cytoskeleton
14580.06472.427.60.037.4Generaltranscriptionactivities
15530.04815.071.613.27.9MAPKKKcascade
16500.06466.032.02.033.5rRNAprocessing
17450.05524.473.32.211.1Metabolismofenergyrerves
18
440.05859.036.34.55.1Metabolism
19
390.07210.289.70.07.3Cell-celladhesion
20
360.12558.336.15.516.9Vesiculartransport
21
290.09155.144.80.08.3Phosphatemetabolism
22280.07414.278.57.14.5Lysosomalandvacuolarproteindegradation
23270.11929.666.63.77.3Cytokinesis(celldivision)/ptumformation
24260.15353.846.10.028.6Peroxisomaltransport
25250.09028.068.04.04.6RegulationofC-compoundandcarbohydrateutilization
26
250.11668.0284.012.9Cellfate
27
220.15159.036.34.511.4DNAconformationmodification
28
210.14776.119.04.723.9Mitochondrialtransport
29200.20075.020.05.024.0rRNAsynthesis
30190.22878.915.75.217.9Splicing
31170.22070.529.40.019.7Microtubulecytoskeleton
32170.18323.576.40.08.2Regulationofnitrogenutilization
33
150.30486.613.30.031.3Energygeneration
34
140.14250.042.87.19.0SmallGTPamediatedsignaltransduction
35
130.56476.923.00.015.9Mitosis
36
130.35884.615.40.012.4DNAconformationmodification
37130.41069.223.07.617.63’-endprocessing
38130.17961.530.77.66.7DNArecombinationandDNArepair
39120.19616.675.08.33.9Unspecifiedsignaltransduction
40
120.36358.341.60.014.7Posttranslationalmodificationofaminoacids
41
120.16616.675.08.32.4Autoproteolyticprocessing
42
110.21854.545.40.02.9Transcriptionalcontrol
43
110.20072.727.20.08.2Enzymaticactivityregulation/enzymeregulator
44100.46680.020.00.014.8Translationinitiation
4590.36177.722.20.012.8Translationinitiation
4680.32150.037.512.55.6Metabolismofenergyrerves
4780.32175.025.00.09.0Modificationbyubiquitination,deubiquitination
48
80.32137.562.50.03.7Mitosis
49
70.33342.857.10.03.5DNAdamagerespon
5070.33357.128.514.24.1Vacuolartransport
51
70.28528.571.40.04.4Biosynthesisofrine
5260.33350.033.316.62.38Modificationbyphosphorylation,dephosphorylation,etc.
5350.4001000.00.07.0Meiosis
5450.6001000.00.07.0Vacuolartransport
5550.4001000.00.08.5ERtoGolgitransport
56
50.40020.040.040.01.8cAMPmediatedsignaltransduction
5750.50040.040.020.03.1Oxidativestressrespon
58
50.50080.020.00.04.4Intracellularsignalling
5950.60040.060.00.04.2Tetracyclicandpentacyclictriterpenes
6050.40060.040.00.04.1Mitochondrialtransport
Table1:firstcolumnisaclusteridentifier;theSizecolumnindicatesthenumberofproteins
ineachcluster;theDensityindicatesthepercentageofpossibleproteininteractionsthatareprent;theHcolumnindicatesthepercentageofproteins
concordantwiththemajorfunctionindicatedinthelastcolumn;theDcolumnindicatesthepercentageofproteinsdiscordantwiththemajorfunctionandU
-logpvaluesforbiologicalfunctionareshown.
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page8
Figure4:Distributionofthethreeclassof60clusters:thehitpercentagewiththeassignedfunction,discordantpercentagefromthe
assignedfunction,andunknownpercentage
thedecreasinp-valuescanoccuronlywhenthenullhypothe-
-valuesreflecttheconfidencethatthedifferences,
ifprent,fidenceinanygiven
resultincreaswhentheareobtainedinalargersampleandin
,thedependenceofp-valuesonsamplesizeisintu-
-valuexpressthestrengthofevidenceagainstthenull
hypothesistoaccountforboththesamplesize,theamountofnoi
ore,theSTMclustershavelowp-values
becautheyareenrichedforfunctionandnotsimplybecauthey
arelarger.
Tables2and3demonstratethatSTMoutperformstheotherexist-
acomparisonwith6otherexistingap-
proaches,Maximalcliques[24],Quasicliques[2],Samantha[23],
Minimumcut[14],Betweennesscut[6],andMCL[26].Thecom-
parisonontheclustersizemorethan4isinTable2andonthe
blesshowthatoursig-
naltransductionmodelbadmethodgeneratesconsiderablylarger
clusters,andtheidentifiedclustersbyourmethodhaveatleast2or-
dersofmagnitudehigherP-valuethantheothersonbothfunction
andlocalizationcategories.
QuasicliqueandMaximalcliquediscarded80.8%and98.4%nodes
duringclusteringprocess,eventhoughtheyidentifiedtheclusters
liqueandSaman-
thadiscarded86.7%and93.3%nodes,eventhoughtheyidentified
theclusterswithrelativelyhighp-valuesintheclusterswithsize
rimportantstrengthofSTMis
thatthepercentageofproteinsthatarediscardedtocreateclusters
is7.8%,whichismuchlowerthantheotherapproaches,which
haveanaveragediscardpercentageof59%.TheyeastPPIdatat
isrelativelymodularandthebottom-upapproaches(e.g.,maximal
cliqueandquasicliquemethods)generallyoutperformedthetop-
downapproaches(exemplifiedbytheminimumcutandbetwee-
nesscutmethods)onfunctionalenrichmentasassdby-log
rbecaubottom-upapproachesarebadonconnec-
tivityofdenregions,thepercentagesofdiscardednodesforthe
bottom-upmethodsarealsohigherthanSTMandthetop-downap-
,wealreadyhaveshownthatthefunctionalmodules
havefairlylowdensityandarbitraryshapeswithlongdiameter.
So,discardingthosparlyconnectedproteinscouldbeafatal
decisionwhichmightresultedintheimportantbiologicalinforma-
uently,STMisversatileanditsperformanceon
biologicalfunctionandlocalizationenrichment,clustersize,and
discardrateissuperiortothebestoftheothersixmethodsonboth
datats.
3.3.3TopologicalAnalysis
Toanalyzetheclusteringresultsvisually,weobrvedthetopo-
logicalshapesofthedetectedclustersbySTMandtheirassociated
functionalcategoriesthatwereassignedfromMIPSdataba.4ex-
ampleclusterswhichhavevisuallyrecognizablesizearelected
graphsofthechon
clusterswereextractedfromtheyeastPPInetworkanddisplayed
Sfunction
categoriesthatwereassignedtoeachoftheclusterswerealso
extractedfromtheyeastPPInetworkanddisplayedontheright
thedetectedclustershave
-
ticethatwecaneasilyfindoutthatthediameterandthedensityof
thoexampleclustersandtheirassignedfunctionsarefairlylong
ticethatitisnotdifficulttofindsingletons
ineachfunctionalcategorysubgraphs,whichmeansthatthereare
memberswhichdonothaveadirectphysicalinteractionwithinthe
functionalcategorythattheybelongto.
3.4ComputationalComplexityAnalysis
STMisfundamentallyestablishedonallpairsshortestpatharch-
problemcanbesolvedinO(V2logV+VE)timeifitisimple-
mentedusingJohnson’salgorithm[13],whereVisthenumberof
easuringthe
distancebetweenallnodepairs,formationofpreliminaryclusters
takesO(V)untoftimerequiredtofindthebestclus-
terpairthathasthemostinterconnectionsisO(k2logk)byusing
heap-badpriorityqueue,wherekisthenumberofpreliminary
clusters[15].TheMergingprocessneedstofindtheclusterpair
whichhasthemostinterconnections,andittakesO(k2logk)time
econditeration,findingthe
bestclusterpairtakesO(klogk)timesincetheclusterpaircom-
parisonsareneededonlybetweenthenewlymergedclusterand
maximumk,thenumberofprelimi-
naryclusters,isatmostO(V)inthecaofthefullyconnected
graph,thereforetheMergingprocesstakesO(V2logV)
kismuchsmallerthanVinsparnetworksliketheYeastPPI
otaltimecomplexityofouralgorithmisbounded
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page9
MethodNumberSizeDiscard(%)FunctionLocation
STM6040.17.813.77.42
Maximalclique1205.6598.410.67.93
Quasiclique10311.280.811.56.58
Samantha647.979.99.164.89
Minimumcut11413.535.08.364.75
Bwtweennesscut18010.2621.08.194.18
MCL1639.7936.78.183.97
Table2:ComparisonofSTMtocompetingclusteringmethodsfortheyeastprotein-proteininteractiondatatforclusterswith5ormore
bercolumnindicatesthenumberofclustersidentifiedbyeachmethod,theSizecolumnindicatestheaveragenumber
ofproteinsineachcluster;theDiscard%-logpvaluesforbiological
functionandcellularlocationareshown.
MethodNumberSizeDiscard(%)FunctionLocation
STM4552.411.516.89.01
MaximalcliqueN/AN/AN/AN/AN/A
Quasiclique4616.786.715.39.34
Samantha1712.393.315.97.65
Minimumcut4424.355.014.88.78
Bwtweennesscut7814.450.511.36.05
MCL5516.769.411.55.42
Table3:ComparisonofSTMtocompetingclusteringmethodsfortheyeastprotein-proteininteractiondatatforclusterswith9ormore
tnoteisthesametoTable2.
bythetimeconsumedincomputingthedistancebetweenallnode
pairs,whichisO(V2logV+VE).
SION
Wehavestudiedthatthetopologicalshapesofthesubgraphsof
MIPSfunctionalcategoriextractedfromthePPInetworkare
wounex-
pectedpropertiesoffunctionalcategoriesprohibitedotherexist-
ingapproachesfromdetectingfunctionalmodulesfromPPInet-
iveexcessofemphasisondensityand
interconnectivityintheexistingmethodscanbepreferentialfor
detectingclusterswithrelativelybalancedroundshapesandlimit
ompletenessofclusteringisanotherdistinct
drawbackofexistingalgorithms,whichproducemanyclusterswith
ferenceforstronglyconnected
nodesresultsinmanyweaklyconnectednodesbeingdiscarded.
Moreover,consideringonlythetopologicalpropertiesandignoring
thebiologicalcharacteristicsofthenetworkalsocandamagethe
effectivenessofclustering.
Inthispaperwehavepropodanovelclusteringmethodbad
head-to-headcomparisons,theSTMoutperformedcompetingap-
proachesandiscapableofeffectivelydetectingbothdenand
sparlyconnected,biologicallyrelevantfunctionalmoduleswith
nowledge,thisisthefirstdescriptionofthe
uofsignaltransductionbadapproachforthisapplication.
Overwhelmingperformanceofourapproachhasbeendemonstrated
eratedbig-
gersizeclusterswitharbitraryshape,andthoidentifiedclusters
aremorebiologicallyenriched,i.e.,higherP-value,eventhough
remorethan5%ofunannotated
proteinsintheidentifictionofthounanno-
tatedproteinscanbepredictedaccordingtotheirassignedmain
tenessofourclusteringmethod
methoddiscardedonlyabout7.8%ofproteinswhichistremen-
douslylowerthantheotherapproachesdid,59%
conclusion,STMhasstrongpharmacodynamics-badunderpin-
ningsandisaneffective,versatileapproachforanalyzingprotein-
approachcontainsaframeworkfor
rationallyincorporatingreactionrates,proteinconcentrationsand
d
thereforehavepotentialapplicationsinthedrugdiscoveryandde-
velopment.
NCES
[1]Bader,matedmethodfor
findingmolecularcomplexesinlargeproteininteractionnet-
informatics,4:2,2003.
[2]Bu,gicalstructureanalysisoftheprotein-
cAcids
Res.,31:2443–2450,2003.
[3]Deane,C.M.,Salwinski,L.,Xenarios,enberg,D..
Proteininteractions:twomethodsforasssmentoftherelia-
oteomics,
1:349–356,2002.
[4]Dennis,il,G.P..Thegammadistributionand
weightedmultimodalgammadistributionsasmodelsofpop-
ences,68:187–212,1984.
[5]Gavin,..Functionalorganizationoftheyeastpro-
,
415:141–147,2002.
[6]Girvan,man,itystructurein
,99:7821–7826,2002.
[7]onalcartographyof
,433:895–900,2005.
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page10
YBL099W
YNL315C
YJL180C
YJR121W
YDR322C-A
YLR295C
YPL271W
YDL004W
YDR377W
YML081C-A
YPR020W
YPL078C
YDR298C
YKL016C
YDL181W
(a)
YBL099W
YBR039W
YDL004W
YDL181W
YDR298C
YDR322C-A
YDR377W
YJR121W
YKL016C
YLR295C
YML081C-A
YPL078C
YPL271W
YPR020W
(b)
YNR017W
YIL022W
YHR005C-A
YJL143W
YEL020W-A
YML054C
YGR181W
YJR135W-A
YPL063W
YBR091C
YDL217C
YJL054W
YOL034W
YOR297C
YJR045C
YFL016C
YGR254W
YJL064W
YMR203W
YNL131W
YNL121C
(c)
YBL030C
YBL099W
YBR039W
YBR085W
YBR091C
YBR291C
YDL004W
YDL198C
YDL217C
YDR298C
YDR322C-A
YEL020W-A
YEL030W
YER017C
YER154W
YGR082W
YGR101W
YGR181W
YHR002W
YHR005C-A
YIL022W
YIL114C
YJL054W
YJL133W
YJL143W
YJR045C
YJR077C
YJR095W
YJR121W
YJR135W-A
YKL016C
YKL120W
YKR065C
YLL024C
YLR008C
YLR259C
YLR295C
YML042W
YML062C
YMR089C
YMR203W
YNL055C
YNL064C
YNL121C
YNL131W
YNR017W
YOR232W
YOR297C
YPL063W
YPL078C
YPL271W
YPR021C
(d)
YHR166C
YKL022C
YFR036W
YBL084C
YLR127C
YOR249C
YGL240W
YLR102C
YDL008W
YDR118W
YER161C
YER047C
YGR225W
(e)
YAL047C
YBL084C
YBR109C
YCL029C
YDL008W
YDL064W
YDR118W
YDR159W
YDR457W
YEL061C
YFL008W
YFL037W
YFR031C
YFR036W
YGL003C
YGL075C
YGL116W
YGL240W
YHR129C
YHR166C
YKL022C
YKL049C
YKR048C
YKR054C
YLR086W
YLR102C
YLR127C
YLR226W
YML064C
YML085C
YML124C
YMR190C
YMR198W
YMR294W
YNL172W
YNL307C
YOR058C
YOR249C
YPL155C
YPL174C
YPL204W
YPR141C
(f)
YKR002W
YGR048W
YDR390C
YNL317W
YJR093C
YPR107C
YAL043C
YLR277C
YKL059C
YGR156W
YLR115W
YNL222W
YDL094C
(g)
YAL043C
YDR228C
YDR301W
YER032W
YGL044C
YGL122C
YGR156W
YGR178C
YJR093C
YKL059C
YKR002W
YLR115W
YLR277C
YMR061W
YNL317W
YOL123W
YOR250C
YPR107C
(h)
Figure5:Thesubgraphsof4exampleclustersidentifigraphsof4example
clustersdetectedbySTMareegraphsoftheir
associatedfunctionalcategories,whichareassignedfromMIPSdatabaforeachoftheclusters,arealsoextractedfromyeastPPI
networkanddisplayedontherightcolumnoftheirassociatedclusters.
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page11
[8]Hallett,tit,E.J..StochasticeventsunderlieCa2+
.,186:1–6,1997.
[9]Hartuv,E.,Shamir,eringAlgorithmbadGraph
ationProcessingLetters,76:175–181,
2000.
[10]Hishigaki,H.,Nakai,K.,Ono,T.,Tanigami,agi,
mentofpredictionaccuracyofproteinfunctionfrom
protein–,18:523–531,2001.
[11]Ho,aticidentificationofproteincomplexes
,
415:180–183,2002.
[12]Ito,ehensivetwo-hybridanalysistoexplore
,98:4569–4574,2001.
[13]Johnson,ficientalgorithmsforshortestpathsinspar
CM,24:1–13,1977.
[14]Johnson,N.L.,Kotz,akrishnan,uousuni-
ley&sons,NewYork,NY,
1994.
[15]Karypis,G.,Han,E.-ar,eon:Hier-
Com-
puter:Specialissueondataanalysisandmining,volume32,
pages68–75,1999.
[16]King,A.D.,Przulj,isica,ncomplexpredic-
ormatics,20:3013–3020,
2004.
[17]Letovsky,if,tingproteinfunctionfrom
protein/proteininteractiondata:aprobabilisticapproach.
Bioinformatics,19:i197–i204,2003.
[18]Lin,dgeDiscoveryinBioinformatics:Tech-
niques,Methods,andApplications,chapterClusteringmeth-
odsinprotein-proteininteractionnetwork,ley
&SonsInc.,2006.
[19]Mewes,H.W..MIPS:analysisandannotationofpro-
cAcidRearch,
34:D169–D172,2006.
[20]Pereira-Leal,J.B.,Enright,ounis,-
tionoffunctionalmodulesfromproteininteractionnetworks.
Proteins,54:49–57,2004.
[21]Ramanathan,rsionmodelforcellularsignaltrans-
.,19:1544–1548,2002.
[22]Samanta,ng,S..Predictingproteinfunctions
fromredundanciesinlarge-scaleproteininteractionnetworks.
PNAS,100:12579–12583,2003.
[23]Samanta,ng,anciesinlarge-scalepro-
.,100:12579–
12583,2003.
[24]Spirin,ny,L.A..Proteincomplexesandfunctional
,100:12123–12128,
2003.
[25]Uetz,ehensiveanalysisofprotein-protein
,403:623–
627,2000.
[26]vanDongen,S..TechnicalReportINS-R0010:Aclusteral-
alRearchInstituteforMathe-
maticsandComputerScience,2000.
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page12
ProteinFoldingTrajectoriesAnalysis:Summarization,
EventDetectionandConnsusPartialFoldingPathway
Identification
HuiYang
.&Eng.
TheOhioStateUniversity
2015NeilAvenue
Columbus,OH43210,USA
yanghu@-
Srinivasan
Parthasarathy
∗
.&Eng.
TheOhioStateUniversity
2015NeilAvenue
Columbus,OH43210,USA
srini@
DuyguUcar
.&Eng.
TheOhioStateUniversity
2015NeilAvenue
Columbus,OH43210,USA
ucar@
ABSTRACT
Moleculardynamicssimulationsareemployedtoexplainthefor-
mationof3Dproteinstructures,afundamentalyetunsolvedprob-
ivecomparisonand
reprentationofthesimulationdataisamajorstepinunderstand-
paper,weex-
ploretheuofspatio-temporalassociationpatternstodiscover
-
poanapproachthatemploysthesimplicityofcontactmapsto
effectivelyanalyzeandutilizeinformationin3Dfoldingsimula-
hodalsoallowsonetoperformcrosscomparison
ofmultipletrajectoriestoidentifycriticaleventsorpartialfolding
iricalresultsonthe
foldingtrajectoriesoftheproteinBBA5demonstratetheefficacy
ofthepropodapproach.
UCTION
Thethreedimensional(3D)nativestructuresofproteinshaveim-
tandingthestructureof
aproteinenablesustoexplorethefunctionoftheprotein,explain
substrateandligandbinding,performrealisticdrugdesignandpo-
teinfolding
problemisthereforeoneofthemostfundamentalyetunsolved
orchal-
lengeinsimulatingtheproteinfoldingprocessisitscomplexity.
Snowetal.[15]statethatperformingaMolecularDynamics(MD)
simulationonamini-proteinforjust10μswouldrequiredecades
ding@homedis-
tributedcomputingproject[14]recentlypropodusingworldwide
distributedcomputingtotackleproteinfoldingsimulations.
Withtheincreasingnumberoftrajectoriesproducedbydistributed
computing,thereisaneedtoanalyze,understand,andmanagethe
usly,rearchershaveexaminedveralsum-
marystatistics(ofgyration,rootmeansquaredeviation
(RMSD))ghsummarystatisticsarecom-
monlyudforcomparison,theycanonlycaptureabiadand
ly,Rusletal.
[13]suggestedusinggeometricspannersformappingasimulation
nsiderus-
∗
Contactauthor.
inggeometricspannerstodiscovertheproximitybetweendifferent
gmentsofaproteinacrossarangeofscales,andtrackthechanges
ofsuchproximityovertime.
Toovercomethedifficultiesinmanagingandanalyzingthelarge
amountofsimulationdata,Berraretal.[2]propoddesigning
bedtheirwarehouinagrid
environmenttoenablethesharingoftheactualsimulationdata.
Theyalsopropoimplementingatofdataminingalgorithmsto
facilitatecommonlyneededdataanalysistasks.
Inthispaper,wepropoamethodtoanalyzefoldingtrajecto-
riesoftheminiproteinBBA5producedbytheFolding@home
izethespatio-temporaldataminingframeworkthat
wehavedevelopedanddescribedearlierforthepurpoofman-
agingandanalyzingsuchdata[21].Asmentionedin[21],this
frameworkisdesignedtoanalyzespatio-temporaldataproduced
inveralscientifireviouswork,wehaveap-
pliedthisframeworkto8732proteinstakenfromtheProteinData
Banktoidentifystructuralfingerprintsfordifferentclassofpro-
teins[19].Eachproteinisassociatedwithatofobjectsthatare
ctivelycapturespatial
relationshipsamongobjects,wedefineSpatialObjectAssociation
Patterns(SOAP).Furthermore,byassociatingSOAPswithproteins
indifferentproteinclass,weestablishtheconnectionsbetween
differenttypesofSOAPsandproteinclass.
Itisapparentthatproteinfoldingtrajectorydatahavebothspa-
oteininaMDsimulation
consistsofanumberofresiduesspatiallylocatedinthe3Dspace
ameofthetrajectorycanberepre-
ntedasacontactmapin2D,capturingthepair-wi3Ddistance
rtoourpreviouswork[19],weextractnon-local
uanentropy-bad
bit-patternsarefurtherassociatedtoformspatialobjectassocia-
tionpatterns(SOAPs).BytheuofSOAPs,weeffectivelyrepre-
ntandanalyzefoldingtrajectoriesproducedbyMDsimulations.
Amajoradvantageofthisreprentationisitsappropriatenessfor
efitsofour
frameworkinclude:
•Effective,informativeandscalablereprentationoffold-
ingsimulations:WereprenteachframebyatofSOAPs,
whereeachSOAPinturncharacterizesthespatialrelation-
ship(orinteractionsinthefoldingca)amongmultiplebit-
renotonlyeasilyobtainablebutalso,as
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page13
wewillshow,areabletocapturelandmarksalongafolding
trajectory.
•Cross-analysisoftrajectoriestorevealaconnsuspar-
tialpathway:ByreprentingeachframeasatofSOAPs,
analysisincludesdetectingcriticaleventsandidentifyingcon-
nsuspartialfoldingpathwaysacrosstrajectories.
ISOFPROTEINFOLDINGTRA-
JECTORIES
2.1ProteinFoldingTrajectories
Advancesinhigh-performancecomputingtechnologiesandmolec-
ulardynamicshaveledtosuccessfulsimulationsoffoldingdynam-
icsfor(small)proteinsatatomisticlevel[12].Suchsimulations
resultinalargenumberoffoldingtrajectories,eachofwhichcon-
sistsofariesof3Dconformationsoftheproteinundersimu-
onformationsareusuallysampledregularly(e.g.,
every200fs)paper,wealsoreferto
r-
more,toreprentaproteinconformation,weadoptoneofthe
commonlyadoptedreprentationschemes,whereaconformation
isreprentedasaquenceofα-carbons(C
α
)locatedin3Dspace.
Asmentionedearlier,weobtainedtwofoldingtrajectoriesofthe
designedmini-proteinBBA5(ProteinDataBankID)fromthe
Folding@5is
ive
structure(orfold)ofBBA5showsaβ-hairpininvolvingresidues1-
includesanα-helixin-
ention,residuesare
numberedincreasinglyfromtheN-terminaltoC-terminalofapro-
1(a)
twofoldingtrajectories,referredtoasT
23
andT
24
respectively,
areofdifferentlength.T
23
consistsofariesof192conforma-
tions(orframes),whileT
24
nformationis
describedatatomisticlevelinPDBformatadoptedbytheProtein
DataBankprograms.
2.2ComparingConformationsofBBA5Across
Trajectories
Althoughbothtrajectoriesstartfromthesameextendedconfor-
mationasshowninFigure1(b),whenweexaminethevisualized
frames,theyemtoidentifytwoverydifferentfoldingprocess.
Figures1(c)and(d)illustratethelastframeinT
23
andT
24
respec-
emingdifferencemightbeattributedtothestochas-
ticnatureofthefoldingsimulationprocess[12;16].However,it
isalsodesirabletocharacterizethesimilarities(ordissimilarities)
acrossmultipletrajectories.
Tocomparetwotrajectories,akeyissuethatmustbeaddresd
is:howcanwecomparetwoproteinconformations?Severalmea-
sureshavebeencommonlyudtodosuchcomparison,includ-
ingRMSD(rootmeansquareddistance)[22],contactorder[9],
andnativecontacts[4].However,allthemeasuresaredesigned
rmore,
badonourempiricalanalysisofthemeasures,wenoticethat
theyaregenerallytoocoarandthuscanoftenbemisleading.
Evenmoreimportant,suchmeasuresfailtoidentifysimilarlocal
structures(ormotifs)especially
1/
2Pleareferto[12;16]fordetailsonthesimulationmodelem-
ployedtoproducesuchtrajectories.
nstratedinbothex-
perimentalandtheoreticalstudies,smallproteinsoftenfoldhierar-
chicallyandbeginlocally[1].Forinstance,ithasbeenshownthat
BBA5tendstofirstformcondarystructuressuchasβ-turnsand
α-helix,thenconformtoitsglobaltopology[16].Finally,assug-
gestedbyPande[12],bothsterics(localmotifs)andglobaltopol-
ore,
tocompareconformationsof(small)proteins,amorereasonable
-
over,itshouldalsotakethenativetopologyoftheproteinunder
studyintoaccount.
Tomeettherequirements,wepropothefollowingapproachto
,weloolypartitionthe
23residuesofBBA5intofourfragments:(i)F
1
:N-terminal1-10
β-hairpin;(ii)F
2
:C-terminal11-23α-helixfragment;(iii)F
3
:the
firsthalfofF
1
andthecondhalfofF
2
;and(iv)F
4
:thec-
ondhalfofF
1
andthefirsthalfofF
2
,i.e.,themiddlectionin
,werecognizethecondarystruc-
formationsaresaidtobe
similariftheydemonstratethesamecondarystructurepropen-
tance,theconformationpairin
Figure6(a)aresimilarasresiduesinF
1
,F
2
andF
4
frombothcon-
formationsindicateaβ-notethatthe
-
stance,inFigure6(d),wesaythetwoconformationshaveasimilar
structureinF
1
fragment,eventhoughtheβ-turnmotifshavedif-
ferentorientations.
Torealizethecomparisonofconformations,twomoreissuesmust
,howcanweeffectivelycaptureandrepre-
ntlocalmotifs?Second,howcanwereprenttheglobaltopol-
ogyofaconformationintermsoflocalmotifs?Toaddressthefirst
issue,weleveragethenon-localpatternsinproteincontactmaps.
Forthecond,wecharacterizethespatialarrangementamong
eSection3fordetails.
2.3FoldingTrajectoryAnalysis:Objectives
Therearetwogoalswewouldliketoachieveinanalyzingthe
,wewouldliketoaddressthefollow-
ingfoldingissuesforagiventrajectory:(1)todetect(oreven
predict)significantfoldingevents,includingtheformationofβ-
turns,α-helices,andnative-likeconformations;and(2)torecog-
nizethetemporalorderingofimportantfoldingeventsinthetra-
tance,betweenthetwocondarystructuresα-helix
andβ-hairpininBBA5,whichformarlier?Whatisorderingof
thetwoeventsprecedingaβ-hairpinformation:formationoftwo
extendedstrandsorformationoftheturn?
Ifthefirstgoalconcernsindividualtrajectories,thecondgoal
fically,wewouldliketoiden-
tifyasub-quenceofsimilarconformationsinbothtrajectories.
Thissub-quenceofconformationsisreferredtoastheconnsus
analogoustotheLongestCom-
monSub-quence(LCS)problem[5],butmuchmorechalleng-
,wearedealingwithtimeries
,wearenotlookingforanex-
-
stead,wearelookingforsimilarconformationsacrosstrajectories.
Wewouldliketopointoutthatthisworkisclolyrelatedtoour
previousworkonproteinstructuralanalysis[18]andourworkon
miningspatio-temporaldata[21].
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page14
40
45
50
55
60
65
70
-48
-46
-44
-42
-40
-38
-36
-34
-8
-6
-4
-2
0
2
4
6
8
10
12
N-terminal
C-terminal
0
5
10
15
20
25
30
35
40
15
20
25
30
35
40
45
10
15
20
25
30
35
40
45
N-terminal
C-terminal
14
16
18
20
22
24
26
28
30
32
34
20
22
24
26
28
30
32
34
16
18
20
22
24
26
28
30
32
34
36
N-terminal
C-terminal
16
18
20
22
24
26
28
30
32
34
18
20
22
24
26
28
30
32
34
15
20
25
30
35
40
N-terminal
C-terminal
(a)(b)(c)(d)
Figure1:DifferentconformationsofBBA5,whereeachpointcorrespondstoanα-carbon.(a)ThenativeNMRstructureofBBA5badon
datafromtheSCOPwebsite.(b)Theinitialconformationofbothfoldingtrajectories.(c)Thelastconformationinthefirsttrajectory.(d)The
lastconformationinthecondtrajectory.
THM
Inthisction,wedescribeindetailourapproachforanalyzing
ninFigure2,thisanalysis
consistsofthreemainphas:(I)Datapreprocessing,(II)Spatio-
temporalobjectassociationpatternmining,and(III)Trajectoryanal-
discusachphainfurtherdetails.
3.1DataPreprocessing
Asinourpreviousstudiesonproteinstructuralanalysis[18;19],
,in
thiswork,weusimilarmeasurestocreateandprocesssuchmaps.
Inorderforthisalgorithmtobelf-contained,webrieflygoover
rtoreadersto[18]formore
tion,wewillalsoexplaintherationalebehindveral
keystepsinthecontextofproteinfolding.
ContactMapGeneration
Whengeneratingcontactmaps,weconsidertheEuclideandis-
tancesbetweenα-carbons(C
α
)α-carbons
areconsideredtobeincontactiftheirdistanceiswithin8.5
˚
A.
Thus,foraproteinofNresidues,itscontactmapisanN×N
binarymatrix,wherethecellat(i,j)is1iftheithandjthα-
carbonsareincontact,ontactmapsaresym-
metricacrossthediagonal,weonlyconsiderthebitsbelowthedi-
rmore,wealsoignorethepairsofC
α
atomswho
distanceintheprimaryquenceis≤2,astheyaresuretobein
epesntiallytransformsthetwoBBA5trajectories
intotworiesofcontactmaps,witheachmapofsize23×23.
IdentifyingMaximallyConnectedBit-patterns
dge
position,
contactmap,aconnectedbit-patternisacollectionofbit-1posi-
tions,whereforeach1,-
respondingly,wedefineamaximally-connectedbit-pattern(also
referredtoasabit-patterninthisarticle)tobeaconnectedpat-
yasim-
pleregiongrowthalgorithmtoidentifyallmaximally-connected
of
352maximally-connectedbit-patternswereextractedfromthetwo
foldingtrajectoriesofBBA5.
Asdiscusdinourpreviousworkandelwhere[7;8;10;17],
suchbit-patternscaneffectivelycapturethecondarystructures
ontextofproteinfolding,wehaveobrved
thattheyarepowerfulenoughtoreprentawiderangeoflocal
venmeasureapproximatelythestrength
ofcondarystructurepropensityinaconformationbadonthe
tance,wehaveidentifiedbit-patternsthatcor-
respondto“premature”α-helicesandnative-likeα-helicesrespec-
orth,werefertothe3Dstructureformedbyallthe
participatingresiduesofabit-patternasthe3Dmotifofthebit-
pattern.
ClusteringBit-patternsintoApproximatelyEquivalent
Groups
Weapplyanentropy-badclusteringalgorithmtogroupthebit-
patternsintolclusters,wherethebit-patternsinaclustershow
similargeometricproperties(e.g.,shapeandsize).Thevalueof
lisdeterminedusinganentropy-badmeasurethatquantitatively
indicatesthequalityofaclusteringresult.(Pleareferto[18]for
details.)FortheBBA5foldingdata,theclusteringstepgroupsthe
352bit-patternsinto10clusters(ortypes).
Intuitively,the3Dmotifsofthebit-patternsinaclusterwillalso
verifiedbadonour
3illustratestherepre-
type0,asbit-patternsofthistype,unliketheothers,correspondto
awidevarietyof3Dmotifs.
Thisdemonstrates,toacertainextent,theadvantageofusing2D
tedly,
usingcontactmapsgreatlyreducesthecomputationalcomplexity
ofouralgorithm,thoughatthecostoflossinstructuralinformation.
Moreimportantly,byexploitingdifferentfeaturesincontactmaps
(bit-patternsinthiswork),weareabletoconnect2Dfeatureswith
a,byidentifying10typesofbit-
patternsincontactmaps,weindirectlyrecognize10different3D
structuralmotifsinthefoldingconformations.
Re-labelingBit-patternswithTheCorrespondingClus-
terLabel
Inthisstep,were-labelallthepreviouslyidentifiedbit-patterns
alabeledbit-pattern.
Itcanbereprentedasfollows:p=(trajID,frameID,listC
α
,
label).Here,trajIDidentifiesafoldingtrajectory,andframeID
indicatestheframewherepoccurs,listC
α
consistsofallpar-
ticipatingα-carbonsofp,identifiedbytheirpositioninthepri-
y,5,
label∈{g
0
,g
1
,···,g
9},correspondingtothe10approximately
equivalentgroups(ortypes).
BIOKDD06:6thWorkshoponDataMininginBioinformatics(withSIGKDDConference)page15
I:Datapreprocessing
1.1Generatecontactmapsforeveryconformationinthetwofoldingtrajectories
1.2Identifymaximallyconnectedbit-patternsinallcontactmaps
1.3Clusterbit-patternsintoapproximatelyequivalentgroupsbadongeometricproperties
1.4Re-labeleachbit-patternwithitscorrespondingclusterlabel
II:Discoveringfrequentspatio-temporalobjectassociationpatterns(SOAPs)
2.1Discoverfrequent(minLink=1)-SOAPsofbit-patternsineitherfoldingtrajectory
III:Foldingtrajectoryanalysis
3.1SummarizeeachfoldingtrajectorybadonfrequentSOAPs
3.2Detectfoldingeventsandrecognizetheorderingoffoldingeventsinatrajectory
3.3Identifytheconnsuspartialfoldingpathwayacrosstrajectories
Figure2:Mainstepstoanalyzeproteinfoldingtrajectories
3DMotifNotationbit-patternvs.3Dmotif
α-helixlikeα
#4#7#9
-111——
11111—–
—-11—-
—–11—
——11–
——-11-
——–11
——–1-
——–1-
16
18
20
22
24
26
28
30
32
20
22
24
26
28
30
32
34
24
26
28
30
32
34
36
38
-1——-
-11——
–11—–
—11—-
—111—
—1111–
—-1111-
—–1–1
12
14
16
18
20
22
24
26
22
24
26
28
30
32
34
26
27
28
29
30
31
32
-11—–
-111—-
—11—
—111–
—-111-
—-11–
12
14
16
18
20
22
24
26
28
30
32
22
24
26
28
30
32
34
36
28
28.5
29
29.5
30
30.5
31
31.5
32
32.5
β-turnlikeβ
#1#6#8
–1—
–11–
-1111-
-11—
-1—-
24
26
28
30
32
34
36
38
23
24
25
26
27
28
29
20
21
22
23
24
25
26
27
28
29
—-1–
-11111-
-1111–
26
27
28
29
30
31
32
33
34
35
36
21
22
23
24
25
26
27
28
29
30
17
18
19
20
21
22
23
24
25
26
—-1—
—-11–
-111111-
—111–
—11—
—11—
—11—
29
30
31
32
33
34
35
36
37
38
19
20
21
22
23
24
25
14
16
18
20
22
24
26
28
smallα-helixlikea
#3(α)#2(β)
-11—
-111–
—11-
—–1
14
16
18
20
22
24
26
28
20
21
22
23
24
25
26
27
28
29
25
25.5
26
26.5
27
27.5
28
28.5
29
29.5
30
30.5
-1—
-11–
-111-
-1—
26
27
28
29
30
31
32
33
34
24
25
26
27
28
29
30
31
32
33
19
20
21
22
23
24
25
26
27
smallβ-turnlikeb
parallelstrands
本文发布于:2022-11-24 21:43:32,感谢您对本站的认可!
本文链接:http://www.wtabcd.cn/fanwen/fan/90/14426.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
留言与评论(共有 0 条评论) |