Sequence analysis
JASSA: a comprehensive tool for prediction of SUMOylation sites and SIMs
Guillaume Beauclair 1,2,3,*,‡, Antoine Bridier-Nahmias 1,2,3,4, Jean-François Zagury 5, Ali Saïb 1,2,3,4,† and Alessia Zamborlini 1,2,3,4,†
1 CNRS UMR7212, Hôpital St Louis, 2 Inserm U944, Institut Universitaire d'Hématologie, Hôpital St Louis, 3 Université Paris Diderot, Sorbonne Paris Cité, Hôpital St Louis, 4 Laboratoire PVM, Conservatoire national des arts et métiers (Cnam) and 5 Laboratoire Génomique, Bioinformatique, et Applications, EA4627, Chaire de bioinformatique, Conservatoire national des arts et métiers (Cnam), Paris, France
*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the last two authors should be regarded as Joint Last Authors.
‡Present address: Institute of Virology, Hannover Medical School, Carl-Neuberg-Strasse 1, 30625, Hannover, Germany.
Associate Editor: John Hancock
Received on April 14, 2015; revised on June 5, 2015; accepted on June 25, 2015
Abstract
Motivation: Post-translational modification by the Small Ubiquitin-like Modifier (SUMO) proteins, a process termed SUMOylation, is involved in many fundamental cellular processes. SUMO proteins are conjugated to a protein substrate, creating an interface for the recruitment of cofactors harboring SUMO-interacting motifs (SIMs). Mapping both SUMO-conjugation sites and SIMs is required to study the functional consequences of SUMOylation. To define the best candidate sites for experimental validation we designed JASSA, a Joint Analyzer of SUMOylation site and SIMs.
Results: JASSA is a predictor that uses a scoring system based on a Position Frequency Matrix derived from the alignment of experimental SUMOylation sites or SIMs. Compared with existing web-tools, JASSA displays on par or better performances. Novel features were implemented towards a better evaluation of the prediction, including identification of database hits matching the query sequence and representation of candidate sites within the secondary structural elements and/or the 3D fold of the protein of interest, retrievable from deposited PDB files.
Availability and Implementation: JASSA is freely accessible at www.jassa.fr/. The website is implemented in PHP and MySQL, with all major browsers supported.
Contact: guillaume.beauclair@inserm.fr
Supplementary information:Supplementary data are available at Bioinformatics online.
1 Introduction
SUMOylation is a eukaryotic post-translational modification which consists in the reversible attachment of members of the Small Ubiquitin-like Modifier (SUMO) protein family to a protein substrate, resulting in the dynamic regulation of its biochemical properties. Proteins involved in many fundamental cellular processes like DNA repair, transcription control, chromatin organization, macromolecular assembly and signal transduction are SUMOylated [for a review see (Flotho and Melchior, 2013)]. Thus, it is not surprising that deregulation of SUMOylation is associated with various pathological conditions like neurological disorders, cancers and pathogen proliferation [for reviews see (Droescher et al., 2013; Sarge and Park-Sarge, 2009; Wilson, 2012; Wimmer et al., 2012)].
SUMO proteins are conjugated to a lysine (K) residue of the substrate through the sequential action of SUMO-specific activating (E1), conjugating (E2) and ligating (E3) enzymes, and de-conjugation relies on SUMO-specific proteases. Virtually all eukaryotes express SUMO proteins, mammals and plants harboring several paralogues, and the components of the conjugation pathway are highly conserved across eukaryote proteomes (Flotho and Melchior, 2013; Gareau and Lima, 2011).
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.
Bioinformatics, 31(21), 2015, 3483–3491
doi: 10.1093/bioinformatics/btv403
Advance Access Publication Date: 2 July 2015
Original Paper
Analysis of SUMO targets shows that the modified K is often embedded within the consensus sequence ΨKxE (where Ψ is a hydrophobic residue and x any amino acid), which is a binding site for Ubc9, the single E2 conjugating enzyme (Sampson et al., 2001). Extended variants of this motif were described, including negatively charged amino acid-dependent SUMOylation motifs (NDSM) (Yang et al., 2006), phosphorylation-dependent SUMOylation motifs (PDSM) (Hietakangas et al., 2006) and the phosphorylated SUMOylation motif (pSuM) (Picard et al., 2012), where a cluster of negatively charged and/or phosphorylatable residues downstream of the core motif promotes SUMO conjugation by strengthening the interaction between the substrate and Ubc9 (Mohideen et al., 2009). SUMOylation of the inverted consensus motif ([E/D]xKΨ) was also reported (Ivanov et al., 2007; Matic et al., 2010). However, not every sequence conforming to the consensus motifs is modified, likely because the environment of the target K must adopt a favorable conformation to be accessible to the SUMO machinery (Pichler et al., 2005). Notably, modification by SUMO also occurs at non-consensus sites. This is the case for ~50% of the SUMOylated substrates identified by the Vertegaal group (Hendriks et al., 2014).
SUMO creates an interface for the recruitment of protein cofactors that harbor short peptide sequences known as SUMO-interacting motifs (SIMs). Most SIMs feature a loose consensus sequence composed of 3–4 aliphatic residues often flanked by acidic and/or phosphorylatable amino acids (Kerscher, 2007). The hydrophobic core adopts a β-strand conformation that can be accommodated in a pocket formed by the α1-helix and the β2-strand of SUMO (Hecker et al., 2006; Song et al., 2005). Adjacent negatively charged residues control the affinity, the polarity and the paralogue-specificity of the SIM/SUMO interaction (Chang et al., 2011; Hecker et al., 2006; Meulmeester et al., 2008).
Mapping SUMO-conjugation sites and SIMs is mandatory to fully characterize the biological consequences of SUMOylation. Large-scale mass spectrometry (MS)-based proteomic studies allowed the identification of hundreds of proteins harboring SUMOylation sites and/or SIMs (Blomster et al., 2009; Hendriks et al., 2014; Impens et al., 2014; Matic et al., 2010; Tammsalu et al., 2014; Tatham et al., 2011). However, the use of this approach is limited by the transient nature of the modification, the small fraction of a protein that is SUMOylated and the difficulty of identifying the branched peptides resulting from the tryptic digestion of SUMOylated proteins. When MS data are not available, computer-aided prediction of SUMOylation sites and SIMs by in silico analysis represents a promising strategy to reduce the number of potential targets for experimental verification.
We developed JASSA (Joint Analyzer of SUMOylation site and SIMs) to provide a comprehensive overview of potential SUMOylation sites and SIMs. The prediction relies on a scoring system based on a Position Frequency Matrix (PFM) derived from the alignment of experimentally validated sequences. When compared with existing bioinformatics tools, JASSA displays on par or better predictive performances. To increase the reliability of the prediction, JASSA offers additional features such as identification of database (DB) hits matching the query sequence, systematic pattern search against extended motifs, analysis of the physico-chemical properties of adjacent residues and, when a PDB file is available, the possibility to represent candidate sites within the secondary structural elements and the 3D fold of the query protein. We believe that JASSA will provide valuable support for selecting the best candidate sites for experimental studies.
2 System and methods
2.1 Databases of SUMOylation sites and SIMs
The training DB of SUMOylation sites was generated by collecting data from PhosphoSite (Hornbeck et al., 2004), Teng et al. (2012) and NCBI (publications until January 2011) using 'SUMO' and 'SUMOylation' as keywords. It encompasses 877 unique, experimentally defined SUMOylated K residues from 505 proteins. The sequence of the 21-mer centered on the modified K was retrieved from UniProt (Apweiler et al., 2004). JASSA operates using any one of three clusters of experimental SUMOylation sites: ALL, DIRECT or INVERTED, which include all the sequences of the DB, the sequences following the direct ΨKxα consensus or the sequences following the inverted αxKΨ consensus, respectively (where Ψ is a hydrophobic residue; α is E or D and x is any amino acid) (see below).
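The 21-mer window described above can be sketched as follows. This is a minimal illustration rather than the JASSA implementation: the function name `window_21mer`, the 0-based indexing and the use of `-` to pad windows that run past a protein terminus are all assumptions, since the text does not say how terminal lysines are handled.

```python
def window_21mer(sequence: str, k_index: int, flank: int = 10, pad: str = "-") -> str:
    """Return the window of 2*flank+1 residues centered on the lysine at k_index.

    k_index is 0-based. Padding with `pad` at the termini is an assumption
    made for this sketch; the paper does not describe terminal handling.
    """
    if sequence[k_index] != "K":
        raise ValueError("the central residue must be a lysine (K)")
    left = sequence[max(0, k_index - flank):k_index]
    right = sequence[k_index + 1:k_index + 1 + flank]
    # Pad so that the modified K always sits at the middle position.
    return left.rjust(flank, pad) + "K" + right.ljust(flank, pad)


if __name__ == "__main__":
    print(window_21mer("MAVKSEPLT", 3))
```

With the default `flank=10`, the modified K always sits at index 10 of the returned string, which corresponds to position 0 in the numbering used in the text.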
The collection of SIMs consists of 102 non-redundant motifs from 66 proteins obtained from NCBI using '(SIM or SBD or SBM) and SUMO' as keywords (publications until January 2014). Putative SIMs from proteins interacting with SUMO in yeast two-hybrid screens, but not validated further, were not included.
2.2 Characterization of the SUMOylation sites DB and motifs clustering strategy
Most of the SUMOylation sites retrieved from the scientific literature and included in our training DB are from Homo sapiens (71.9%) and rodents (14.7%). The remaining motifs are from yeast, virus and plants (5.6, 3.1 and 2.5%, respectively) (Supplementary Fig. S1A). A pattern search against known SUMOylation consensus motifs showed that 598 sites (68.2%) of the DB fit the direct ΨKxα motif (where Ψ = A, F, G, I, L, M, P, V or Y; α = D or E; x = any amino acid). Among these, 26.3% are NDSM, 3.6% PDSM, 12% HCSM and 20% synergy control motif (Table 1). We also found that 80 sites of the DB (9.1%) follow the inverted αxKΨ consensus, while 30 (3.4%) fit both the direct and the inverted consensus motif. The remaining 229 sites (26.1%) do not conform to any of the motifs and are considered as non-consensus. Based on these observations we grouped the experimental SUMOylation sites in three clusters, which encompass all the sequences of the DB (ALL), the sites following the direct ΨKxα consensus (DIRECT) or the sites following the inverted αxKΨ consensus (INVERTED). We reasoned that, because the DB of JASSA is enriched in sequences that match the direct ΨKxα motif (>68%) while inverted motifs are underrepresented (<10%), using the whole DB could undermine the prediction of sites that fit the αxKΨ pattern.
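The clustering rule above can be illustrated with a short sketch. It assumes the weak-consensus hydrophobic set listed in the note of Table 1 (Ψ = A, F, G, I, L, M, P, V, W or Y), which includes W although the sentence above does not list it. The function name `classify_site` and the label strings are hypothetical, not part of JASSA.

```python
PSI = set("AFGILMPVWY")   # weak-consensus hydrophobic residues (Table 1 note); an assumption
ALPHA = set("DE")         # acidic residues


def _res(seq: str, i: int) -> str:
    """Residue at index i, or an empty string outside the sequence."""
    return seq[i] if 0 <= i < len(seq) else ""


def classify_site(seq: str, k: int) -> str:
    """Classify the lysine at 0-based index k as direct, inverted, both or non-consensus."""
    if _res(seq, k) != "K":
        raise ValueError("position k must be a lysine (K)")
    # Direct consensus: hydrophobic at -1 and acidic at +2.
    direct = _res(seq, k - 1) in PSI and _res(seq, k + 2) in ALPHA
    # Inverted consensus: acidic at -2 and hydrophobic at +1.
    inverted = _res(seq, k - 2) in ALPHA and _res(seq, k + 1) in PSI
    if direct and inverted:
        return "both"
    if direct:
        return "direct"
    if inverted:
        return "inverted"
    return "non-consensus"


if __name__ == "__main__":
    for peptide in ("AVKSE", "EAKVG", "EVKVE", "GSKTG"):
        print(peptide, classify_site(peptide, 2))
```

Under this reading, a site labelled "both" would be counted in the DIRECT and in the INVERTED cluster, which is consistent with the counts reported above (598 + 80 − 30 + 229 = 877).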
Next, we analyzed the local target protein context in a 21-mer window centered on the SUMOylated K (position 0) for each cluster with WebLogo3 (Crooks et al., 2004) (Fig. 1). The sequence logo for the cluster ALL agrees with the prevalence of the ΨKxα motif in the DB. At position −1 there is a higher occurrence of hydrophobic amino acids, most of which are residues with aliphatic side chains (67.5%, among which 27.7% are I, 12.7% are L and 23.7% are V), whereas aromatic amino acids are rare (5.4%, of which 4.6% are F) (Supplementary Table S1A). At position +2 acidic residues are significantly enriched (70% E, 5.4% D). Finally, no particular amino acid is overrepresented at position +1. As expected, the sequence logo of the cluster DIRECT is similar except that position −1 is occupied exclusively by hydrophobic residues, whereas E (~94%) and D (~6%) are the only amino acids found at position +2. Consistent with the fact that about a third of the SUMOylation sites of the collection match the NDSM or the PDSM motifs, acidic and/or phosphorylatable residues are enriched downstream of the core consensus motif (Fig. 1 and Table 1). The amino acid distribution around the SUMO-acceptor K is different for the cluster INVERTED (Fig. 1 and Supplementary Table S1A). A similar incidence of E (55%) and D (45%) is found at position −2. Non-polar residues are still enriched at position +1 (97.5%). However, there is a greater variety of residues, with a higher rate of P (22.5%), A (15%), F (12.5%) and M (~9%), and a lower occurrence of I (~6.3%), L (~6.3%) and V (25%) compared with the sequences of the cluster ALL or DIRECT. Finally, the sequence logo for the non-consensus sites has an ambiguous profile, the SUMOylated K being the only conserved residue (Fig. 1). In agreement with the fact that the SUMOylation pathway is highly conserved among eukaryotes, similar sequence logos were obtained for the SUMOylation sites belonging to yeast proteins (Supplementary Fig. S2).
2.3 Characterization of the SIMs DB and motifs clustering strategy
The identification of SIMs, which act as SUMO recognition modules, is necessary to better understand the functional consequences of SUMOylation. To develop a protocol for computer-aided prediction, we generated a DB by collecting 102 experimentally validated SIMs, most of which are from human proteins (62.4%). The remaining sequences are from rodent (3.0%), yeast (12.9%) and virus (16.8%) (Supplementary Fig. S1B).
Alignment of the hydrophobic core (positions −2 to +2) of the sequences showed that V and I represent more than 50% of the amino acids occurring at position −2 and −1, respectively, while having a rate ranging from 19 to 28% at other positions (Supplementary Table S1B). L is the most frequent residue at position +2 (45%). Although its occurrence is around 6% at other positions, L is nevertheless the most represented residue after V and I, the other amino acids having rates below 3%.
Table 1. List and proportion of known SUMOylation sites in the DB of JASSA
Name | Motif | Nb | % | References
Consensus direct
Strong consensus | [Ψ1]-[K]-[x]-[α] | 498 | 56.8 | Melchior, 2000; Rodriguez et al., 2001
Consensus | [Ψ2]-[K]-[x]-[α] | 591 | 67.4 |
Weak consensus | [Ψ3]-[K]-[x]-[α] | 598 | 68.2 |
PDSM | [Ψ2]-[K]-[x]-[α]-[x]2-[S]-[P] | 32 | 3.6 | Hietakangas et al., 2006
NDSM | [Ψ2]-[K]-[x]-[α]-[x]-[α]2/6 | 231 | 26.3 | Yang et al., 2006
HCSM | [Ψ4]3-[K]-[x]-[E] | 105 | 12.0 | Matic et al., 2010
SC-SUMO | [P/G]-[x](0–3)-[I/V]-[K]-[x]-[E]-[x](0–3)-[P/G] | 110 | 12.5 | Benson et al., 2007
Minimal SC-SUMO | [I/V]-[K]-[x]-[E]-[x](0–3)-[P] | 178 | 20.3 | Subramanian et al., 2003
SUMO-acetyl switch | [Ψ2]-[K]-[x]-[α]-[P] | 130 | 14.8 | Stankovic-Valentin et al., 2007
pSuM | [Ψ2]-[K]-[x]-[pS]-[P] | 1 | 0.1 | Picard et al., 2012
Consensus inverted
Strong consensus | [α]-[x]-[K]-[Ψ1] | 30 | 3.4 | Ivanov et al., 2007; Matic et al., 2010
Consensus | [α]-[x]-[K]-[Ψ2] | 77 | 8.8 |
Weak consensus | [α]-[x]-[K]-[Ψ3] | 80 | 9.1 |
Non consensus | | 229 | 26.1 |
Note: PDSM, phosphorylation-dependent SUMOylation motif; NDSM, negatively charged amino acid-dependent SUMOylation motif; HCSM, hydrophobic cluster SUMOylation motif; SC-SUMO, synergy control motif; pSuM, phosphorylated SUMOylation motif. The abundance [number (Nb) and percentage (%)] of each motif in the DB is indicated. Ψ1 = I, L or V; Ψ2 = A, F, I, L, M, P, V or W; Ψ3 = A, F, G, I, L, M, P, V, W or Y; Ψ4 = A, F, G, I, L, P or V; α = D or E; pS/T, phosphorylated serine/threonine.

Fig. 1. Characterization of the collection of SUMOylation sites. The sequence logo representations of the 21-mer harboring the SUMO-acceptor K residue for motifs included in the cluster ALL, DIRECT or INVERTED were obtained with WebLogo3 (Crooks et al., 2004) using bits unit (left) or probability unit (right). The 5-mer (positions −2 to +2) used to define the frequency plots (Supplementary Table S1) is indicated

We manually curated the alignment of the 24-mer centered on the hydrophobic core for the 102 motifs of the DB and defined 5 types of SIMs (Fig. 2 and Table 2). Because the SIM consensus is degenerate, some experimental sites could fit more than one pattern. Motifs matching the consensus ΨΨxΨ (type 1), ΨxΨΨ (type 2), ΨΨΨΨ (type 3), xΨΨΨ (type 4) and ΨαΨαΨ (type 5) (where Ψ = V, I or L; α = D or E and x = any amino acid) represent ~60, 40, 20, 30 and 4% of the collection, respectively. A deeper analysis of the sequences of the type 1 cluster showed that ~40% of the sites match the V[I/V]DLT pattern. In this context, Sun and Hunter (2012) showed that aromatic residues can be accommodated at position −2, as in the SIMs of FLASH/CASP8AP2 and C5orf25. Thus, we named the group of sites which comprise an F or Y residue at position −2 the 'type 1bis' cluster (Fig. 2). Only 5 out of 102 known SIMs could not be included in any cluster.
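The five SIM types defined above translate directly into regular expressions. The sketch below is only an illustration of that pattern search: the dictionary `SIM_PATTERNS` and the function `find_sims` are hypothetical names, and the 'type 1bis' variant is not covered. A lookahead is used so that overlapping matches are all reported.

```python
import re

# Ψ = V, I or L; α = D or E; x = any amino acid (as defined in the text).
SIM_PATTERNS = {
    "type 1": "[VIL][VIL].[VIL]",
    "type 2": "[VIL].[VIL][VIL]",
    "type 3": "[VIL]{4}",
    "type 4": ".[VIL]{3}",
    "type 5": "[VIL][DE][VIL][DE][VIL]",
}


def find_sims(seq: str):
    """Return sorted (start, matched peptide, type) tuples, overlaps included."""
    hits = []
    for name, pattern in SIM_PATTERNS.items():
        # The zero-width lookahead lets finditer report overlapping matches.
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((match.start(), match.group(1), name))
    return sorted(hits)


if __name__ == "__main__":
    for hit in find_sims("SGVIDLTE"):
        print(hit)
    for hit in find_sims("AVVVVA"):
        print(hit)
```

The second example shows the degeneracy mentioned in the text: a run of four aliphatic residues satisfies types 1 to 4 at the same position.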
2.4 Determination of the prediction performances
To compute the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curves for the prediction of SUMOylation sites by JASSA using any of the three clusters (ALL, DIRECT, INVERTED) or by GPS-SUMO, we used the R package pROC, notably the 'roc()' function with no smoothing method (R Core Team, 2015; Robin et al., 2011).
The evaluation was performed using (i) the testing dataset of SUMOhydro, which consists of 24 positive and 510 negative motifs (Chen et al., 2012); (ii) the 4351 positive sites identified by the Vertegaal group in 1489 proteins (Hendriks et al., 2014). As a matching negative dataset, we selected the 68325 K residues from the same proteins for which SUMOylation was not detected. The results were plotted using the R package ggplot2 (Wickham, 2009).
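For readers who want to see what the unsmoothed AUC measures, the sketch below computes it in plain Python. This is a stand-in for illustration, not the pROC code used in the study: the empirical AUC equals the fraction of positive/negative pairs in which the positive site receives the higher score, ties counting for one half. The function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Empirical ROC AUC as the normalized Mann-Whitney U statistic.

    Compares every positive score with every negative score, so the cost
    is quadratic; acceptable for a sketch, slow for tens of thousands of sites.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count for one half
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical predictive scores for SUMOylated and non-SUMOylated lysines.
    print(roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1]))
```

An AUC of 0.5 corresponds to a predictor that ranks positive and negative sites at random, and 1.0 to a perfect separation.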
3 Algorithm
3.1 Scoring strategy and definition of cut-off values for the prediction of SUMOylation sites
Sequence analysis of SUMOylation sites points to a hydrophobic and an acidic amino acid surrounding the target K (position 0) as major determinants of SUMO conjugation (Fig. 1). Thus, we established a scoring system where the occurrence of the residues found at positions −1 and +2 in direct consensus motifs, or −2 and +1 in inverted consensus motifs, is used to quantify the likelihood that a K residue lies within a SUMOylation site. The method is based on a PFM at four positions generated by aligning the sequences of the selected cluster (ALL, DIRECT or INVERTED) (Supplementary Table S1A). Each K residue of a query is given two predictive scores (PS) termed PSd and PSi. When either the cluster ALL or DIRECT is selected, the scores are defined as
$\mathrm{PSd} = f_{-1}(aa_{-1}) \times f_{0}(K_{0}) \times f_{+2}(aa_{+2}) \times 100$
and
$\mathrm{PSi} = f_{-1}(aa_{+1}) \times f_{0}(K_{0}) \times f_{+2}(aa_{-2}) \times 100$.
When the cluster INVERTED is chosen, the scores are defined as
$\mathrm{PSd} = f_{+1}(aa_{-1}) \times f_{0}(K_{0}) \times f_{-2}(aa_{+2}) \times 100$
and
$\mathrm{PSi} = f_{+1}(aa_{+1}) \times f_{0}(K_{0}) \times f_{-2}(aa_{-2}) \times 100$.
In the formulas, $f_{p}(aa_{q})$ is the frequency at position $p$, in the selected cluster, of the amino acid (aa) found at position $q$ of the query. Note that $f_{0}(K_{0}) = 1$ and that the frequency of residues that are absent is set to 0.0001. Since the majority of SUMOylation sites of the DB conform to the canonical consensus pattern, hydrophobic residues are overrepresented at position −1 (Fig. 1 and Supplementary Table S1A). To avoid any bias when analyzing putative inverted SUMOylation sites, the frequencies of the amino acids at positions +1 and −1 are not considered in the calculation of the PSd and the PSi, respectively.
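A small worked example may help to read the PSd and PSi formulas for the clusters ALL and DIRECT. The sketch below is not the JASSA code. Its frequency tables contain only the handful of values quoted in Section 2.2 for the cluster ALL (I, L, V and F at position −1; E and D at position +2), so every other residue falls back to the 0.0001 floor even though the real PFM (Supplementary Table S1A) would give it a non-zero frequency. The names `predictive_scores`, `F_MINUS1` and `F_PLUS2` are hypothetical, and applying the floor to positions beyond a protein terminus is an assumption.

```python
FLOOR = 0.0001  # frequency assigned to absent residues, as stated in the text

# Partial, illustrative frequencies for cluster ALL, taken from the
# percentages quoted in Section 2.2; this is NOT the full PFM.
F_MINUS1 = {"I": 0.277, "L": 0.127, "V": 0.237, "F": 0.046}
F_PLUS2 = {"E": 0.70, "D": 0.054}


def _res(seq: str, i: int) -> str:
    """Residue at index i, or an empty string outside the sequence."""
    return seq[i] if 0 <= i < len(seq) else ""


def predictive_scores(seq: str, k: int):
    """Return (PSd, PSi) for the lysine at 0-based index k (cluster ALL or DIRECT)."""
    if _res(seq, k) != "K":
        raise ValueError("position k must be a lysine (K)")
    f_k = 1.0  # f0(K0) = 1 by definition
    # PSd reads the query in the direct orientation: aa(-1) and aa(+2).
    psd = F_MINUS1.get(_res(seq, k - 1), FLOOR) * f_k * F_PLUS2.get(_res(seq, k + 2), FLOOR) * 100
    # PSi reads the mirrored positions: aa(+1) scored with f(-1), aa(-2) with f(+2).
    psi = F_MINUS1.get(_res(seq, k + 1), FLOOR) * f_k * F_PLUS2.get(_res(seq, k - 2), FLOOR) * 100
    return psd, psi


if __name__ == "__main__":
    print(predictive_scores("AVKSEPL", 2))  # direct-like site, V-K-x-E
    print(predictive_scores("ESKVG", 2))    # inverted-like site, E-x-K-V
```

For the direct-like peptide the PSd is 0.237 × 1 × 0.70 × 100 ≈ 16.6 while the PSi collapses to the floor, and the reverse holds for the inverted-like peptide, which is why the best of the two scores is retained for each K.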
Two cut-off values were defined by decision tree analysis for each cluster of SUMOylation sites (Fig. 3A–C). To this aim, a positive and a negative dataset composed of 358 bona fide SUMOylated motifs and 8071 non-SUMOylated sites, respectively (Chen et al., 2012), were analyzed with JASSA choosing either the cluster ALL, DIRECT or INVERTED. The best score (either PSd or PSi) for each sequence tested was submitted to SIPINA Research (eric.univ-lyon2.fr/~ricco/sipina.html).
3.2 Scoring strategy and definition of cut-off values for the prediction of SIMs
Fig. 2. Characterization of the collection of SIMs. Alignment of the 24 amino acid-long sequences harboring experimentally proven SIMs. The consensus motifs used to classify the sequences are shown below each cluster. Residues V, I, L (Ψ) are colored in yellow, D and E (α) in green, S in blue, T in purple and F and Y in orange

For the detection of putative SIMs, the query protein is scanned against multiple consensus motifs (Table 2), and a PS is calculated for each tetrapeptide that fits at least one of them. The algorithm for SIM prediction is based on a PFM at four positions derived from the alignment of the sequences of the training DB. The score is calculated as $\mathrm{PS} = 100 \times \prod_{i=1}^{4} f_{i}(aa_{i})$, where $f_{i}$ is the frequency of a given amino acid at position $i$ in the frequency table (Supplementary Table S1B).
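The product formula for the SIM score can be sketched in a few lines. The frequency table below is a hypothetical placeholder, not the content of Supplementary Table S1B, and the fallback of 0.0001 for residues missing from the table is borrowed from the SUMOylation scoring as an assumption, since the text does not state how absent residues are treated for SIMs. The names `SIM_PFM` and `sim_score` are likewise illustrative.

```python
from math import prod

FLOOR = 0.0001  # assumption: same floor as for SUMOylation sites

# Hypothetical placeholder frequencies for the four scored positions;
# these are NOT the values of Supplementary Table S1B.
SIM_PFM = [
    {"V": 0.50, "I": 0.25, "L": 0.06},
    {"I": 0.50, "V": 0.25, "L": 0.06},
    {"D": 0.20, "V": 0.20, "I": 0.20},
    {"L": 0.45, "V": 0.20, "I": 0.20},
]


def sim_score(tetrapeptide: str) -> float:
    """PS = 100 x product over the four positions of f_i(aa_i)."""
    if len(tetrapeptide) != len(SIM_PFM):
        raise ValueError("expected a tetrapeptide")
    return 100 * prod(col.get(aa, FLOOR) for col, aa in zip(SIM_PFM, tetrapeptide))


if __name__ == "__main__":
    print(sim_score("VIDL"))  # fits the type 1 consensus
    print(sim_score("GSGS"))  # no residue present in the placeholder table
```

With these placeholder values, VIDL scores 100 × 0.50 × 0.50 × 0.20 × 0.45 = 2.25, whereas a tetrapeptide made only of residues absent from the table scores close to zero.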
pleawaitWe defined four different thresholds for the SIM predictor with the same decision tree methodology ud for the SUMOylation
sites
interestingly
Fig.3.Establishment of cut-off values for the prediction of SUMOylation sites and for SIM.Decision trees were generated with SIPINA Rearch v3.9,induction method C4.5,following analysis with JASSA of the SUMOylated and non-SUMOylated quences from the control datats of Chen et al.(2012)with cluster ALL (A),DIRECT (B),INVERTED (C).The distribution (in %,y -axis)of the positive (green)and the negative (red)motifs of the control clusters after paration by the two cut-off values is shown.The cut-off values for the prediction of SIM motifs (D)were defined with the same methodology as for the SUMOylation sites.The distribution (in %,y -axis)of the positive (green)and the negative (red)motifs of the control datats after paration by the four cut-off values is shown
Table 2. List and proportion of known SIMs and abundance in the DB of JASSA
Name | Motif | Nb | % | References
SIM
Type 1 | [Ψ1]-[Ψ1]-[x]-[Ψ1] | 63 | 61.8 |
Type 2 | [Ψ1]-[x]-[Ψ1]-[Ψ1] | 41 | 40.2 |
Type 3 | [Ψ1]-[Ψ1]-[Ψ1]-[Ψ1] | 23 | 22.5 | This study
Type 4 | [x]-[Ψ1]-[Ψ1]-[Ψ1] | 31 | 30.4 |
Type 5 | [Ψ1]-[α]-[Ψ1]-[α]-[Ψ1] | 4 | 3.9 | This study; Ouyang et al., 2009
Type a | [V/I]-[x]-[V/I]-[V/I] | 24 | 23.5 | Song et al., 2005
Type b | [V/I]-[V/I]-[x]-[V/I/L] | 54 | 52.9 |
Type a | [P/I/L/V/M]-[I/L/V/M]-[x]-[I/L/V/M]-[α/S]3 | 17 | 16.7 | Miteva et al., 2010; Sun and Hunter, 2012
Type b | [P/I/L/V/M/F/Y]-[I/L/V/M]-[D]-[L]-[T] | 24 | 23.5 |
Type r | [α/S]3-[I/L/V/M]-[x]-[I/L/V/M/F]2 | 6 | 5.9 |
Type H | [K]-[x]3–5-[I/V]-[I/L]-[I/L]-[x]3-[α/Q/N]-[α]2 | 7 | 6.9 | Hannich et al., 2005
Type M | [Ψ2]2-[x]-[S]-[x]-[S/T]-[α]3 | 3 | 2.9 | Minty et al., 2000
Non consensus | | 5 | 4.9 |
Note: The abundance [number (Nb) and percentage (%)] of each motif in the DB is indicated. Ψ1 = I, L or V; Ψ2 = A, F, I, L, M, P, V or W; α = D or E.