BioMed Central
BMC Bioinformatics
Research article
The COG database: an updated version includes eukaryotes
Roman L Tatusov*1, Natalie D Fedorova 1, John D Jackson 1, Aviva R Jacobs 1,
Boris Kiryutin 1, Eugene V Koonin 1, Dmitri M Krylov 1, Raja Mazumder 2, Sergei L Mekhedov 1, Anastasia N Nikolskaya 2, B Sridhar Rao 1,
Sergei Smirnov 1, Alexander V Sverdlov 1, Sona Vasudevan 1, Yuri I Wolf 1, Jodie J Yin 1 and Darren A Natale 2
Address: 1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA and 2Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20007, USA
Email: Roman L Tatusov*-tatusov@ncbi.v; Natalie D Fedorova -fedorova@ncbi.v;
John D Jackson -jjackson@ncbi.v; Aviva R Jacobs -jacobs@ncbi.v; Boris Kiryutin -kiryutin@ncbi.v; Eugene V Koonin -koonin@ncbi.v; Dmitri M Krylov -krylov@ncbi.v; Raja Mazumder -rm285@georgetown.edu;
Sergei L Mekhedov -mekhedov@ncbi.v; Anastasia N Nikolskaya -ann2@georgetown.edu; B Sridhar Rao -rao@ncbi.v; Sergei Smirnov -smirnov@ncbi.v; Alexander V Sverdlov -asverdlo@ncbi.v;
Sona Vasudevan -vasudeva@ncbi.v; Yuri I Wolf -wolf@ncbi.v; Jodie J Yin -yin@ncbi.v; Darren A Natale -dan5@georgetown.edu
* Corresponding author
Abstract
Background: The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.
Results: We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the 110,655 analyzed eukaryotic gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.
Conclusion: The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
Published: 11 September 2003
BMC Bioinformatics 2003, 4:41
Received: 20 May 2003
Accepted: 11 September 2003
This article is available from: /1471-2105/4/41
© 2003 Tatusov et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
Background
The rapid accumulation of genome sequences is a major challenge to researchers attempting to extract the maximum functional and evolutionary information from the new genomes. To avoid informational overflow from the constant influx of new genome sequences, a comprehensive evolutionary classification of the genes from all sequenced genomes is required. Such classifications are based on two fundamental notions from evolutionary biology: orthology and paralogy, which describe the two fundamentally different types of homologous relationships between genes [1–4]. Orthologs are homologous genes derived by vertical descent from a single ancestral gene in the last common ancestor of the compared species. Paralogs, in contrast, are homologous genes that, at some stage of evolution of the respective gene family, have evolved by duplication of an ancestral gene. The notions of orthology and paralogy are intimately linked because, if one or more duplications occurred after the speciation event that separated the compared species, orthology becomes a relationship between sets of paralogs (co-orthologs), rather than individual genes. A classic case of the interplay between orthologous and paralogous relationships is seen in the globin family: all animal globins, including myoglobin, are paralogs, but they are all co-orthologs of the plant leghemoglobin(s) [5]. Deciphering orthologous and paralogous relationships among genes is critical for both the functional and the evolutionary aspects of comparative genomics [4,5]. Orthologs typically occupy the same functional niche in different species, whereas paralogs tend to evolve toward functional diversification. Therefore, robustness of genome annotation depends on accurate identification of orthologs. Similarly, knowing which homologous genes are orthologs and which are paralogs is required for constructing evolutionary scenarios involving, along with vertical inheritance, lineage-specific gene loss and horizontal gene transfer.
In principle, identification of orthologs requires phylogenetic analysis of entire families of homologous proteins, which is expected to isolate orthologous protein sets in distinct clades [6–8]. However, on the scale of complete genomes, such analysis is both extremely labor-intensive and error-prone due to the inherent artifacts of phylogenetic tree construction. Therefore, shortcuts have been developed by introducing the notion of a genome-specific best hit (BeT). A BeT is the protein in a target genome that is most similar to a given protein from the query genome [9,10]. The underlying premise is that orthologs are more similar to each other than they are to any other protein from the respective genomes. In multiple-genome comparisons, pairs of potential orthologs identified via BeTs can be joined to form clusters of orthologs represented in all or a subset of the analyzed genomes [9,11]. This approach to the identification of orthologous protein sets meets with two obvious complications. Firstly, many proteins belong to lineage-specific expansions, i.e., have evolved via duplication(s) after the divergence of the compared species [12–14]. In these cases, deciphering (co)orthologous relationships can be a hard task and clusters of orthologs that include such expansions should be treated with particular caution. The second complication is caused by the fact that many proteins exist in multidomain forms encoded by a single gene in some species and as products of two or more stand-alone genes in others. In protein clustering, multidomain proteins may connect distinct clusters of orthologs, resulting in artifactual lumping.
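The BeT notion described above can be illustrated with a minimal Python sketch. It assumes that all-against-all similarity scores are already available as a dictionary; the function names, the protein identifiers and the scores are hypothetical and are not taken from the COG software.

```python
def genome_specific_best_hits(scores, genome_of):
    """Return {(query, target_genome): best_target}.

    scores: dict mapping (query, target) -> similarity score (higher is better).
    genome_of: dict mapping protein id -> genome id.
    """
    best = {}
    for (query, target), score in scores.items():
        q_genome, t_genome = genome_of[query], genome_of[target]
        if q_genome == t_genome:
            continue  # a BeT is a hit in a different (target) genome
        key = (query, t_genome)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)
    return {key: target for key, (target, _) in best.items()}


def symmetric_bets(bets, genome_of):
    """Pairs (a, b) such that b is a's BeT in b's genome and vice versa."""
    pairs = set()
    for (query, _t_genome), target in bets.items():
        if bets.get((target, genome_of[query])) == query:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Toy data (hypothetical): two paralogs in genome A, one protein each in B and C.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
half = {("a1", "b1"): 200, ("a2", "b1"): 90, ("a1", "c1"): 150,
        ("b1", "c1"): 140, ("a2", "c1"): 60}
scores = {**half, **{(t, q): s for (q, t), s in half.items()}}

bets = genome_specific_best_hits(scores, genome_of)
pairs = symmetric_bets(bets, genome_of)
print(sorted(pairs))  # [('a1', 'b1'), ('a1', 'c1'), ('b1', 'c1')]
```

In this toy example the paralog a2 has b1 as its BeT in genome B, but b1 points back to a1, so a2 does not form a symmetric pair. This is the kind of situation that arises with lineage-specific expansions and that calls for caution.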
The approach to the identification of orthologous protein sets based on clustering of consistent BeTs has been implemented in the collection of Clusters of Orthologous Groups (COGs) of proteins [9,15]. The COG construction protocol included an automatic procedure for detecting candidate sets of orthologs, manual splitting of multidomain proteins into the component domains, and subsequent manual curation and annotation. The COGs started with 6 prokaryotic genomes and one genome of a unicellular eukaryote, yeast Saccharomyces cerevisiae [9]. Subsequent updates increased the number of prokaryotic genomes in the COGs to 43 [15]. The procedure for COG construction required that each COG included proteins from at least three sufficiently distant species. This conservative approach notwithstanding, ~60 to ~85% of the proteins encoded in prokaryotic genomes were included in the COGs.
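As a rough illustration of how consistent BeTs might be joined into candidate clusters that span at least three species, the following Python sketch merges triangles of mutually linked proteins from three different genomes. The triangle-merging rule, the function name and the toy identifiers are assumptions made for this example; the actual protocol is the one described in refs. [9,15] and includes manual steps that are not modeled here.

```python
from itertools import combinations


def cluster_from_triangles(pairs, genome_of):
    """Merge triangles of pairwise-linked proteins from three distinct genomes.

    pairs: iterable of (a, b) protein pairs linked by consistent best hits.
    genome_of: dict mapping protein id -> genome id.
    """
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # A triangle: three proteins, all pairwise linked, from three genomes.
    triangles = []
    for a, b, c in combinations(sorted(neighbors), 3):
        if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]:
            if len({genome_of[a], genome_of[b], genome_of[c]}) == 3:
                triangles.append({a, b, c})

    # Merge a triangle into any existing cluster sharing at least two proteins,
    # repeating until no further merge is possible.
    clusters = []
    for tri in triangles:
        merged = set(tri)
        changed = True
        while changed:
            changed = False
            for cl in clusters:
                if len(cl & merged) >= 2:
                    merged |= cl
                    clusters.remove(cl)
                    changed = True
                    break
        clusters.append(merged)
    return clusters


# Toy data (hypothetical): a2-b2 are linked but never close a triangle.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "d1": "D"}
pairs = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"),
         ("b1", "d1"), ("c1", "d1"), ("a2", "b2")]
clusters = cluster_from_triangles(pairs, genome_of)
print([sorted(c) for c in clusters])  # [['a1', 'b1', 'c1', 'd1']]
```

Requiring a closed triangle across three genomes is what makes every resulting cluster contain at least three species; pairs supported by only two genomes are left out.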
The COG system, which includes the COGNITOR program for adding new members to COGs (RLT, unpublished results), has become a widely used tool for computational genomics. The most important applications of the COGs are functional annotation of newly sequenced genomes [16–20] and genome-wide evolutionary analyses [21–25].
Here, we present a major update to the COGs, with over 63 sequenced prokaryotic genomes and three genomes of unicellular eukaryotes now included. Furthermore, the COG system is extended to complex, multicellular eukaryotes by constructing clusters of probable orthologs, which we named KOGs (eukaryotic orthologous groups), for 7 sequenced genomes of animals, fungi, microsporidia, and plants.
Results and discussion
Update of the COGs
To add a new species to the COG system, the annotated protein sequences from the respective genome were compared to the proteins in the COG database by using the BLAST program and assigned to pre-existing COGs by using the COGNITOR program (see Materials and Methods). The genomes of prokaryotes and unicellular eukaryotes that have been sequenced since the latest update of the COGs were added one at a time. At each step, the proteins that remained unassigned after manual validation of the COGNITOR results were subject to the COG construction procedure in order to identify new COGs that could be formed thanks to the addition of the analyzed genome. The resulting COG assignments for 63 prokaryotic genomes and three genomes of unicellular eukaryotes are quantified in Table 1. The addition of new species leads to an incremental increase in the COG coverage for each of the included prokaryotic genomes. The highest coverage now achieved is for Buchnera sp. (99%) and the lowest coverage is for Borrelia burgdorferi (43%). Each of these organisms is a special case. Buchnera is a highly degraded endosymbiont, which evolved from a relatively recent common ancestor with E. coli but apparently lost the great majority of genes, retaining, almost exclusively, conserved, essential ones [26], whereas Borrelia has numerous plasmids that mostly encode poorly conserved genes [27]. Probably more telling is the observation that, for most free-living prokaryotes, ~80% of the genes belong to COGs and there is no appreciable dependence between the number of genes in a genome and the COG coverage (Table 1). Given that most genomes encode a substantial fraction (up to 10%) of fast-evolving, non-globular proteins [28] and other poorly conserved proteins (e.g., remnants of prophages) as well, these findings seem to suggest that the COG coverage of most genomes is approaching saturation.
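The assignment of proteins from a new genome to pre-existing COGs can be pictured with the following Python sketch. It is not the COGNITOR program: the voting rule, the threshold of three supporting genomes, the function name and all identifiers are assumptions chosen only to illustrate the general idea of assigning a protein to the COG that receives most of its genome-specific best hits.

```python
from collections import Counter


def assign_to_cog(bets_for_protein, cog_of, min_genomes=3):
    """Return the COG supported by at least `min_genomes` best hits, else None.

    bets_for_protein: dict mapping genome id -> best-hit protein in that genome.
    cog_of: dict mapping protein id -> COG id (proteins outside COGs are absent).
    min_genomes: hypothetical threshold, an assumption of this sketch.
    """
    # Each genome contributes one best hit, so a vote count is a genome count.
    votes = Counter(cog_of[hit] for hit in bets_for_protein.values()
                    if hit in cog_of)
    if not votes:
        return None
    cog, n_genomes = votes.most_common(1)[0]
    return cog if n_genomes >= min_genomes else None


# Toy data (hypothetical identifiers).
cog_of = {"a1": "COG_X", "b1": "COG_X", "c1": "COG_X", "d9": "COG_Y"}
bets_for_new_protein = {"A": "a1", "B": "b1", "C": "c1", "D": "d9", "E": "e5"}

assigned = assign_to_cog(bets_for_new_protein, cog_of)
print(assigned)  # COG_X
```

Proteins for which the function returns None correspond to the unassigned remainder that, in the procedure described above, is passed on to the COG construction step.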
The COGs are accompanied by a phyletic pattern search tool, i.e., a Web-based tool that allows the user to select COGs with a desired pattern of presence-absence of species. Using the phyletic pattern search tool, one can classify the COGs by the representation of the major lineages of unicellular life forms (Fig. 1). This breakdown of the updated COGs emphasizes the important trend noticed previously [9,15]: only a minuscule fraction (~1%) of the COGs are ubiquitous and even the COGs that are present in all bacteria or in all archaea represent a small minority. Furthermore, many COGs show scattered distribution, which appears to reflect rampant lineage-specific gene loss and horizontal gene transfer, which are typical of prokaryotic evolution [29–31].
Construction of KOGs for 7 sequenced eukaryotic genomes
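A phyletic pattern query of the kind described above can be sketched in a few lines of Python. The data structure (a mapping from each COG to the set of species it contains), the function name, and the COG and species labels are hypothetical; the sketch does not reproduce the Web-based tool.

```python
def select_by_pattern(cog_species, present=(), absent=()):
    """Return COG ids that contain every species in `present` and none in `absent`."""
    required, forbidden = set(present), set(absent)
    return sorted(cog for cog, species in cog_species.items()
                  if required <= species and not (forbidden & species))


# Toy data (hypothetical): four COGs over four species.
cog_species = {
    "COG_A": {"Eco", "Bsu", "Mja", "Sce"},
    "COG_B": {"Eco", "Bsu"},
    "COG_C": {"Mja", "Sce"},
    "COG_D": {"Eco", "Mja"},
}

# COGs found in Eco but missing from Mja.
eco_not_mja = select_by_pattern(cog_species, present=["Eco"], absent=["Mja"])

# Ubiquitous COGs: represented in every species of the collection.
all_species = set().union(*cog_species.values())
ubiquitous = select_by_pattern(cog_species, present=all_species)

print(eco_not_mja, ubiquitous)  # ['COG_B'] ['COG_A']
```

Dividing the number of ubiquitous clusters by the total number of clusters gives the kind of fraction quoted in the text (~1% for the COGs).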
Eukaryotic KOGs were constructed from annotated proteins encoded in the genomes of three animals (Homo sapiens [32], the fruit fly Drosophila melanogaster [33], and the nematode Caenorhabditis elegans [34]), the green plant Arabidopsis thaliana (thale cress) [35], two fungi (budding yeast Saccharomyces cerevisiae [36] and fission yeast Schizosaccharomyces pombe [37]), and the microsporidian Encephalitozoon cuniculi [38]. The basic procedure for KOG construction was the same as the procedure previously employed for prokaryotic genomes (Refs. [9,15]; see Materials and Methods). Given the abundance of multidomain architectures among eukaryotic proteins and the fact that apparent orthologs often differ in domain composition [32,39], the protocol based on the BeT analysis was amended with domain identification using the RPS-BLAST program [40]. Proteins assigned to a KOG by the initial KOG construction procedure were kept in that KOG without splitting them into individual domains if they shared a common core of domains. In addition, proteins that consisted solely of widespread, "promiscuous" domains (e.g., SH2, SH3, WD40 repeats or TPR repeats) and did not show clear-cut orthologous relationships were assigned to Fuzzy Orthologous Groups (FOGs). In addition to KOGs and FOGs, we also identified provisional clusters of orthologs represented in two genomes (TWOGs) by detecting bi-directional BeTs between proteins not included in KOGs or FOGs and assigning additional members by examination of the BLAST search outputs. Finally, lineage-specific expansions (LSEs) of paralogs among the proteins from each genome not included in KOGs, FOGs, and TWOGs were detected by using the clustering procedure described previously [14] accompanied by a newly developed procedure for finding tight protein clusters (BK and RLT, unpublished results). The construction of TWOGs and LSEs involved more extensive case-by-case evaluation than the KOG construction due to the lack of well-established procedures to generate these types of clusters; nevertheless, these clusters should be considered preliminary until further validation.
Table 2 shows the assignment of the proteins from each of the analyzed eukaryotic species to KOGs. Unlike the situation with prokaryotic COGs (Table 1), the fraction of proteins assigned to KOGs tends to decrease with increasing genome size of the analyzed eukaryotic species, from the maximum of ~74% for fission yeast Schizosaccharomyces pombe, the second smallest genome (for reasons that remain unclear, the smallest genome, that of the microsporidian Encephalitozoon cuniculi, had only 61% of the proteins included in KOGs), to ~49% for the largest, human genome (Table 2).
Compared to prokaryotes, a considerably smaller fraction of eukaryotic genes could be included into KOGs (Tables 1 and 2). Thus, the apparent difference in coverage with highly conserved clusters of orthologs (C/KOGs) between prokaryotes and eukaryotes, particularly complex ones, is probably due to the relatively small number of eukaryotic genomes included in this analysis and is expected to level off with the growth of the eukaryotic genome collection. This view is compatible with the observed dependence of the KOG coverage on the number of genes (Table 2), which suggests that the KOGs are still far from saturation. Examination of the phyletic patterns of KOGs points to the existence of a conserved eukaryotic gene core as well as substantial diversity (Fig. 2); this clearly resembles the evolutionary pattern seen previously during the analysis of archaeal COGs [41]. The genes represented in each of the 7 analyzed genomes comprise ~20% of the KOG set, and approximately the same number of KOGs include 6 species, that is, all except the microsporidian. The prevalence of the latter pattern is not surprising given that microsporidia are intracellular parasites with minimal metabolic capabilities and a dramatically reduced genome [38]. The next largest group consists of animal-specific KOGs, which, again, could be expected because animals are the only lineage of complex eukaryotes that is represented by more than one species in the analyzed set of genomes. However, a notable observation is that ~30% of the KOGs have "odd" phyletic patterns, e.g., are represented in one animal, one plant and one fungal species (Fig. 2).
Table 1: Coverage of unicellular organisms in COGs
Columns: species; number of annotated proteins; number (and percentage) of proteins in COGs; number of COGs that include the given species.
Species                             Proteins  In COGs (%)   COGs
Bacteria
Proteobacteria (Gram-negative)
Agrobacterium tumefaciens           5299      4398 (83%)    1978
Brucella melitensis                 3198      2678 (84%)    1654
Caulobacter crescentus              3737      2958 (79%)    1734
Mesorhizobium loti                  7275      5653 (78%)    2175
Sinorhizobium meliloti              6205      5207 (84%)    2084
Rickettsia conorii                  1374      891 (65%)     733
Rickettsia prowazekii               835       727 (87%)     647
Buchnera sp.                        574       567 (99%)     559
Escherichia coli K12                4279      3623 (85%)    2131
Escherichia coli O157:H7            5324      4050 (76%)    2190
Escherichia coli O157:H7 EDL933     5361      4023 (75%)    2200
Salmonella typhi                    4553      3724 (82%)    2167
Yersinia pestis                     4083      3341 (82%)    1993
Haemophilus influenzae              1714      1597 (93%)    1317
Pasteurella multocida               2015      1829 (91%)    1455
Vibrio cholerae                     3463      2929 (85%)    1918
Pseudomonas aeruginosa              5567      4660 (84%)    2243
Xylella fastidiosa                  2832      1740 (61%)    1310
Neisseria meningitidis MC58         2079      1561 (75%)    1255
Neisseria meningitidis Z2491        2065      1573 (76%)    1260
Ralstonia solanacearum              5116      3931 (77%)    2018
Campylobacter jejuni                1634      1328 (81%)    1093
Helicobacter pylori 26695           1576      1127 (72%)    920
Helicobacter pylori J99             1491      1106 (74%)    921
Low-GC Gram-positive bacteria
Bacillus halodurans                 4066      3149 (77%)    1744
Bacillus subtilis                   4112      3125 (76%)    1771
Clostridium acetobutylicum          3848      2879 (75%)    1549
Lactococcus lactis                  2267      1798 (79%)    1208
Listeria innocua                    3043      2428 (80%)    1522
Mycoplasma genitalium               484       385 (80%)     362
Mycoplasma pneumoniae               689       431 (63%)     383
Mycoplasma pulmonis                 782       514 (66%)     426
Ureaplasma urealyticum              614       418 (68%)     378
Staphylococcus aureus               2625      2071 (79%)    1419
Streptococcus pneumoniae            2094      1586 (76%)    1105
Streptococcus pyogenes              1697      1356 (80%)    1030
Actinobacteria
Corynebacterium glutamicum          3040      2162 (71%)    1339
Mycobacterium tuberculosis H37Rv    3927      2843 (72%)    1450
Mycobacterium tuberculosis CDC1551  4187      2756 (66%)    1434
Mycobacterium leprae                1605      1180 (74%)    927
Hyperthermophilic bacteria
Aquifex aeolicus                    1560      1349 (86%)    1088
Thermotoga maritima                 1858      1565 (84%)    1167
Cyanobacteria
Synechocystis sp.                   3167      2346 (74%)    1427
Nostoc sp.                          6129      3832 (63%)    1673
Other bacteria
Borrelia burgdorferi                1638      701 (43%)     577
Treponema pallidum                  1036      737 (71%)     639
Chlamydia trachomatis               895       644 (72%)     587
Chlamydophila pneumoniae            1054      667 (63%)     603
Deinococcus radiodurans             3182      2322 (73%)    1495
Fusobacterium nucleatum             2067      1556 (75%)    1143
Archaea
Euryarchaeota
Archaeoglobus fulgidus              2420      1953 (81%)    1244
Methanocaldococcus jannaschii       1758      1448 (82%)    1117
Methanothermobacter autotrophicus   1873      1500 (80%)    1123
Methanopyrus kandleri               1691      1253 (74%)    1022
Methanosarcina acetivorans          4540      3142 (69%)    1462
Pyrococcus abyssi                   1769      1506 (85%)    1065
Pyrococcus horikoshii               1801      1425 (79%)    1019
Thermoplasma acidophilum            1482      1261 (85%)    890
Thermoplasma volcanium              1499      1277 (85%)    900
Halobacterium sp.                   2622      1809 (69%)    1109
Crenarchaeota
Aeropyrum pernix                    1840      1236 (67%)    947
Pyrobaculum aerophilum              2605      1529 (59%)    1015
Sulfolobus solfataricus             2977      2207 (74%)    1084
Eukaryota
Saccharomyces cerevisiae            6338      3012 (48%)    1299
Schizosaccharomyces pombe           4979      2774 (56%)    1282
Encephalitozoon cuniculi            1996      1105 (55%)    696
To illustrate the typical composition of a KOG, some of the problems that tend to emerge with their construction, and possible biological implications, we briefly discuss here KOG3378, which includes proteins already mentioned above as a typical case of paralogy and orthology, namely, the globins (Fig. 3). Globins are small (typically, between 140 and 150 amino acid residues) and relatively poorly conserved proteins. As a consequence, the initial, automatic procedure for KOG construction produced a candidate KOG consisting of only 3 proteins from 3 species: S. cerevisiae YGR234w, its ortholog from S. pombe SPAC869.02c, and human neuroglobin Hs10864065. The remaining proteins were brought into the KOG manually,
Figure 1
Phyletic patterns of COGs. All, represented in all unicellular organisms included in the COG system; All archaea, All bacteria, All eukaryotes, represented in each species from the respective domain of life (and possibly in some species from other domains); All bacteria except the smallest, represented in all bacteria except, possibly, parasites with small genomes (mycoplasma, chlamydia, rickettsia, and spirochetes).
Table 2: Representation of the 7 analyzed eukaryotic species in KOGs
Columns: species; symbol; number of annotated proteins; number of proteins in KOGs; percentage of proteins in KOGs.
Species                     Symbol  Proteins  In KOGs  %
Arabidopsis thaliana        A       25,749    13,531   53%
Caenorhabditis elegans      C       20,275    10,393   51%
Drosophila melanogaster     D       13,468    8,321    62%
Homo sapiens                H       37,840    18,714   49%
Saccharomyces cerevisiae    Y       6,338     3,971    63%
Schizosaccharomyces pombe   P       4,989     3,692    74%
Encephalitozoon cuniculi    E       1,996     1,216    61%
Total                               110,655   59,838   54%
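The coverage figures in Table 2 can be re-derived from the protein counts with a short Python check. The counts are copied from the table; the variable and function names are illustrative only.

```python
# (annotated proteins, proteins in KOGs) per species, as listed in Table 2.
table2 = {
    "Arabidopsis thaliana": (25749, 13531),
    "Caenorhabditis elegans": (20275, 10393),
    "Drosophila melanogaster": (13468, 8321),
    "Homo sapiens": (37840, 18714),
    "Saccharomyces cerevisiae": (6338, 3971),
    "Schizosaccharomyces pombe": (4989, 3692),
    "Encephalitozoon cuniculi": (1996, 1216),
}


def coverage_percent(annotated, in_kogs):
    """Percentage of annotated proteins assigned to KOGs, rounded to an integer."""
    return round(100 * in_kogs / annotated)


percent = {sp: coverage_percent(*counts) for sp, counts in table2.items()}
total_annotated = sum(a for a, _ in table2.values())
total_in_kogs = sum(k for _, k in table2.values())
total_percent = coverage_percent(total_annotated, total_in_kogs)

for species, (annotated, _) in sorted(table2.items(), key=lambda kv: kv[1][0]):
    print(f"{species:28s}{annotated:>8d}{percent[species]:>5d}%")
print(total_annotated, total_in_kogs, total_percent)  # 110655 59838 54
```

Sorting by the number of annotated proteins makes the trend discussed in the text visible: apart from Encephalitozoon cuniculi, coverage falls from 74% for Schizosaccharomyces pombe to 49% for Homo sapiens.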