Fast UniFrac and PCoA analysis software: user guide

Updated: 2023-06-19 06:11:18

Fast UniFrac is a new version of UniFrac that is specifically designed to handle very large datasets. Like UniFrac, Fast UniFrac provides a suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single phylogenetic tree that contains sequences derived from at least three different environmental samples, a file mapping the IDs used in the tree to a set of unique sample IDs (same format as the 'environment file' of the prior version), and an optional category mapping file describing additional relationships between samples and subcategories for visualizations. For example, in a given set of gut samples, you might define subcategories for different diets, different physical locations or dates, different species, and/or different treatments such as antibiotics or a high-fat diet. Sample data are listed under 'Example data files' below.
Both the UniFrac distance metric and the P test can be used to make comparisons. Both techniques bypass the need to choose operational taxonomic units (OTUs) based on sequence divergence prior to analysis.
Fast UniFrac allows you to:
Determine if the samples in the input phylogenetic tree have significantly different microbial communities.
Cluster samples to determine whether there are environmental factors (such as temperature, pH, or salinity) that group communities together.
Determine whether the system under study was sampled sufficiently to support cluster nodes.
Easily visualize the differences between samples graphically, with support for three-dimensional exploration of datasets and with multiple subcategory coloring.
Please enter your email and password to continue. After you register you will be able to analyze up to 100000 unique sequences and up to 200 samples, and perform significance tests based on up to 1000 tree permutations.
If you wish to analyze much larger datasets than the defaults, please contact us and we will be happy to try to accommodate you.
Fast UniFrac tutorial
Introduction
This tutorial takes you through the steps of analyzing data in the Fast UniFrac web application. The purpose of this tutorial is to show you how to use the interface to find the important variables for describing phylogenetic variation among your samples: in this case, to test what types of physical or chemical factors are most important for structuring bacterial diversity. The dataset used in this tutorial includes 50 of the 464 samples analyzed in Ley, RE, Lozupone, CA, Hamady, M, Knight, R and JI Gordon. (2008). Worlds within worlds: evolution of the vertebrate gut microbiota. Nat. Rev. Microbiol. 6(10): 776-88 (PubMed). It includes sequences from 16S ribosomal RNA surveys of diverse free-living bacterial assemblages and the guts of diverse mammals and termites. At the end of this tutorial, you should be fully equipped to test hypotheses about your own sequences.
Also included in this tutorial are other example files you may use to explore some of the other features of Fast UniFrac.
Example data files
To use Fast UniFrac, you need three files: a tree file, a sample ID mapping file, and a category mapping file. The tree file contains a phylogenetic tree in Newick format. The sample ID mapping file contains a table showing how many times each taxon (from the tree) occurred in each of your samples. The category mapping file contains additional metadata about the samples, and is a table relating each sample to parameters you have measured, such as temperature, pH, etc. People usually prepare the two mapping files in Excel, but it is important to save them as plain text and not as Excel documents.
You can either generate your own tree file or use one of the reference trees. The PhyloChip reference tree matches the probes on the PhyloChip and is useful for analyzing PhyloChip data; the Greengenes reference tree is from the Greengenes core set and is a phylogenetically diverse and representative set of bacteria. The trees are built using 16S rRNA, although you can use trees built from any molecule, not just the 16S, or even trees constructed from morphological or other data.
The sample ID mapping file must be generated by mapping the sequence IDs in the tree file to the sample IDs used in your study. In other words, exactly the same taxon names must be used in your tree and in your sample ID mapping file.
The category mapping file maps your sample IDs to additional metadata, such as subcategories and sample descriptions. This file can be autogenerated, but it is highly recommended that you generate one that is meaningful for the variation you plan to examine in your studies. For example, if you were studying the effects of diet on the gut communities of conventional and humanized mice, you might want one column indicating whether the sample was from a conventional or a humanized mouse, another column indicating whether the mouse was on a chow diet or a high-fat diet, another column containing the combination of the two columns (i.e. diet and humanized/conventional), etc.
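As a minimal sketch of the combined-column idea, the following Python builds the lines of such a category mapping file. The sample IDs, the column names (HostType, Diet, HostDiet) and the descriptions are hypothetical and chosen only for illustration; the header layout follows the format described later in this tutorial.

```python
# Hypothetical samples: (sample ID, host type, diet, description).
SAMPLES = [
    ("M1", "Conventional", "Chow", "Conventional mouse on chow diet"),
    ("M2", "Conventional", "HighFat", "Conventional mouse on high-fat diet"),
    ("M3", "Humanized", "Chow", "Humanized mouse on chow diet"),
    ("M4", "Humanized", "HighFat", "Humanized mouse on high-fat diet"),
]


def build_category_rows(samples):
    """Return tab-delimited lines, adding a combined HostDiet column."""
    lines = ["#SampleID\tHostType\tDiet\tHostDiet\tDescription"]
    for sample_id, host, diet, description in samples:
        combined = f"{host}_{diet}"  # one value per host/diet combination
        lines.append("\t".join([sample_id, host, diet, combined, description]))
    return lines


if __name__ == "__main__":
    print("\n".join(build_category_rows(SAMPLES)))
```

Keeping the combined column next to the two simple ones lets you color the same plot by host, by diet, or by both without editing the file again.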
In this section, several example files are listed, not all of which are used in this tutorial.
Greengenes coreset reference datasets
This is the tree and the sequences matching the Greengenes core set as of May 2009. The files are useful for mapping your sequences against known bacterial diversity.
1. Greengenes coreset tree (May 09)
2. Greengenes coreset fasta (May 09)
NRM data (demo subset)
The data are from the Ley et al. 2008 Nature Reviews Microbiology paper referenced above, and provide an example of mapping heterogeneous reads to the Greengenes core set tree so that the communities can be compared by UniFrac. The sample ID mapping file was generated by BLASTing the dataset from the paper against the Greengenes_coreset_fasta file linked above, and the category mapping file was constructed manually to provide a range of fine- and coarse-grained representations of the environmental data.
1. Ley et al example sample ID mapping file
2. Ley et al example category mapping file
Example PhyloChip data
Example data from the Sagaram et al. 2009 AEM paper (PubMed) for use with the PhyloChip reference tree.
1. Sagaram et al PhyloChip sample ID mapping file
2. Sagaram et al PhyloChip category mapping file
Crump et al data
The sequences are from Crump et al. 1999 "Phylogenetic analysis of particle-attached and free-living bacterial communities in the Columbia river, its estuary, and the adjacent coastal ocean", AEM 65:3192 (PubMed). This dataset was used in the original online UniFrac tutorial (PubMed), so it is provided again here with two important changes. We provide an example category mapping file that contains additional metadata about each of the samples.
1. Crump et al example tree file
2. Crump et al example sample ID mapping file
3. Crump et al example category mapping file
Megablast protocol and sample mapping generation script
The application of UniFrac to large sequence sets, such as those generated with pyrosequencing, is also limited by the computational power needed to make a de novo phylogenetic tree using standard methods, such as neighbor joining, likelihood, or parsimony. To prepare phylogenetic trees for input into UniFrac from very large datasets, we recommend using QIIME. The best sources of information about QIIME are the website and the QIIME paper, which you can get at the following links:
1. Source code
2. QIIME allows analysis of high-throughput community sequencing data
The quickest way to get started with QIIME is using the virtual machine.
One potential workflow for working with large datasets is to use QIIME to:
1. Preprocess sequences to handle low quality reads
2. Select OTUs
3. Generate a phylogenetic tree
Then use the QIIME script convert_otu_table_to_unifrac_sample_mapping.py to generate the proper input files for the Fast UniFrac web interface.
In the initial release of Fast UniFrac, we also described the following procedure for generating a phylogenetic tree, which is based on mapping sequences to their closest relative in a reference tree using BLAST. This functionality is now in QIIME, and we recommend using QIIME for this step, but we retain this documentation below for those who may still be interested in using it.
The BLAST to Greengenes protocol
We illustrate that the analysis of such large sequence sets can be carried out by assigning them to their closest relative in a phylogeny of the Greengenes core set (DeSantis et al., 2006) using BLAST's megablast protocol (Altschul et al., 1990). Below is a detailed protocol for carrying out this analysis. Note that a different BLAST database can be substituted for use with any reference tree.
1. Create the Greengenes BLAST database:
This link is a fasta file containing the sequences from the Greengenes coreset. This fasta record can be formatted into a BLAST database using the command:
formatdb -i GreenGenesCore-May09.ref.fna -p F -o F -n gg_coreset
2. Perform the megablast search:
A fasta record of your samples can be BLASTed against the gg_coreset BLAST database created in step 1 using the following command:
blastall -p blastn -n T -d gg_coreset -i -e 1e-30 -b 5 -m 9 -o blast_output.txt
Note that the -m 9 flag is essential because it specifies the hit table output format that the script below requires.
Also note that the sequence names must conform to the following format:
sampleName Delimiter sequenceId
For instance, if you sequenced 2 clones from each of two samples named SA and SB, valid sequence names might be:
SA#01
SA#02
SB#01
SB#02
If you have not named the sequences according to this convention, it is also possible to use a mapping file describing which sequence is from which sample. See the documentation within the code for more details on this.
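The naming convention can be checked before running BLAST. The following is a minimal sketch, not part of the distributed script; the function name split_sequence_name is hypothetical, and splitting at the first delimiter is an assumption that only holds when sample names do not themselves contain the delimiter.

```python
def split_sequence_name(name, delimiter="#"):
    """Split a name such as 'SA#01' into (sample name, sequence ID).

    Assumption: the sample name is everything before the FIRST delimiter.
    """
    sample, found, sequence_id = name.partition(delimiter)
    if not found or not sample or not sequence_id:
        raise ValueError(
            f"{name!r} does not look like sampleName{delimiter}sequenceId"
        )
    return sample, sequence_id


if __name__ == "__main__":
    for name in ["SA#01", "SA#02", "SB#01", "SB#02"]:
        print(name, "->", split_sequence_name(name))
```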
3. Use this Python script and the BLAST output from step 2 to create an environment file that can be used with UniFrac:
Note that the PyCogent toolkit must be downloaded from SourceForge and the cogent directory should be on your PYTHONPATH.
You can then use the code as follows:
python create_unifrac_env_file_BLAST.py <blast_output.txt> <outfile_path.txt> <sample_name_delimiter>
blast_output.txt: Path to the hit tables from the BLAST searches
outfile_path.txt: Path to where the environment file will be saved
sample_name_delimiter: A delimiter (e.g. a #) that separates the sample name from the sequence ID.
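To make the conversion concrete, here is a rough Python sketch of how a hit table could be turned into environment-file rows. It is not the create_unifrac_env_file_BLAST.py script and does not use PyCogent. It assumes the tabular -m 9 layout ('#' comment lines, then tab-separated rows whose first two fields are the query ID and the subject ID) and assumes the first hit listed for a query is its best hit; the function names are hypothetical.

```python
from collections import Counter


def best_hits(hit_table_lines):
    """Yield (query ID, subject ID) for the first hit listed for each query."""
    seen = set()
    for line in hit_table_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#' comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        if query not in seen:  # assumption: hits are listed best first
            seen.add(query)
            yield query, subject


def environment_rows(hit_table_lines, delimiter="#"):
    """Return 'reference ID <tab> sample <tab> count' rows."""
    counts = Counter()
    for query, subject in best_hits(hit_table_lines):
        sample = query.split(delimiter, 1)[0]
        counts[(subject, sample)] += 1
    return [
        f"{subject}\t{sample}\t{n}"
        for (subject, sample), n in sorted(counts.items())
    ]
```

Each output row names a tip of the reference tree, the sample whose reads were assigned to it, and how many reads were assigned, which is the three-column layout used by the sample ID mapping file.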
Steps
1. Create a phylogenetic tree containing sequences from samples that you would like to compare, or select a reference tree.
The tree should be rooted, and must have branch lengths to use Fast UniFrac. Typically, the tree is rooted by including an outgroup, e.g. an archaeal sequence to root the bacteria, but we sometimes use midpoint rooting as well. If an unrooted tree is supplied, UniFrac will assign a root arbitrarily. If you have extra sequences in the tree that are not annotated by sample, they will automatically be removed from the tree when you upload the file, so the outgroup will not be included in the analysis. If no sequences appear in the tree after upload, the most likely problem is that there was an issue with your sample ID mapping file (for example, you might have used GenBank identifiers in the tree, but NCBI GIs in the sample ID mapping file, which wouldn't match each other).
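A quick way to catch an identifier mismatch before uploading is to compare the tip names in the tree with the first column of the sample ID mapping file. The sketch below is a simplified illustration, not part of Fast UniFrac: the function names are hypothetical, and the regular expression only handles unquoted Newick labels.

```python
import re


def newick_tip_names(newick):
    """Return the set of tip labels in a Newick string (unquoted labels only).

    A tip label directly follows '(' or ','; internal node labels follow ')'.
    """
    return set(re.findall(r"(?<=[(,])\s*([^():,;\s]+)", newick))


def unmatched_ids(newick, mapping_lines):
    """Return sequence IDs from the mapping file that are not tips in the tree."""
    tips = newick_tip_names(newick)
    ids = {line.split("\t")[0].strip() for line in mapping_lines if line.strip()}
    return sorted(ids - tips)


if __name__ == "__main__":
    tree = "((150394:0.1,215260:0.2):0.05,16073:0.3);"
    mapping = ["150394\tTg#1249\t1", "99999\tNso#65\t1"]
    print(unmatched_ids(tree, mapping))
```

An empty result means every sequence ID in the mapping file was found in the tree; any ID that is printed would be ignored or cause an empty tree after upload.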
There are many different programs that you can use for sequence alignment and/or phylogeny, including the NAST alignment tool, PyNAST, FastTree, ARB, ClustalW, MUSCLE, PHYLIP, PAUP, or MrBayes. For 16S rRNA sequences, we prefer PyNAST for alignment. For large datasets, we prefer either FastTree for de novo tree generation or mapping sequences to their closest relative in a reference tree. The preferred options, as well as several others, can be run using QIIME. For large datasets, it is greatly preferred to select OTUs prior to the alignment and tree building step. This cuts down on the computation time and does not have an effect on the results. Because UniFrac depends on branch lengths, it is important to look at your tree to ensure that you don't see long branches that result from misalignment rather than from long periods of evolution. At the end of this process, you can export the tree in Newick format for upload into the UniFrac interface.
Alternatively, you can choose one of the reference trees provided and map your sequences to this tree. This can be useful, particularly for large datasets such as those produced by 454 pyrosequencing, since creating a single phylogenetic tree with all sequences may not be feasible with the programs listed above. One simple way to map your sequences onto their closest relatives in a reference tree is to use megablast. In this tutorial, the original sequences from the NRM paper were assigned to their closest hit in the 11-Aug_2007 version of the Greengenes coreset (can be downloaded from v/Download/Sequence_Data/Fasta_data_files/). Sequences with no hit or that match with an e-value greater than e-50 were dropped from this example dataset.
For the purpose of this tutorial, we provide the Greengenes coreset tree in Newick format that we exported from an ARB database that is available for download at v/Download/Sequence_Data/Arb_databases/. A small number of sequences were added to this tree using parsimony insertion in ARB so that the fasta data files and tree for the core set were in sync. The resulting tree (Greengenes coreset tree (May 09)) and corresponding sequences (Greengenes coreset fasta (May 09)) can be downloaded, but please note that this tree can be imported to your history and does not need to be re-uploaded. To import the GreenGenes reference tree to your history, follow these steps:
1. In the upper menu, go to Shared Data - Data Libraries:
2. Then, select 'GreenGenes coreset tree (May 09)':
3. Click on the checkbox next to '' and, finally, on the 'Go' button:
4. The reference tree is now in your history and you can use it.
2. Create a sample ID mapping file.
This file maps each sequence ID in the tree to the sample ID that it came from. This must be done manually (or via a script): for each sequence, type the sequence ID used in the tree, then a tab, then the sample ID that it comes from, then optionally, another tab and then the number of times each sequence was observed (sequence abundance).
The sequence abundance column is important if you have dereplicated the sequence data in any way (e.g. choosing OTUs and only including a representative sequence in the tree, removing exact duplicate sequences, or pre-screening clones using RFLP patterns prior to sequencing), and you are planning on using tools in the interface that consider differences in relative abundance (e.g. weighted UniFrac). It is fine to use a tree and sample ID mapping file with all of the sequences (e.g. 5 duplicate sequences in the tree each with a weight of 1 rather than 1 representative sequence with a weight of 5) and to perform abundance-based analyses, although dereplicating the data will allow you to process larger datasets.
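One of the dereplication options mentioned above, removing exact duplicate sequences, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a recommended pipeline: the input is a hypothetical list of (sequence ID, sample ID, sequence) tuples, and the first ID seen for each distinct sequence is used as its representative.

```python
from collections import Counter


def dereplicate(records):
    """Collapse exact duplicate sequences into abundance counts.

    records: iterable of (sequence_id, sample_id, sequence) tuples.
    Returns sorted (representative_id, sample_id, count) rows.
    """
    representative = {}  # sequence string -> first sequence ID seen
    counts = Counter()
    for sequence_id, sample_id, sequence in records:
        rep = representative.setdefault(sequence, sequence_id)
        counts[(rep, sample_id)] += 1
    return [(rep, sample, n) for (rep, sample), n in sorted(counts.items())]


if __name__ == "__main__":
    reads = [
        ("s1", "SA", "ACGT"),
        ("s2", "SA", "ACGT"),
        ("s3", "SB", "ACGT"),
        ("s4", "SB", "TTGA"),
    ]
    for rep, sample, n in dereplicate(reads):
        print(f"{rep}\t{sample}\t{n}")
```

Only the representative IDs (s1 and s4 in this toy input) then need to appear in the tree, while the count column preserves the abundance information for weighted analyses.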
For PCoA analysis, it is most convenient to name each environment so that samples of the same type have names that start with the same first 1, 3, or 5 letters or that have sample types followed by a period, hash, or plus character (this allows you to apply colors in the PCoA scatterplots later).
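The naming rule can be illustrated with a small helper that recovers the sample type from a sample name. This is only a sketch of the convention; how the web interface groups names internally is not described here, and the function name sample_type is hypothetical.

```python
import re


def sample_type(sample_id, prefix_length=None):
    """Return the part of a sample ID that identifies its type.

    With prefix_length set, use the first N characters; otherwise use
    everything before the first period, hash, or plus character.
    """
    if prefix_length is not None:
        return sample_id[:prefix_length]
    return re.split(r"[.#+]", sample_id, maxsplit=1)[0]


if __name__ == "__main__":
    for name in ["Tg#1249", "Nso#65", "Vg#h#111"]:
        print(name, "->", sample_type(name))
```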
In this example, there are 50 bacterial samples from the following sample types: Surface and subsurface saline water (Sws and Swb, respectively), Nonsaline water (Nw), Saline sediments (S), Nonsaline sediments (Nsa), Soils (Nso), the Vertebrate gut (Vg) and the Termite gut (Tg). We'll label each sample with its 2-3 letter sample code, followed by a hash and a unique number, because our hypothesis is that the organisms from the same overall environment should be more similar to one another.
The following is a short snippet of a sample ID mapping file. The first column is the sequence ID, the second column is the sample ID, and the last column is the number of times the sequence was observed.
150394	Tg#1249	1
150394	Tg#1251	2
215260	Nso#65	1
215260	Nso#129	4
16073	Vg#h#11	1
...
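The three-column layout described above (sequence ID, sample ID, optional count, separated by tabs) can be read with a few lines of Python. This is a minimal sketch with a hypothetical function name; treating a missing count as 1 is an assumption based on the abundance column being optional.

```python
def parse_sample_mapping(lines):
    """Return {sample_id: {sequence_id: count}} from tab-delimited lines."""
    table = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        sequence_id, sample_id = fields[0], fields[1]
        # Assumption: a missing abundance column means one observation.
        count = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        per_sample = table.setdefault(sample_id, {})
        per_sample[sequence_id] = per_sample.get(sequence_id, 0) + count
    return table


if __name__ == "__main__":
    rows = ["150394\tTg#1249\t1", "150394\tTg#1251\t2", "215260\tNso#65"]
    mapping = parse_sample_mapping(rows)
    print(len(mapping), "samples")
    print(mapping["Tg#1251"])
```

Counting the keys of the result gives the number of samples, which is a convenient check against the sample limit mentioned at the top of this page.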
For the purpose of this tutorial, we provide a sample ID mapping file called fastunifrac_Ley_et_al_NRM_2_sample_id_ip sample ID mapping file.
3. Create a category mapping file.
The category mapping file relates sample names in the sample ID mapping file to their related metadata (defined via subcategory columns) and descriptions of where the samples came from. The descriptions can be accessed throughout the results interface in order to make the results easier to interpret. The subcategory columns allow for dynamic coloring of PCoA results in the 3D viewer to determine which categories are related to which principal coordinate axes.
For the purpose of this tutorial, we provide a category mapping file called Ley et al example category mapping file with 4 subcategory columns that define for each sample (1) which sample type it is from (EnvType), (2) whether the sample came from a free-living bacterial assemblage or from the gut (FreelivingGut), (3) whether the free-living communities were saline or nonsaline (SalineNon), and (4) whether they were from aquatic (Water) or "Particulate" samples such as soils and sediments (WaterPartic). There is also a short description of each sample in the final column.
The file format is tab-delimited text. The first line is a header line that must start with a "#" character.
Optionally, a general description of the input files can be included in the lines immediately following the header line that start with a "#". This description will be included in the upload and results screens so that relevant information can be easily accessed.
The first column must be named SampleID and must contain unique (short, meaningful) sample IDs containing only alphanumeric characters (with the exception of ".", "+", and "#" characters).
The second through (n-1)th columns are subcategories. These can be anything (random assignment if you want), but each subcategory should have a small number of distinct values (<= number of samples). There must be at least two unique values for each category.
The last column must be named "Description" and contains the short descriptions for the samples.
#SampleID	EnvType	FreelivingG	Description
#General description of analysis line 1 (optional)
#General description of analysis line 2 (optional)
#...
Tg#1249	Termite Gut	G	Whole gut of the wood-feeding termite
Tg#1251	Termite Gut	G	Whole gut of the fungus-growing termite Macrotermes gilvus
Nso#65	Soil	Freelivi	Uncultivated agricultural soil in Wisconsin
Nso#1209	Soil	Freelivi	Soil from a fertilized Switzerland plot in the DOK.
Vg#h#111	Vertebrate Gut	G	Feces from Angolan Colobus Monkey from the St Louis Zoo.
...
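The format rules listed above lend themselves to a simple pre-upload check. The following Python sketch tests only the rules stated in this section; it is not the validation performed by the Fast UniFrac server, the function name is hypothetical, and it assumes that any line starting with "#" after the header is a description line.

```python
import re

# Alphanumeric characters plus the three allowed punctuation characters.
VALID_ID = re.compile(r"^[A-Za-z0-9.+#]+$")


def check_category_mapping(lines):
    """Return a list of problems found in a category mapping file."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if not lines or not lines[0].startswith("#"):
        return ["header line must start with '#'"]
    problems = []
    header = lines[0][1:].split("\t")
    if header[0] != "SampleID":
        problems.append("first column must be named SampleID")
    if header[-1] != "Description":
        problems.append("last column must be named Description")
    # Assumption: remaining '#' lines are general description lines.
    rows = [line.split("\t") for line in lines[1:] if not line.startswith("#")]
    ids = [row[0] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("sample IDs must be unique")
    for sample_id in ids:
        if not VALID_ID.match(sample_id):
            problems.append(f"invalid characters in sample ID {sample_id!r}")
    for col, name in enumerate(header[1:-1], start=1):
        values = {row[col] for row in rows if len(row) > col}
        if len(values) < 2:
            problems.append(f"subcategory {name!r} needs at least two distinct values")
    return problems
```

An empty list means the file passed these particular checks; it does not guarantee that the upload will be accepted.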
For the purpose of this tutorial, we provide a category mapping file called fastunifrac_Ley_et_al_NRM_3_category_ip sample ID mapping file.
4. Go to the Fast UniFrac web site.
If you're reading this tutorial, you already know how to get here. You will need to register and log in to complete the tutorial, because we restrict the number of sequences that unregistered users can analyze. The reason for this is that many of the analyses are computationally expensive, so we need to keep track of which groups are using a lot of resources to ensure fair access for everyone. Please note that if you have previously registered for the original UniFrac interface, you will have to contact microbiomehelp@colorado.edu to register for Fast UniFrac. We apologize for this inconvenience.
5. The Fast UniFrac upload screen
After you have logged in, you have to upload your sample ID mapping file and your category mapping file. To get to the upload page, click 'Get data' on the Tools panel and then 'Upload file':
Then, the upload page will appear:
First, upload your sample ID mapping file. Click 'Browse' below where it says File, and navigate to your sample ID mapping file (in this case, fastunifrac_Ley_et_al_NRM_2_sample_). One common problem is that you might have your sample ID mapping file saved as a Word document: this will NOT work, because Word uses a proprietary file format that is difficult for other programs to read. If you are saving your sample ID mapping file from Word, remember to save it as Plain Text, NOT as Microsoft Word. If you are using Excel, save as Tab-delimited Text. At the end of this process, your screen should look like this:
state - blue color) will appear in the history panel:
While the sample ID mapping file is uploading, you can start the category mapping file upload. To upload your category mapping file, follow the steps above, but now navigate to your category mapping file (in this case, fastunifrac_Ley_et_al_NRM_3_). This file is most easily created in Excel; remember to save it as Tab-delimited Text.
If you have your own tree file, you can upload it following the same steps. In this tutorial, we will use the 'GreenGenes Core - May 2009' tree, which is already on the system.
Once all the files are uploaded (the datasets in the history panel are shown in green), you can start any of the available analyses in Fast UniFrac.
6. Measuring the overall difference between each pair of samples.
To generate the raw distances between each pair of samples using the UniFrac metric, first choose the Sample Distance Matrix option from the Tool panel, under the 'Fast UniFrac' section.
On the Sample Distance Matrix page you can select the reference tree, sample ID mapping file and category mapping file you want to use for the analysis. First, select the 'GreenGenes Core - May 2009' tree using the drop-down menu below 'Select reference tree'. Next, select the '1: fastunifrac_Ley_et_al_NRM_2_sample_id_' file and the '2: fastunifrac_Ley_et_al_NRM_3_category_' file using the drop-down menus below 'Select sample ID mapping file' and 'Select category mapping file', respectively. If you then click the 'Execute' button, you will get a message saying that your job has been submitted to the queue, and two new datasets will appear in the History panel. When the datasets are green (the time depends on server load), you can view them by clicking on the eye icon. The first dataset will display a screen containing the distance matrix that relates each pair of environments.
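As background for reading the distance matrix: unweighted UniFrac between two samples is the fraction of branch length in the tree that leads to descendants from only one of the two samples. The Python sketch below computes this on a toy tree. It is an illustration only and not the Fast UniFrac implementation; the data structures (parents, tip_samples), the tree and the sample names A and B are all hypothetical.

```python
def unweighted_unifrac(parents, tip_samples, sample_a, sample_b):
    """Unweighted UniFrac distance between two different samples.

    parents: {node: (parent, branch_length)}; the root has no entry.
    tip_samples: {tip: set of sample IDs observed at that tip}.
    """
    pair = {sample_a, sample_b}
    below = {}  # node -> which of the two samples occur beneath its branch
    for tip, samples in tip_samples.items():
        present = samples & pair
        node = tip
        while node in parents:  # walk from the tip up to the root
            below.setdefault(node, set()).update(present)
            node = parents[node][0]
    unique = shared = 0.0
    for node, present in below.items():
        length = parents[node][1]
        if len(present) == 1:
            unique += length  # branch leads to only one of the samples
        elif len(present) == 2:
            shared += length  # branch leads to both samples
    total = unique + shared
    return unique / total if total else 0.0


if __name__ == "__main__":
    # Toy tree: ((t1:1,t2:1)n1:1,(t3:1,t4:1)n2:1)root;
    parents = {
        "t1": ("n1", 1.0), "t2": ("n1", 1.0), "n1": ("root", 1.0),
        "t3": ("n2", 1.0), "t4": ("n2", 1.0), "n2": ("root", 1.0),
    }
    tips = {"t1": {"A"}, "t2": {"B"}, "t3": {"A"}, "t4": {"A"}}
    print(unweighted_unifrac(parents, tips, "A", "B"))
```

In the toy data, five of the six unit-length branches lead to only one sample, so the distance is 5/6. Two samples confined to separate clades give 1.0, and two samples found at exactly the same tips give 0.0.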
