Tutorial 1: Processing RNA-Seq Illumina Paired End Data through Trinity De Novo
Trinity partitions the sequence data into many individual de Bruijn graphs, each representing the transcriptional complexity at a given gene or locus, and then processes each graph independently to extract full-length splicing isoforms and to tease apart transcripts derived from paralogous genes. Briefly, the process works like so:
•Inchworm assembles the RNA-Seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
•Chrysalis groups the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among the disjoint graphs.
•Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
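To make the de Bruijn graph idea concrete, here is a minimal sketch in Python. It is not Trinity's implementation: the function name `build_de_bruijn`, the toy reads, and the choice of k=4 are all assumptions made for illustration. It shows how two reads that share a prefix and then diverge (as two isoforms might) produce a branch point in the graph.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two made-up reads that agree at the start and differ near the end
reads = ["ACGTTGCA", "ACGTTCCA"]
graph = build_de_bruijn(reads, k=4)

# Nodes with more than one successor are where paths diverge
branch_nodes = sorted(node for node, successors in graph.items() if len(successors) > 1)
print(branch_nodes)
```

In this toy case the only branch node is the point where the two reads stop agreeing. Tracing each path through such branches is, loosely, the kind of work the Butterfly step is described as doing.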
To run Trinity, we can use either paired or unpaired reads. When using paired reads, you should be able to see /1 at the end of the read names in one file and /2 in the other.
<Example Left.fq>
@61DFRAAXX100204:1:100:10494:3070/1
ACTGCATCCTGGAAAGAATCAATGGTGGCCGGAAAGTGTTTTTCAAATACAAGAGTGACAATGTGCCCTGTTGTTT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCACCCCCCCCCCCCCCCCCCCCCCCC
@61DFRAAXX100204:1:100:10497:13422/1
GTAATTTCCGTACCTGCCACAGTGTGGGCTCACCCTGCTTAGAGGACAGGGAAGGACCCTAAAGGTAGGCTGATGC
+
CCCCCCCCCCCCCCCCCCCCCCDCDCCCCCCCCCCCCCCCCCCCDDCCDDCDCBDCCDDDDBADDADDB@DBBBA@
@61DFRAAXX100204:1:100:10546:4478/1
CTGGGCTGCAGCTAAGTTCTCTGCATCCTCCTTCTTGCTTGTGGCTGGGAAGAAGACAATGTTGTCGATGGTCTGG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC7CB@CA:>AB?C=C@@@@?A@?5:88:
<Example Right.fq>
@61DFRAAXX100204:1:100:10494:3070/2
CTCAAATGGTTAATTCTCAGGCTGCAAATATTCGTTCAGGATGGAAGAACATTTTCTCAGTATTCCATCTAGCTGC
+
C<CCCCCCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCCCCCCCCACCCCCACCC=
@61DFRAAXX100204:1:100:10497:13422/2
GAGTTACTGGTAAGACGCTTACACCTATAACTCAAGGTCGGAATAGTCCCTCCAGTCCCTTTAGTAACCCAGTGGC
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCACCC
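Before starting a long assembly it can help to confirm that the two files really are mates of each other. The following is a small Python sketch, not part of Trinity. The helper names `read_names` and `pairs_match` are hypothetical, it assumes plain four-line FASTQ records (no wrapped sequence lines), and the sequences in the demo are shortened stand-ins for real reads.

```python
import io


def read_names(handle):
    """Yield the read name from each 4-line FASTQ record (header minus '@')."""
    for line_number, line in enumerate(handle):
        # Use the line position, not a leading '@', because quality
        # strings may also begin with '@'.
        if line_number % 4 == 0:
            yield line.strip()[1:]


def pairs_match(left_handle, right_handle):
    """True if both files list the same fragments, in order, as /1 and /2."""
    left = list(read_names(left_handle))
    right = list(read_names(right_handle))
    if len(left) != len(right):
        return False
    return all(
        l.endswith("/1") and r.endswith("/2") and l[:-2] == r[:-2]
        for l, r in zip(left, right)
    )


# Shortened stand-ins for the first record of each example file
left_fq = io.StringIO("@61DFRAAXX100204:1:100:10494:3070/1\nACTG\n+\nCCCC\n")
right_fq = io.StringIO("@61DFRAAXX100204:1:100:10494:3070/2\nCTCA\n+\nCCCC\n")
print(pairs_match(left_fq, right_fq))
```

For real data you would pass open file handles, for example `open("reads.left.fq")` and `open("reads.right.fq")`, in place of the `io.StringIO` objects.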
If you have strand-specific data, specify the library type. There are four library types:
•Paired reads:
o RF: first read (/1) of fragment pair is sequenced as antisense (reverse (R)), and second read (/2) is in the sense strand (forward (F)); typical of the dUTP/UDG sequencing method.
o FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)
•Unpaired (single) reads:
o F: the single read is in the sense (forward) orientation
o R: the single read is in the antisense (reverse) orientation
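The four library types can be read as a rule for which reads need to be reverse-complemented to land on the sense strand. The Python sketch below simply encodes the definitions above. The names `ANTISENSE_READS` and `to_sense` are made up for this illustration and are not taken from Trinity's code.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Which reads are sequenced as antisense for each library type,
# following the definitions in the list above.
# "1" means the /1 read (or the single read), "2" means the /2 read.
ANTISENSE_READS = {
    "RF": {"1"},
    "FR": {"2"},
    "F": set(),
    "R": {"1"},
}


def to_sense(seq, lib_type, mate="1"):
    """Return the read as it would appear on the sense strand."""
    if mate in ANTISENSE_READS[lib_type]:
        return reverse_complement(seq)
    return seq


print(to_sense("AACG", "RF", mate="1"))
print(to_sense("AACG", "RF", mate="2"))
```

With an RF library the /1 read is flipped and the /2 read is left alone; with FR it is the other way around.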
Once we have transferred the files to the server, running Trinity is relatively simple since it runs stepwise through the "Trinity" pipeline. Navigate to the directory where you have your sequence files (tutorial1and2 folder), and run the following command:
Example Paired Run:
Trinity.pl --seqType fq --left reads.left.fq --right reads.right.fq --SS_lib_type RF --paired_fragment_length 280 --min_contig_length 305 --CPU 4 --bfly_opts "-V 10 --stderr"
Note: --bfly_opts "-V 10 --stderr" is set so that the verbose level is high and the output will print to the screen.
Here --left <FILENAME> is the file holding the first (/1) reads of each pair, and --right <FILENAME> is the file holding the second (/2) reads. The example Left.fq and Right.fq files earlier in this tutorial show the difference.
By setting the --SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.
If strand-specific data, set:
--SS_lib_type <string> :if paired: RF or FR, if single: F or R
Butterfly-related options:
--bfly_opts <string> :parameters to pass through to butterfly (see butterfly documentation).
--bflyHeapSpace <string> :java heap space setting for butterfly (default: 1000M) => yields command java -Xmx1000M -jar Butterfly.jar ... $bfly_opts
--no_run_butterfly :stops after the Chrysalis stage. You'll need to run the Butterfly computes separately, such as on a computing grid.
Inchworm-related options:
--no_meryl :do not use meryl for computing the k-mer catalog (default: uses meryl, providing improved runtime performance)
--min_kmer_cov <int> :min count for K-mers to be assembled by Inchworm (default: 1)
Misc:
--CPU <int> :number of CPUs to use, default: 2
--min_contig_length <int> :minimum assembled contig length to report (def=200)
--paired_fragment_length <int> :maximum length expected between fragment pairs (aim for the 90th percentile) (def=300)
--jaccard_clip :option, set if you have paired reads and you expect high gene density with UTR overlap (use FASTQ input file format for reads).
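The advice to aim for the 90th percentile when choosing --paired_fragment_length can be turned into a quick calculation if you have an estimate of your library's fragment lengths. The Python sketch below uses the nearest-rank percentile method. The function name and the list of fragment lengths are invented for illustration; real values would have to come from your own library information.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up fragment lengths (in bases) standing in for a real library
fragment_lengths = [210, 230, 240, 250, 255, 260, 270, 280, 290, 340]

suggested = percentile_nearest_rank(fragment_lengths, 90)
print(suggested)
```

The idea is that a single unusually long fragment (340 here) does not drive the setting, while most real pairs still fall within the chosen length.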
Other important considerations:
•Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved. If you know that your data are strand-specific, use the --SS_lib_type flag to describe the data.
•Whether you use Fastq or Fasta formatted input files, be sure to keep the reads oriented as they are reported by Illumina if the data are strand-specific, because Trinity will properly orient the sequences according to the specified library type. If the data are not strand-specific, no worries, because the reads will be parsed in both orientations.
•If you do not have strand-specific data, and you do not plan to use the --jaccard_clip option, you can combine all your reads into a single fastq or fasta file and use the --single option. You can also combine paired reads and single reads, as long as the paired reads are recognized by having the same accession prefix with /1 and /2 to discriminate between paired ends.
•If you have multiple paired-end library fragment sizes, set the --paired_fragment_length according to the larger insert library. Pairings that exceed that distance will be treated as if they were unpaired by the Butterfly process. Trinity's defaults are tuned to a library with an ~300 base fragment length.
•By setting the --CPU option, you are indicating:
o the number of threads for Inchworm to use (in most cases, Inchworm multithreading does not currently lead to performance gains. In future releases, this may change).
o most importantly, the number of Butterfly executions that will occur simultaneously.
For loblolly, take note that the --CPU option can be set to 4 instead of the default of 2.
Tutorial 2: Process for Read Alignment, Visualization, and Abundance Estimation with Paired End
Once you have finished running Trinity.pl, you can process this output to visualize and get abundance estimations.
1. Align reads to the Trinity transcripts using the util/alignReads.pl script, which can leverage Bowtie, BLAT, or BWA as the aligner.
Caution should be taken in using this wrapper and the modified tools, because there are advantages and disadvantages to each, as described below:
a. Bowtie: Abundance estimation using RSEM (as described below) currently leverages Bowtie gap-free alignments. Running Bowtie (the original, not the newer Bowtie 2, which is still being investigated) with paired fragment reads will exclude alignments where only one of the mate pairs aligns. Since Trinity doesn't perform scaffolding across sequencing gaps yet, there will be cases (more so in fragmented transcripts corresponding to lowly expressed transcripts) where only one of the mate-pairs aligns. The alignReads.pl script operates similarly to TopHat in that it runs Bowtie to align each of the paired fragment reads separately, and then groups them into pairs afterwards. We capture both the paired and the unpaired fragment read alignments from Bowtie for visualization and examining read support for the transcript assemblies. The properly-mapped pairs are further extracted and can be used as a substrate for RSEM-based abundance estimation (see below).
b. BLAT: we've found BLAT to be particularly useful in generating spliced short-read alignments to targets where short introns exist. We include BLAT here only for exploratory purposes.
c. BWA: the modified version of BWA provides SAM entries for the alternative mappings of each multiply mapped read, but grouping of pairs is performed by the alignReads.pl script, and the total number of alignments reported tends to be substantially less than running the latest version of BWA in paired mode without having the multiply mapped individual reads. BWA is recommended specifically for SNP-calling exercises, and we're continuing to explore the various options available, including further tweaks here.
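The Bowtie strategy described in item a (align each mate on its own, then group into pairs afterwards) can be pictured with a short Python sketch. This is a simplification and not the logic of alignReads.pl: here a pair counts as "proper" merely when both mates hit the same transcript, whereas a real aligner wrapper would also consider orientation and distance. The function name `group_mates` and all fragment names and positions are made up.

```python
def group_mates(left_hits, right_hits):
    """Group separately aligned mates into pairs and leftover single reads.

    left_hits / right_hits map a fragment name to (transcript, position).
    """
    proper_pairs = {}
    unpaired = {}
    for name in sorted(set(left_hits) | set(right_hits)):
        left = left_hits.get(name)
        right = right_hits.get(name)
        if left is not None and right is not None and left[0] == right[0]:
            # Simplified rule: both mates aligned to the same transcript
            proper_pairs[name] = (left, right)
        else:
            # Keep whichever mates did align as unpaired reads
            if left is not None:
                unpaired[name + "/1"] = left
            if right is not None:
                unpaired[name + "/2"] = right
    return proper_pairs, unpaired


# Hypothetical alignments: frag1 has both mates, frag2 and frag3 only one
left_hits = {"frag1": ("comp0_c0_seq1", 100), "frag2": ("comp1_c0_seq1", 40)}
right_hits = {"frag1": ("comp0_c0_seq1", 310), "frag3": ("comp9_c0_seq1", 5)}

proper_pairs, unpaired = group_mates(left_hits, right_hits)
print(sorted(proper_pairs))
print(sorted(unpaired))
```

Keeping the single-mate alignments is what lets fragmented, lowly expressed transcripts still show read support, while only the paired set is handed on for abundance estimation.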
2. Run from the tutorial1and2 directory:
/opt/trinityrnaseq_r2011-10-29/util/alignReads.pl --left reads.left.fq --right reads.right.fq --seqType fq --target trinity_out_dir/Trinity.fasta --aligner bowtie --SS_lib_type RF
Note: if your data are strand-specific, be sure to set --SS_lib_type as done with Trinity.pl
3. This alignment generates a lot of output files. The b dSorted.bam file contains both properly-mapped pairs and single unpaired fragment reads. This file can be used for visualizing the alignments and coverage data using IGV. The *nameSorted*PropMapPairsForRSEM.bam contains only the properly-mapped pairs for use with the RSEM software. We will be using the dSorted.bam file to visualize the data with IGV.
4. To use IGV, get it from: http:///igv/ .
5. Once you have the program running, use the Import Genome tool to load the Trinity.fasta file as a genome. Also, load the b dSorted.bam file containing the aligned reads. You will need to transfer the associated dSorted.bam.bai file (the index), so that IGV will load the bam file.
Note: If after loading the genome and the bam file you still cannot see any data, use the zoom tool in the top right corner to zoom in. Also, clicking on the nucleotides in the bottom sequence window will toggle a three-frame translation. This can then be flipped by right-clicking in the same window to get the other three-frame translation.
6. RSEM is enormously useful for abundance estimation in the context of transcriptome assemblies. RSEM can be downloaded here: deweylab.biostat.wisc.edu/rsem/. However, RSEM is currently included with Trinity, since Trinity uses a slightly modified version.
7. Run
/opt/trinityrnaseq_r2011-10-29/util/RSEM_util/run_RSEM.pl --transcripts trinity_out_dir/Trinity.fasta --name_sorted_bam bowtie_out/bowtie_out.nameSorted.sam.+.sam.PropMapPairsForRSEM.bam --paired --group_by_component
This will run RSEM to estimate read abundance.
8. Execute
/opt/trinityrnaseq_r2011-10-29/util/RSEM_util/summarize_RSEM_fpkm.pl --transcripts trinity_out_dir/Trinity.fasta --RSEM sults --fragment_length 300 --group_by_component | tee Trinity.RSEM.fpkm
This will summarize the RSEM FPKM values into an easy-to-read text file named Trinity.RSEM.fpkm.
Trinity.RSEM.fpkm File
#Total fragments mapped to transcriptome: 24114.01
transcript      length  eff_length  count    fraction  fpkm      %comp_fpkm
comp20_c0_seq1  349     50          3.00     5.67e-03  2488.18   100.00
comp0_c0_seq1   3739    3440        531.56   2.03e-02  6408.03   11.02
comp0_c0_seq2   3697    3398        4240.44  1.64e-01  51750.92  88.98
comp9_c0_seq1   5528    5229        192.07   4.83e-03  1523.25   12.45
comp9_c0_seq2   5399    5100        1317.93  3.40e-02  10716.49  87.55
comp19_c0_seq1  433     134         2.00     1.87e-03  618.95    100.00
comp1_c0_seq1   6716    6417        699.32   1.43e-02  4519.33   17.66
comp1_c0_seq2   6665    6366        2949.41  6.10e-02  19213.17  75.07
comp1_c0_seq3   3969    3670        6.08     2.18e-04  68.70     0.27
comp1_c0_seq4   3918    3619        123.99   4.51e-03  1420.79   5.55
comp1_c0_seq5   3152    2853        0.42     1.93e-05  6.10      0.02
comp1_c0_seq6   3101    2802        24.79    1.16e-03  366.89    1.43
comp32_c0_seq1  562     263         7.00     3.45e-03  1103.76   100.00
comp10_c0_seq1  3823    3524        610.19   2.28e-02  7180.58   90.22
comp10_c0_seq2  3715    3416        50.42    1.94e-03  612.09    7.69
comp10_c0_seq3  2749    2450        0.00     1.29e-07  0.00      0.00
comp10_c0_seq4  2641    2342        9.39     5.27e-04  166.27    2.09
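To see how the columns relate, the Python sketch below recomputes two of them from the others for three rows of the table. It rests on observations from this table rather than on the script's source: the fpkm column is consistent with count x 10^9 / (eff_length x total mapped fragments), eff_length equals length - 300 + 1 in every row shown (matching the --fragment_length 300 used above), and %comp_fpkm looks like each transcript's share of the summed FPKM within its component. The helper names are made up, and the transcript IDs are assumed to follow a component_seqN pattern.

```python
from collections import defaultdict

TOTAL_FRAGMENTS = 24114.01  # from the "#Total fragments mapped" line

# (transcript, length, eff_length, count, reported fpkm), copied from the table
ROWS = [
    ("comp0_c0_seq1", 3739, 3440, 531.56, 6408.03),
    ("comp0_c0_seq2", 3697, 3398, 4240.44, 51750.92),
    ("comp20_c0_seq1", 349, 50, 3.00, 2488.18),
]


def fpkm(count, eff_length, total_fragments):
    """Fragments per kilobase of effective length per million mapped fragments."""
    return count * 1e9 / (eff_length * total_fragments)


def component_of(transcript):
    """Strip the trailing _seqN to get the component name (assumed ID pattern)."""
    return transcript.rsplit("_seq", 1)[0]


# Sum the reported FPKM per component
component_total = defaultdict(float)
for name, _, _, _, reported in ROWS:
    component_total[component_of(name)] += reported

results = {}
for name, length, eff_length, count, reported in ROWS:
    recomputed = fpkm(count, eff_length, TOTAL_FRAGMENTS)
    share = 100 * reported / component_total[component_of(name)]
    results[name] = (recomputed, share)
    print(f"{name}\t{recomputed:.2f}\t{share:.2f}")
```

Reading the table this way, comp0_c0_seq2 carries most of the expression of component comp0, while comp20_c0_seq1 is the only transcript in its component and so accounts for all of it.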