构建NCBI本地BLAST数据库(NRNT等)blastxdiamond使⽤⽅法blast。。。:
如何下载 NCBI NR NT数据库?
下载blast:
先了解BLAST Databas:
如何下载NCBI blast数据库?
NCBI提供了⼀个⾮常智能化的脚本update_blastdb.pl来⾃动下载所有blast数据库。
脚本使⽤⽅法:
perl update_blastdb.pl nr
有哪些可供下载的blast数据库?
perl update_blastdb.pl --showall
该命令会显⽰所有可供下载的blast数据库,请⾃⾏选择:
16SMicrobial
cdd_delta
env_nr
env_nt
est
est_human
est_mou
est_others
gss
gss_annot
htgs
human_genomic
landmark
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
ref_prok_rep_genomes
ref_viroids_rep_genomes
教师节邮件ref_virus_rep_genomes
refq_genomic
refq_protein
refq_rna
refqgene
sts
swissprot
taxdb
tsa_nr
tsa_nt
vector
这⾥我选择的是nr数据库。
nohup perl update_blastdb.pl --decompress nr >out.log 2>&1 &
⾃动在后台下载,然后⾃动解压。(下载到⼀半断⽹了,在运⾏会接着下载,⽽不会覆盖已经下载好的⽂件)
blast如何使⽤?
这⾥只演⽰blastx的使⽤⽅法。
刚才下载的nr库就是蛋⽩库,blastx就是⽤来将核酸序列⽐对到蛋⽩库上的。(nt就是核酸库)
因为我们下载的是已经建好索引的数据库,所以省去了makeblastdb的过程。
常见的命令有下⾯⼏个:
大花蕙兰
-query <File_In> 要查询的核酸序列
-db <String> 数据库名字
-out <File_Out> 输出⽂件
-evalue <Real> evalue阈值
-outfmt <String> 输出的格式
blast构建索引 | makeblastdb
makeblastdb -in mature.fa -input_type fasta -dbtype nucl -title miRBa -par_qids -out miRBa -logfile File_Name
-in 后接输⼊⽂件,你要格式化的fasta序列
-dbtype 后接序列类型,nucl为核酸,prot为蛋⽩
-title 给数据库起个名,好看~~(不能⽤在后⾯搜索时-db的参数)
-par_qids 推荐加上,现在有啥原因还没搞清楚
-out 后接数据库名,⾃⼰起⼀个有意义的名字,以后blast+搜索时要⽤到的-db的参数
-logfile ⽇志⽂件,如果没有默认输出到屏幕
资源消耗
blastx -d.transcript.fasta -db nr -out test.blastx.out
其中fasta⽂件只有19938⾏。
可是运⾏起来耗费了很多资源:
平均内存消耗:51.45G;峰值:115.37G
cpu:1个
运⾏时间:06:00:24(你敢信?这才是⼀个⼩⼩的test)
所以我强烈推荐⽤diamond替代blast来做数据库搜索。
blast结果解读
每⼀个合格的序列⽐对都会给出⼀个这样的结果(⼀个query quence⽐对到多个就有多个结果):
>AAB70410.1 Similar to Schizosaccharomyces CCAAT-binding factor (gb|U88525).
EST gb|T04310 comes from this gene [Arabidopsis thaliana]
Length=208
Score = 238 bits (607), Expect = 7e-76, Method: Compositional matrix adjust.
Identities = 116/145 (80%), Positives = 127/145 (88%), Gaps = 2/145 (1%)
Frame = +1
Query 253 FWASQYQEIEQTSDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR 432
FW +Q++EIE+T+DFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR
Sbjct 39 FWENQFKEIEKTTDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR 98
Query 433 SWNHTEENKRRTLQKNDIAAAITRNEIFDFLVDIVPREDLKDEVLASIPRGTLPMGAPTE 612一马当先的意思
SWNHTEENKRRTLQKNDIAAA+TR +IFDFLVDIVPREDL+DEVL SIPRGT+P A
Sbjct 99 SWNHTEENKRRTLQKNDIAAAVTRTDIFDFLVDIVPREDLRDEVLGSIPRGTVPEAA-AA 1
57
Query 613 GLPYYYMQPQHAPQVGAPGMFMGKP 687
G PY Y+ AP +G PGM MG P
Sbjct 158 GYPYGYLPAGTAP-IGNPGMVMGNP 181
结果解读⽹上很多,这⾥不啰嗦了。
以下是我在同样条件下测试的diamond:
平均内存消耗:11.01G;峰值:12.44G
cpu:1个(571.17%)也就是会⾃动占⽤5-6个cpu
运⾏时间:00:26:15
⽽且diamond注明了,它的优势是处理>1M 的query,量越⼤速度越快。
diamond的简单⽤法:
diamond makedb --in nr.fa -d nr
diamond blastx -d nr -d.transcript.fasta -o test.matches.m8
但是diamond使⽤有限制,只能⽤于⽐对蛋⽩数据库。
以下是OrfPredictor推荐的参数设置:
To minimize the file size of BLASTX output for loading, the following parameters are recommended if the BLASTX in the 'NCBI-blastall' package is ud: "-v 1 -b 1 -e 1e-5" (Note: we ud version 2.2.19 - earlier or later versions may not work pro 下⾯是详细的blastx帮助⽂档,以供查阅:
$ blastx -help
USAGE
blastx [-h] [-help] [-import_arch_strategy filename]
[-export_arch_strategy filename] [-task task_name] [-db databa_name]
[-dbsize num_letters] [-gilist filename] [-qidlist filename]
美国竹柳
[-negative_gilist filename] [-negative_qidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-archsp int_value]
[-sum_stats bool_value] [-max_intron_length length] [-g SEG_options]
[-soft_masking soft_masking] [-matrix matrix_name]
[-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-ungapped] [-lca_masking] [-query_loc range]
[-strand strand] [-par_deflines] [-query_gencode int_value]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_qs num_quences] [-num_threads int_value] [-remote]
[-comp_bad_stats compo] [-u_sw_tback] [-version]
DESCRIPTION
Translated Query-Protein Subject BLAST 2.7.1+
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore all other parameters
-help
Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
-version
Print version number; ignore other arguments
*** Input query options
-query <File_In>
Input file name
Default = `-'
-query_loc <String>
Location on the query quence in1-bad offts (Format: start-stop)
-
strand <String, `both', `minus', `plus'>
任贤使能
Query strand(s) to arch against databa/subject
Default = `both'
-query_gencode <Integer, values between: 1-6, 9-16, 21-25>
Genetic code to u to translate query (e ur manual for details)
Default = `1'
*** General arch options
-task <String, Permissible values: 'blastx''blastx-fast' >
Task to execute
Default = `blastx'
-db <String>
BLAST databa name
* Incompatible with: subject, subject_loc
-out <File_Out>
Output file name
Default = `-'
-evalue <Real>
Expectation value (E) threshold for saving hits
Default = `10'
-word_size <Integer, >=2>
Word size for wordfinder algorithm
-gapopen <Integer>
Cost to open a gap
-gapextend <Integer>
Cost to extend a gap
-max_intron_length <Integer, >=0>
Length of the largest intron allowed in a translated nucleotide quence
when linking multiple distinct alignments
Default = `0'
-matrix <String>
Scoring matrix name (normally BLOSUM62)
-threshold <Real, >=0>
Minimum word score such that the word is added to the BLAST lookup table -comp_bad_stats <String>
U composition-bad statistics:
D or d: default (equivalent to 2 )
0 or F or f: No composition-bad statistics
1: Composition-bad statistics as in NAR 29:2994-3005, 2001
2 or T or t : Composition-bad score adjustment as in Bioinformatics 21:902-911,
2005, conditioned on quence properties
3: Composition-bad score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally
Default = `2'
*** BLAST-2-Sequences options
-subject <File_In>
Subject quence(s) to arch
* Incompatible with: db, gilist, qidlist, negative_gilist,
negative_qidlist, db_soft_mask, db_hard_mask
-subject_loc <String>
Location on the subject quence in1-bad offts (Format: start-stop)
* Incompatible with: db, gilist, qidlist, negative_gilist,
negative_qidlist, db_soft_mask, db_hard_mask, remote
*** Formatting options
-outfmt <String>
alignment view options:
0 = Pairwi,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-parated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
通过16 = Single-file BLAST XML2,
18 = Organism Report
Options 6, 7 and 10 can be additionally configured to produce
a custom format specified by space delimited format specifiers.
The supported format specifiers are:
qqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query quence length
sqid means Subject Seq-id
sallqid means All subject Seq-id(s), parated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject quence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
nd means End of alignment in subject
qq means Aligned part of query quence
sq means Aligned part of subject quence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames parated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), parated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), parated by a ';' scomnames means unique Subject Common Name(s), parated by a ';' sblastnames means unique Subject Blast Name(s), parated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), parated by a ';' (in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), parated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
When not provided, the default value is:
'qaccver saccver pident length mismatch gapopen qstart qend sstart nd evalue bitscore', which is equivalent to the keyword 'std'
Default = `0'
-show_gis
Show NCBI GIs in deflines?
-num_descriptions <Integer, >=0>
Number of databa quences to show one-line descriptions for
Not applicable for outfmt > 4
Default = `500'
* Incompatible with: max_target_qs
-num_alignments <Integer, >=0>
Number of databa quences to show alignments for
Default = `250'
* Incompatible with: max_target_qs梦到自己大哭
-line_length <Integer, >=1>
Line length for formatting alignments
Not applicable for outfmt > 4
Default = `60'
-html
Produce HTML output?
*** Query filtering options
-g <String>
Filter query quence with SEG (Format: 'yes', 'window locut hicut', or
'no' to disable)
Default = `122.22.5'
-soft_masking <Boolean>
Apply filtering locations as soft masks
Default = `fal'
-
lca_masking
U lower ca filtering in query and subject quence(s)?
*** Restrict arch or results
-gilist <String>
Restrict arch of databa to list of GI's
* Incompatible with: negative_gilist, qidlist, negative_qidlist,
remote, subject, subject_loc
-qidlist <String>
Restrict arch of databa to list of SeqId's
* Incompatible with: gilist, negative_gilist, negative_qidlist, remote,
subject, subject_loc
-
negative_gilist <String>
Restrict arch of databa to everything except the listed GIs
* Incompatible with: gilist, qidlist, remote, subject, subject_loc
-negative_qidlist <String>
Restrict arch of databa to everything except the listed SeqIDs
* Incompatible with: gilist, qidlist, remote, subject, subject_loc
-entrez_query <String>
Restrict arch with the given Entrez query
* Requires: remote
-db_soft_mask <String>
Filtering algorithm ID to apply to the BLAST databa as soft masking
* Incompatible with: db_hard_mask, subject, subject_loc
-db_hard_mask <String>
Filtering algorithm ID to apply to the BLAST databa as hard masking
* Incompatible with: db_soft_mask, subject, subject_loc
-qcov_hsp_perc <Real, 0..100>
Percent query coverage per hsp
-max_hsps <Integer, >=1>
Set maximum number of HSPs per subject quence to save for each query
-culling_limit <Integer, >=0>
If the query range of a hit is enveloped by that of at least this many
higher-scoring hits, delete the hit
* Incompatible with: best_hit_overhang, best_hit_score_edge
-best_hit_overhang <Real, (>0 and <0.5)>
Best Hit algorithm overhang value (recommended value: 0.1)
* Incompatible with: culling_limit
-best_hit_score_edge <Real, (>0 and <0.5)>
Best Hit algorithm score edge value (recommended value: 0.1)
* Incompatible with: culling_limit
-max_target_qs <Integer, >=1>
Maximum number of aligned quences to keep
Not applicable for outfmt <= 4
Default = `500'
* Incompatible with: num_descriptions, num_alignments
*** Statistical options
-dbsize <Int8>
Effective length of the databa
-archsp <Int8, >=0>
Effective length of the arch space
-sum_stats <Boolean>
U sum statistics
*** Search strategy options
-import_arch_strategy <File_In>
Search strategy to u
* Incompatible with: export_arch_strategy
-export_arch_strategy <File_Out>
File name to record the arch strategy ud
* Incompatible with: import_arch_strategy
*** Extension options
-xdrop_ungap <Real>
X-dropoff value (in bits) for ungapped extensions
-xdrop_gap <Real>
X-dropoff value (in bits) for preliminary gapped extensions
-xdrop_gap_final <Real>
X-dropoff value (in bits) for final gapped alignment
-
window_size <Integer, >=0>
Multiple hits window size, u 0 to specify 1-hit algorithm
-ungapped
Perform ungapped alignment only?
*** Miscellaneous options
-par_deflines
Should the query and subject defline(s) be pard?
-num_threads <Integer, (>=1 and =<24)>
Number of threads (CPUs) to u in the BLAST arch
Default = `1'
* Incompatible with: remote
-
remote
Execute arch remotely?
* Incompatible with: gilist, qidlist, negative_gilist,
negative_qidlist, subject_loc, num_threads
-u_sw_tback
Compute locally optimal Smith-Waterman alignments?
以下是copy的详细英⽂教学:
1. Quick Start
Get all numbered files for a databa with the same ba name: Each of the files reprents a subt (volume) of that databa, and all of them are needed to reconstitute the databa.
After extraction, there is no need to concatenate the resulting files:Call the databa with the ba name, for nr databa files, u "-db nr". 这些数据库是已经预先进⾏过makeblastdb命令的,下载后
可以直接使⽤
For easy download, u the update_blastdb.pl script from the blast+ package.
Incremental update is not available.
2. General Introduction
The pre-formatted databas offer the following advantages:
Pre-formatting removes the need to run makeblastdb; ⽆需再运⾏建库命令⾏
Species-level taxonomy ids are included for each databa entry;
Databas are broken into smaller-sized volumes and are therefore easier to download;
Sequences in FASTA format can be generated from the pre-formatted databas by using the blastdbcmd utility;可以从这些数据库⽂件中导出FASTA⽂件
A convenient script (update_blastdb.pl) is available in the blast+ package to download the pre-formatted databas. 可⽤该脚本升级数据库
Pre-formatted databas must be downloaded using the update_blastdb.pl script or via FTP in binary mode. Documentation for this script can be obtained by running the script
without any arguments; Perl installation is required.
The compresd files downloaded must be inflated with gzip or other decompress tools. The BLAST databa files can then be extracted out of the resulting tar file using the tar
utility on Unix/Linux, or WinZip and StuffIt Expander on
Windows and Macintosh platforms, respectively. 下载的数据库为压缩包,要解压缩
Large databas are formatted in multiple one-gigabyte volumes, which are named using the baname.##. convention. All volumes with the same ba name are
required. An alias file is provided to tie individual volumes together so that the databa can be called using the ba name (without the .nal or .pal extension). For example, to
call the est databa, simply u "-db est" option in the command line (without the quotes). ⼤的数据库通常分为多个压缩包,例如nr库有11个压缩包。所有的相关压缩包都要下
载,解压。解压缩会⽣成对应的库⽂件,同时⽣成⼀个nr.pal⽂件。检索nr库时输⼊-d nr 即可。
Additional BLAST databas that are not provided in pre-formatted formats may be available in the FASTA subdirectory. For other genomic BLAST databas, plea check
the genomes ftp directory at:
3. Contents of the /blast/db/ directory
The pre-formatted BLAST databas are archived in this directory. The names of the databas and their contents are listed below.
+-----------------------------+------------------------------------------------+
File Name # Content Description
+-----------------------------+------------------------------------------------+
# Bacterial and Archaeal 16S rRNA quences from BioProjects 33175 and 33117
FASTA/ # Subdirectory for FASTA formatted quences
README # README for this subdirectory (this file)
Reprentative_Genomes.* # Reprentative bacterial/archaeal genomes databa
cdd_ # Conrved Domain Databa quences for u with stand alone deltablast
cloud/ # Subdirectory of databas for BLAST AMI; e v/TJAnEt
env_nr.* # Protein quences for metagenomes
env_nt.* # Nucleotide quences for metagenomes
# This file requires est_human.*., est_mou.*., and est_others.*. files to function. It contains the est.nal alias so that arches against est (-db est) will include est_human, est_mou and est_others. est_human.*. # Human subt of the est databa from the est division of GenBank, EMBL and DDBJ.
est_mou.*. # Mou subt of the est databasae
观念交付
est_others.*. # Non-human and non-mou subt of the est databa
gss.* # Sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.* # Sequences from the HTG division of GenBank, EMBL,and DDBJ
human_genomic.* # Human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs
nr.* # Non-redundant protein quences from GenPept, Swissprot, PIR, PDF, PDB, and NCBI RefSeq
nt.* # Partially non-redundant nucleotide quences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS.
other_genomic.* # RefSeq chromosome records (NC_) for non-human organisms
pataa.* # Patent protein quences
patnt.* # Patent nucleotide quences. Both patent databas are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ
pdbaa.* # Sequences for the protein structure from the Protein Data Bank
pdbnt.* # Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding quences for the corresponding pdbaa entries.
refq_genomic.* # NCBI genomic reference quences
refq_protein.* # NCBI protein reference quences
refq_rna.* # NCBI Transcript reference quences
sts.* # Sequences from the STS division of GenBank, EMBL,and DDBJ
# Swiss-Prot quence databa (last major update)
# Additional taxonomy information for the databas listed here providing common and scientific names
tsa_nt.* # Sequences from the TSA division of GenBank, EMBL,and DDBJ
# Vector quences from 2010, e Note 2 in ction 4.
wgs.* # Sequences from Whole Genome Shotgun asmblies
+-----------------------------+------------------------------------------------+
4. Contents of the /blast/db/FASTA directory
This directory contains FASTA formatted quence files. The file names and databa contents are listed below. The files must be unpacked and procesd through
blastdbcmd before they can be ud by the BLAST programs.
+-----------------------+-----------------------------------------------------+
File Name # Content Description #
+-----------------------+-----------------------------------------------------+
# translation of alu.n repeats
< # alu repeat elements (from 2003)
# CDS translations
< # genomic quences for drosophila (from 2003)
* # Protein quences for metagenomes, taxid 408169
* # Nucleotide quences for metagenomes, taxid 408169
* # human subt of the est databa (e Note 1)
* # mou subt of the est databa
* # non-human and non-mou subt of the est databa
<* # quences from the GSS division of GenBank, EMBL, and DDBJ
<* # quences from the HTG division of GenBank, EMBL, and DDBJ
* # human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs
< # human and mou immunoglobulin variable region nucleotide quences
< # human and mou immunoglobulin variable region protein quences
# CDS translations of complete mitochondrial genomes
< # complete mitochondrial genomes
<* # non-redundant protein quence databa with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq
<* # nucleotide quence databa, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant.
* # RefSeq chromosome records (NC_) for organisms other than human
<* # patent protein quences
<* # patent nucleotide quences. Both patent quence files are from the USPTO, or EPO/JPO via EMBL/DDBJ
<* # protein quences from pdb protein structures
<* # nucleotide quences from pdb nucleic acid structures. They are NOT the protein coding quences for the corresponding pdbaa entries.
<* # databa for quence tag site entries
<* # swiss-prot databa (last major relea)
< # vector quences from 2010. (See Note 2)
<* # whole genome shotgun genome asmblies
# protein translations
< # yeast genomes (from 2003)
+-----------------------+---------------------------------------------------+
NOTE:
(1) NCBI does not provide the complete est databa in FASTA format. One needs to get all three subts (est_human, est_mou, and est_others and concatenate them into
the complete est fasta databa).
(2) For screening for vector contamination, u the UniVec databa: