首页 > 美文鉴赏

构建NCBI本地BLAST数据库（NRNT等）blastxdiamond使用方法blast。。。

更新时间:2023-05-16 07:28:46 阅读：评论：0

构建NCBI本地BLAST数据库（NRNT等）blastxdiamond使⽤⽅法blast。。。：

如何下载 NCBI NR NT数据库？

下载blast：

先了解BLAST Databas：

如何下载NCBI blast数据库？

NCBI提供了⼀个⾮常智能化的脚本update_blastdb.pl来⾃动下载所有blast数据库。

脚本使⽤⽅法：

perl update_blastdb.pl nr

有哪些可供下载的blast数据库？

perl update_blastdb.pl --showall

该命令会显⽰所有可供下载的blast数据库，请⾃⾏选择：

16SMicrobial

cdd_delta

env_nr

env_nt

est

est_human

est_mou

est_others

gss

gss_annot

htgs

human_genomic

landmark

other_genomic

pataa

patnt

pdbaa

pdbnt

ref_prok_rep_genomes

ref_viroids_rep_genomes

教师节邮件ref_virus_rep_genomes

refq_genomic

refq_protein

refq_rna

refqgene

sts

swissprot

taxdb

tsa_nr

tsa_nt

vector

这⾥我选择的是nr数据库。

nohup perl update_blastdb.pl --decompress nr >out.log 2>&1 &

⾃动在后台下载，然后⾃动解压。（下载到⼀半断⽹了，在运⾏会接着下载，⽽不会覆盖已经下载好的⽂件）

blast如何使⽤？

这⾥只演⽰blastx的使⽤⽅法。

刚才下载的nr库就是蛋⽩库，blastx就是⽤来将核酸序列⽐对到蛋⽩库上的。（nt就是核酸库）

因为我们下载的是已经建好索引的数据库，所以省去了makeblastdb的过程。

常见的命令有下⾯⼏个：

大花蕙兰

-query <File_In> 要查询的核酸序列

-db <String> 数据库名字

-out <File_Out> 输出⽂件

-evalue <Real> evalue阈值

-outfmt <String> 输出的格式

blast构建索引 | makeblastdb

makeblastdb -in mature.fa -input_type fasta -dbtype nucl -title miRBa -par_qids -out miRBa -logfile File_Name

-in 后接输⼊⽂件，你要格式化的fasta序列

-dbtype 后接序列类型，nucl为核酸，prot为蛋⽩

-title 给数据库起个名，好看~~(不能⽤在后⾯搜索时-db的参数)

-par_qids 推荐加上，现在有啥原因还没搞清楚

-out 后接数据库名，⾃⼰起⼀个有意义的名字，以后blast+搜索时要⽤到的-db的参数

-logfile ⽇志⽂件，如果没有默认输出到屏幕

资源消耗

blastx -d.transcript.fasta -db nr -out test.blastx.out

其中fasta⽂件只有19938⾏。

可是运⾏起来耗费了很多资源：

平均内存消耗：51.45G；峰值：115.37G

cpu：1个

运⾏时间：06:00:24（你敢信？这才是⼀个⼩⼩的test）

所以我强烈推荐⽤diamond替代blast来做数据库搜索。

blast结果解读

每⼀个合格的序列⽐对都会给出⼀个这样的结果（⼀个query quence⽐对到多个就有多个结果）：

>AAB70410.1 Similar to Schizosaccharomyces CCAAT-binding factor (gb|U88525).

EST gb|T04310 comes from this gene [Arabidopsis thaliana]

Length=208

Score = 238 bits (607), Expect = 7e-76, Method: Compositional matrix adjust.

Identities = 116/145 (80%), Positives = 127/145 (88%), Gaps = 2/145 (1%)

Frame = +1

Query 253 FWASQYQEIEQTSDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR 432

FW +Q++EIE+T+DFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR

Sbjct 39 FWENQFKEIEKTTDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR 98

Query 433 SWNHTEENKRRTLQKNDIAAAITRNEIFDFLVDIVPREDLKDEVLASIPRGTLPMGAPTE 612一马当先的意思

SWNHTEENKRRTLQKNDIAAA+TR +IFDFLVDIVPREDL+DEVL SIPRGT+P A

Sbjct 99 SWNHTEENKRRTLQKNDIAAAVTRTDIFDFLVDIVPREDLRDEVLGSIPRGTVPEAA-AA 1

Query 613 GLPYYYMQPQHAPQVGAPGMFMGKP 687

G PY Y+ AP +G PGM MG P

Sbjct 158 GYPYGYLPAGTAP-IGNPGMVMGNP 181

结果解读⽹上很多，这⾥不啰嗦了。

以下是我在同样条件下测试的diamond：

平均内存消耗：11.01G；峰值：12.44G

cpu：1个（571.17%）也就是会⾃动占⽤5-6个cpu

运⾏时间：00:26:15

⽽且diamond注明了，它的优势是处理>1M 的query，量越⼤速度越快。

diamond的简单⽤法：

diamond makedb --in nr.fa -d nr

diamond blastx -d nr -d.transcript.fasta -o test.matches.m8

但是diamond使⽤有限制，只能⽤于⽐对蛋⽩数据库。

以下是OrfPredictor推荐的参数设置：

To minimize the file size of BLASTX output for loading, the following parameters are recommended if the BLASTX in the 'NCBI-blastall' package is ud: "-v 1 -b 1 -e 1e-5" (Note: we ud version 2.2.19 - earlier or later versions may not work pro 下⾯是详细的blastx帮助⽂档，以供查阅：

$ blastx -help

USAGE

blastx [-h] [-help] [-import_arch_strategy filename]

[-export_arch_strategy filename] [-task task_name] [-db databa_name]

[-dbsize num_letters] [-gilist filename] [-qidlist filename]

美国竹柳

[-negative_gilist filename] [-negative_qidlist filename]

[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]

[-db_hard_mask filtering_algorithm] [-subject subject_input_file]

[-subject_loc range] [-query input_file] [-out output_file]

[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]

[-gapextend extend_penalty] [-qcov_hsp_perc float_value]

[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]

[-xdrop_gap_final float_value] [-archsp int_value]

[-sum_stats bool_value] [-max_intron_length length] [-g SEG_options]

[-soft_masking soft_masking] [-matrix matrix_name]

[-threshold float_value] [-culling_limit int_value]

[-best_hit_overhang float_value] [-best_hit_score_edge float_value]

[-window_size int_value] [-ungapped] [-lca_masking] [-query_loc range]

[-strand strand] [-par_deflines] [-query_gencode int_value]

[-outfmt format] [-show_gis] [-num_descriptions int_value]

[-num_alignments int_value] [-line_length line_length] [-html]

[-max_target_qs num_quences] [-num_threads int_value] [-remote]

[-comp_bad_stats compo] [-u_sw_tback] [-version]

DESCRIPTION

Translated Query-Protein Subject BLAST 2.7.1+

OPTIONAL ARGUMENTS

-h

Print USAGE and DESCRIPTION; ignore all other parameters

-help

Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

-version

Print version number; ignore other arguments

*** Input query options

-query <File_In>

Input file name

Default = `-'

-query_loc <String>

Location on the query quence in1-bad offts (Format: start-stop)

strand <String, `both', `minus', `plus'>

任贤使能

Query strand(s) to arch against databa/subject

Default = `both'

-query_gencode <Integer, values between: 1-6, 9-16, 21-25>

Genetic code to u to translate query (e ur manual for details)

Default = `1'

*** General arch options

-task <String, Permissible values: 'blastx''blastx-fast' >

Task to execute

Default = `blastx'

-db <String>

BLAST databa name

* Incompatible with: subject, subject_loc

-out <File_Out>

Output file name

Default = `-'

-evalue <Real>

Expectation value (E) threshold for saving hits

Default = `10'

-word_size <Integer, >=2>

Word size for wordfinder algorithm

-gapopen <Integer>

Cost to open a gap

-gapextend <Integer>

Cost to extend a gap

-max_intron_length <Integer, >=0>

Length of the largest intron allowed in a translated nucleotide quence

when linking multiple distinct alignments

Default = `0'

-matrix <String>

Scoring matrix name (normally BLOSUM62)

-threshold <Real, >=0>

Minimum word score such that the word is added to the BLAST lookup table -comp_bad_stats <String>

U composition-bad statistics:

D or d: default (equivalent to 2 )

0 or F or f: No composition-bad statistics

1: Composition-bad statistics as in NAR 29:2994-3005, 2001

2 or T or t : Composition-bad score adjustment as in Bioinformatics 21:902-911,

2005, conditioned on quence properties

3: Composition-bad score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally

Default = `2'

*** BLAST-2-Sequences options

-subject <File_In>

Subject quence(s) to arch

* Incompatible with: db, gilist, qidlist, negative_gilist,

negative_qidlist, db_soft_mask, db_hard_mask

-subject_loc <String>

Location on the subject quence in1-bad offts (Format: start-stop)

* Incompatible with: db, gilist, qidlist, negative_gilist,

negative_qidlist, db_soft_mask, db_hard_mask, remote

*** Formatting options

-outfmt <String>

alignment view options:

0 = Pairwi,

1 = Query-anchored showing identities,

2 = Query-anchored no identities,

3 = Flat query-anchored showing identities,

4 = Flat query-anchored no identities,

5 = BLAST XML,

6 = Tabular,

7 = Tabular with comment lines,

8 = Seqalign (Text ASN.1),

9 = Seqalign (Binary ASN.1),

10 = Comma-parated values,

11 = BLAST archive (ASN.1),

12 = Seqalign (JSON),

13 = Multiple-file BLAST JSON,

14 = Multiple-file BLAST XML2,

15 = Single-file BLAST JSON,

通过16 = Single-file BLAST XML2,

18 = Organism Report

Options 6, 7 and 10 can be additionally configured to produce

a custom format specified by space delimited format specifiers.

The supported format specifiers are:

qqid means Query Seq-id

qgi means Query GI

qacc means Query accesion

qaccver means Query accesion.version

qlen means Query quence length

sqid means Subject Seq-id

sallqid means All subject Seq-id(s), parated by a ';'

sgi means Subject GI

sallgi means All subject GIs

sacc means Subject accession

saccver means Subject accession.version

sallacc means All subject accessions

slen means Subject quence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

nd means End of alignment in subject

qq means Aligned part of query quence

sq means Aligned part of subject quence

evalue means Expect value

bitscore means Bit score

score means Raw score

length means Alignment length

pident means Percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive-scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive-scoring matches

frames means Query and subject frames parated by a '/'

qframe means Query frame

sframe means Subject frame

btop means Blast traceback operations (BTOP)

staxid means Subject Taxonomy ID

ssciname means Subject Scientific Name

scomname means Subject Common Name

sblastname means Subject Blast Name

sskingdom means Subject Super Kingdom

staxids means unique Subject Taxonomy ID(s), parated by a ';'

(in numerical order)

sscinames means unique Subject Scientific Name(s), parated by a ';' scomnames means unique Subject Common Name(s), parated by a ';' sblastnames means unique Subject Blast Name(s), parated by a ';'

(in alphabetical order)

sskingdoms means unique Subject Super Kingdom(s), parated by a ';' (in alphabetical order)

stitle means Subject Title

salltitles means All Subject Title(s), parated by a '<>'

sstrand means Subject Strand

qcovs means Query Coverage Per Subject

qcovhsp means Query Coverage Per HSP

qcovus means Query Coverage Per Unique Subject (blastn only)

When not provided, the default value is:

'qaccver saccver pident length mismatch gapopen qstart qend sstart nd evalue bitscore', which is equivalent to the keyword 'std'

Default = `0'

-show_gis

Show NCBI GIs in deflines?

-num_descriptions <Integer, >=0>

Number of databa quences to show one-line descriptions for

Not applicable for outfmt > 4

Default = `500'

* Incompatible with: max_target_qs

-num_alignments <Integer, >=0>

Number of databa quences to show alignments for

Default = `250'

* Incompatible with: max_target_qs梦到自己大哭

-line_length <Integer, >=1>

Line length for formatting alignments

Not applicable for outfmt > 4

Default = `60'

-html

Produce HTML output?

*** Query filtering options

-g <String>

Filter query quence with SEG (Format: 'yes', 'window locut hicut', or

'no' to disable)

Default = `122.22.5'

-soft_masking <Boolean>

Apply filtering locations as soft masks

Default = `fal'

lca_masking

U lower ca filtering in query and subject quence(s)?

*** Restrict arch or results

-gilist <String>

Restrict arch of databa to list of GI's

* Incompatible with: negative_gilist, qidlist, negative_qidlist,

remote, subject, subject_loc

-qidlist <String>

Restrict arch of databa to list of SeqId's

* Incompatible with: gilist, negative_gilist, negative_qidlist, remote,

subject, subject_loc

negative_gilist <String>

Restrict arch of databa to everything except the listed GIs

* Incompatible with: gilist, qidlist, remote, subject, subject_loc

-negative_qidlist <String>

Restrict arch of databa to everything except the listed SeqIDs

* Incompatible with: gilist, qidlist, remote, subject, subject_loc

-entrez_query <String>

Restrict arch with the given Entrez query

* Requires: remote

-db_soft_mask <String>

Filtering algorithm ID to apply to the BLAST databa as soft masking

* Incompatible with: db_hard_mask, subject, subject_loc

-db_hard_mask <String>

Filtering algorithm ID to apply to the BLAST databa as hard masking

* Incompatible with: db_soft_mask, subject, subject_loc

-qcov_hsp_perc <Real, 0..100>

Percent query coverage per hsp

-max_hsps <Integer, >=1>

Set maximum number of HSPs per subject quence to save for each query

-culling_limit <Integer, >=0>

If the query range of a hit is enveloped by that of at least this many

higher-scoring hits, delete the hit

* Incompatible with: best_hit_overhang, best_hit_score_edge

-best_hit_overhang <Real, (>0 and <0.5)>

Best Hit algorithm overhang value (recommended value: 0.1)

* Incompatible with: culling_limit

-best_hit_score_edge <Real, (>0 and <0.5)>

Best Hit algorithm score edge value (recommended value: 0.1)

* Incompatible with: culling_limit

-max_target_qs <Integer, >=1>

Maximum number of aligned quences to keep

Not applicable for outfmt <= 4

Default = `500'

* Incompatible with: num_descriptions, num_alignments

*** Statistical options

-dbsize <Int8>

Effective length of the databa

-archsp <Int8, >=0>

Effective length of the arch space

-sum_stats <Boolean>

U sum statistics

*** Search strategy options

-import_arch_strategy <File_In>

Search strategy to u

* Incompatible with: export_arch_strategy

-export_arch_strategy <File_Out>

File name to record the arch strategy ud

* Incompatible with: import_arch_strategy

*** Extension options

-xdrop_ungap <Real>

X-dropoff value (in bits) for ungapped extensions

-xdrop_gap <Real>

X-dropoff value (in bits) for preliminary gapped extensions

-xdrop_gap_final <Real>

X-dropoff value (in bits) for final gapped alignment

window_size <Integer, >=0>

Multiple hits window size, u 0 to specify 1-hit algorithm

-ungapped

Perform ungapped alignment only?

*** Miscellaneous options

-par_deflines

Should the query and subject defline(s) be pard?

-num_threads <Integer, (>=1 and =<24)>

Number of threads (CPUs) to u in the BLAST arch

Default = `1'

* Incompatible with: remote

remote

Execute arch remotely?

* Incompatible with: gilist, qidlist, negative_gilist,

negative_qidlist, subject_loc, num_threads

-u_sw_tback

Compute locally optimal Smith-Waterman alignments?

以下是copy的详细英⽂教学：

1. Quick Start

Get all numbered files for a databa with the same ba name: Each of the files reprents a subt (volume) of that databa, and all of them are needed to reconstitute the databa.

After extraction, there is no need to concatenate the resulting files:Call the databa with the ba name, for nr databa files, u "-db nr". 这些数据库是已经预先进⾏过makeblastdb命令的，下载后

可以直接使⽤

For easy download, u the update_blastdb.pl script from the blast+ package.

Incremental update is not available.

2. General Introduction

The pre-formatted databas offer the following advantages:

Pre-formatting removes the need to run makeblastdb; ⽆需再运⾏建库命令⾏

Species-level taxonomy ids are included for each databa entry;

Databas are broken into smaller-sized volumes and are therefore easier to download;

Sequences in FASTA format can be generated from the pre-formatted databas by using the blastdbcmd utility；可以从这些数据库⽂件中导出FASTA⽂件

A convenient script (update_blastdb.pl) is available in the blast+ package to download the pre-formatted databas. 可⽤该脚本升级数据库

Pre-formatted databas must be downloaded using the update_blastdb.pl script or via FTP in binary mode. Documentation for this script can be obtained by running the script

without any arguments; Perl installation is required.

The compresd files downloaded must be inflated with gzip or other decompress tools. The BLAST databa files can then be extracted out of the resulting tar file using the tar

utility on Unix/Linux, or WinZip and StuffIt Expander on

Windows and Macintosh platforms, respectively. 下载的数据库为压缩包，要解压缩

Large databas are formatted in multiple one-gigabyte volumes, which are named using the baname.##. convention. All volumes with the same ba name are

required. An alias file is provided to tie individual volumes together so that the databa can be called using the ba name (without the .nal or .pal extension). For example, to

call the est databa, simply u "-db est" option in the command line (without the quotes). ⼤的数据库通常分为多个压缩包，例如nr库有11个压缩包。所有的相关压缩包都要下

载，解压。解压缩会⽣成对应的库⽂件，同时⽣成⼀个nr.pal⽂件。检索nr库时输⼊-d nr 即可。

Additional BLAST databas that are not provided in pre-formatted formats may be available in the FASTA subdirectory. For other genomic BLAST databas, plea check

the genomes ftp directory at:

3. Contents of the /blast/db/ directory

The pre-formatted BLAST databas are archived in this directory. The names of the databas and their contents are listed below.

+-----------------------------+------------------------------------------------+

File Name # Content Description

+-----------------------------+------------------------------------------------+

# Bacterial and Archaeal 16S rRNA quences from BioProjects 33175 and 33117

FASTA/ # Subdirectory for FASTA formatted quences

README # README for this subdirectory (this file)

Reprentative_Genomes.* # Reprentative bacterial/archaeal genomes databa

cdd_ # Conrved Domain Databa quences for u with stand alone deltablast

cloud/ # Subdirectory of databas for BLAST AMI; e v/TJAnEt

env_nr.* # Protein quences for metagenomes

env_nt.* # Nucleotide quences for metagenomes

# This file requires est_human.*., est_mou.*., and est_others.*. files to function. It contains the est.nal alias so that arches against est (-db est) will include est_human, est_mou and est_others. est_human.*. # Human subt of the est databa from the est division of GenBank, EMBL and DDBJ.

est_mou.*. # Mou subt of the est databasae

观念交付

est_others.*. # Non-human and non-mou subt of the est databa

gss.* # Sequences from the GSS division of GenBank, EMBL, and DDBJ

htgs.* # Sequences from the HTG division of GenBank, EMBL,and DDBJ

human_genomic.* # Human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs

nr.* # Non-redundant protein quences from GenPept, Swissprot, PIR, PDF, PDB, and NCBI RefSeq

nt.* # Partially non-redundant nucleotide quences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS.

other_genomic.* # RefSeq chromosome records (NC_) for non-human organisms

pataa.* # Patent protein quences

patnt.* # Patent nucleotide quences. Both patent databas are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ

pdbaa.* # Sequences for the protein structure from the Protein Data Bank

pdbnt.* # Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding quences for the corresponding pdbaa entries.

refq_genomic.* # NCBI genomic reference quences

refq_protein.* # NCBI protein reference quences

refq_rna.* # NCBI Transcript reference quences

sts.* # Sequences from the STS division of GenBank, EMBL,and DDBJ

# Swiss-Prot quence databa (last major update)

# Additional taxonomy information for the databas listed here providing common and scientific names

tsa_nt.* # Sequences from the TSA division of GenBank, EMBL,and DDBJ

# Vector quences from 2010, e Note 2 in ction 4.

wgs.* # Sequences from Whole Genome Shotgun asmblies

+-----------------------------+------------------------------------------------+

4. Contents of the /blast/db/FASTA directory

This directory contains FASTA formatted quence files. The file names and databa contents are listed below. The files must be unpacked and procesd through

blastdbcmd before they can be ud by the BLAST programs.

+-----------------------+-----------------------------------------------------+

File Name # Content Description #

+-----------------------+-----------------------------------------------------+

# translation of alu.n repeats

< # alu repeat elements (from 2003)

# CDS translations

< # genomic quences for drosophila (from 2003)

* # Protein quences for metagenomes, taxid 408169

* # Nucleotide quences for metagenomes, taxid 408169

* # human subt of the est databa (e Note 1)

* # mou subt of the est databa

* # non-human and non-mou subt of the est databa

<* # quences from the GSS division of GenBank, EMBL, and DDBJ

<* # quences from the HTG division of GenBank, EMBL, and DDBJ

* # human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs

< # human and mou immunoglobulin variable region nucleotide quences

< # human and mou immunoglobulin variable region protein quences

# CDS translations of complete mitochondrial genomes

< # complete mitochondrial genomes

<* # non-redundant protein quence databa with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq

<* # nucleotide quence databa, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant.

* # RefSeq chromosome records (NC_) for organisms other than human

<* # patent protein quences

<* # patent nucleotide quences. Both patent quence files are from the USPTO, or EPO/JPO via EMBL/DDBJ

<* # protein quences from pdb protein structures

<* # nucleotide quences from pdb nucleic acid structures. They are NOT the protein coding quences for the corresponding pdbaa entries.

<* # databa for quence tag site entries

<* # swiss-prot databa (last major relea)

< # vector quences from 2010. (See Note 2)

<* # whole genome shotgun genome asmblies

# protein translations

< # yeast genomes (from 2003)

+-----------------------+---------------------------------------------------+

NOTE:

(1) NCBI does not provide the complete est databa in FASTA format. One needs to get all three subts (est_human, est_mou, and est_others and concatenate them into

the complete est fasta databa).

(2) For screening for vector contamination, u the UniVec databa:

本文发布于:2023-05-16 07:28:46，感谢您对本站的认可！

本文链接：https://www.wtabcd.cn/fanwen/fan/89/902806.html

上一篇：住房和城乡建设局个人工作总结九篇

下一篇：销售代表个人总结【三篇】