structure user guide

Documentation for structure software: Version 2.3

Jonathan K. Pritchard a

Xiaoquan Wen a

Daniel Falush b, 1, 2, 3

a Department of Human Genetics

University of Chicago

b Department of Statistics

University of Oxford

Software from

pritch.bsd.uchicago.edu/structure.html

April 21, 2009

1 Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa Hubisz.

2 The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of Oxford.

3 Discussion and questions about structure should be addressed to the online forum. Please check this document and search the previous discussion before posting questions.

Contents

1 Introduction
1.1 Overview
1.2 What’s new in Version 2.3?
2 Format for the data file
2.1 Components of the data file
2.2 Rows
2.3 Individual/genotype data
2.4 Missing genotype data
2.5 Formatting errors
3 Modelling decisions for the user
3.1 Ancestry Models
3.2 Allele frequency models
3.3 How long to run the program
4 Missing data, null alleles and dominant markers
4.1 Dominant markers, null alleles, and polyploid genotypes
5 Estimation of K (the number of populations)
5.1 Steps in estimating K
5.2 Mild departures from the model can lead to overestimating K
5.3 Informal pointers for choosing K; is the structure real?
5.4 Isolation by distance data
6 Background LD and other miscellania
6.1 Sequence data, tightly linked SNPs and haplotype data
6.2 Multimodality
6.3 Estimating admixture proportions when most individuals are admixed
7 Running structure from the command line
7.1 Program parameters
7.2 Parameters in file mainparams
7.3 Parameters in file extraparams
7.4 Command-line changes to parameter values
8 Front End
8.1 Download and installation
8.2 Overview
8.3 Building a project
8.4 Configuring a parameter set
8.5 Running simulations
8.6 Batch runs
8.7 Exporting parameter files from the front end
8.8 Importing results from the command-line program
8.9 Analyzing the results
9 Interpreting the text output
9.1 Output to screen during run
9.2 Printout of Q
9.3 Printout of Q when using prior population information
9.4 Printout of allele-frequency divergence
9.5 Printout of estimated allele frequencies (P)
9.6 Site by site output for linkage model
10 Other resources for use with structure
10.1 Plotting structure results
10.2 Importing bacterial MLST data into structure format
11 How to cite this program
12 Bibliography

1 Introduction

The program structure implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. The method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, and identifying migrants and admixed individuals.

Briefly, we assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. It is assumed that within populations, the loci are at Hardy-Weinberg equilibrium and linkage equilibrium. Loosely speaking, individuals are assigned to populations in such a way as to achieve this.
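To make the idea of probabilistic assignment concrete, here is a minimal sketch. It is not the algorithm the program runs: it assumes the allele frequencies of every population are already known, that there is no admixture, and that all populations are equally likely a priori. Under Hardy-Weinberg and linkage equilibrium, the probability of a genotype in a population is then a product of allele frequencies over allele copies and loci. The function name, the data layout and the frequencies are illustrative assumptions.

```python
import math

def assignment_probabilities(genotype, freqs):
    """Posterior probability that one individual comes from each population.

    genotype: list of (allele1, allele2) pairs, one pair per locus.
    freqs: freqs[k][l] maps allele -> frequency in population k at locus l.
    Assumes known, strictly positive frequencies, no admixture, and a
    uniform prior over populations (all simplifications for illustration).
    """
    log_liks = []
    for pop in freqs:
        ll = 0.0
        for locus, alleles in enumerate(genotype):
            for allele in alleles:
                # HWE and linkage equilibrium: allele copies are independent
                # draws. The factor 2 for heterozygotes is the same in every
                # population, so it cancels when normalising.
                ll += math.log(pop[locus][allele])
        log_liks.append(ll)
    # Normalise in log space for numerical stability.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]

# Two hypothetical populations, two loci, alleles coded as integers.
freqs = [
    [{1: 0.9, 2: 0.1}, {1: 0.8, 2: 0.2}],  # population 0
    [{1: 0.1, 2: 0.9}, {1: 0.2, 2: 0.8}],  # population 1
]
probs = assignment_probabilities([(1, 1), (1, 2)], freqs)
```

In this toy case the individual is assigned almost entirely to population 0, because its alleles are common there and rare in population 1. In practice the frequencies are unknown and are inferred jointly with the assignments.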

Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers including microsatellites, SNPs and RFLPs. The model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can’t handle markers that are extremely close together. Starting with version 2.0, we can now deal with weakly linked markers.

While the computational approaches implemented here are fairly powerful, some care is needed in running the program in order to ensure sensible answers. For example, it is not possible to determine suitable run-lengths theoretically, and this requires some experimentation on the part of the user. This document describes the use and interpretation of the software and supplements the published papers, which provide more formal descriptions and evaluations of the methods.

1.1 Overview

The software package structure consists of several parts. The computational part of the program was written in C. We distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data file supplied by the user. There is also a Java front end that provides various helpful features for the user including simple processing of the output. You can also invoke structure from the command line instead of using the front end.

This document includes information about how to format the data file, how to choose appropriate models, and how to interpret the results. It also has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.

1.2 What’s new in Version 2.3?

The 2.3 release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative enough for the usual structure models to provide accurate inference, but (2) the sampling locations are correlated with population membership. In this situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009). We hope to release further improvements in the coming months.

            loc_a  loc_b  loc_c  loc_d  loc_e
George   1     -9    145     66      0     92
George   1     -9     -9     64      0     94
Paula    1    106    142     68      1     92
Paula    1    106    148     64      0     94
Matthew  2    110    145     -9      0     92
Matthew  2    110    148     66      1     -9
Bob      2    108    142     64      1     94
Bob      2     -9    142     -9      0     94
Anja     1    112    142     -9      1     -9
Anja     1    114    142     66      1     94
Peter    1     -9    145     66      0     -9
Peter    1    110    145     -9      1     -9
Carsten  2    108    145     62      0     -9
Carsten  2    110    145     64      1     92

Table 1: Sample data file. Here MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9. Also, POPFLAG=0, LOCDATA=0, PHENOTYPE=0, EXTRACOLS=0. The second column shows the geographic sampling location of individuals. We can also store the data with one row per individual (ONEROWPERIND=1), in which case the first row would read "George 1 -9 -9 145 -9 66 64 0 0 92 94".
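The relationship between the two layouts can be sketched in a few lines of Python. This is an illustration only, not a tool distributed with structure; the function name and the assumption of two pre-genotype columns (label and population, as in Table 1) are ours. It interleaves the alleles of consecutive rows so that each locus occupies consecutive columns.

```python
def to_one_row_per_individual(lines, n_pre=2, ploidy=2):
    """Interleave the alleles of consecutive rows into one row per individual.

    lines: genotype rows (without the marker-name row), whitespace-separated.
    n_pre: number of pre-genotype columns (here: label and population).
    ploidy: number of consecutive rows that belong to one individual.
    """
    rows = [line.split() for line in lines if line.strip()]
    if len(rows) % ploidy != 0:
        raise ValueError("number of rows is not a multiple of the ploidy")
    out = []
    for i in range(0, len(rows), ploidy):
        group = rows[i:i + ploidy]
        pre = group[0][:n_pre]
        # The pre-genotype columns are repeated on every row of an individual.
        if any(r[:n_pre] != pre for r in group):
            raise ValueError(f"pre-genotype columns differ for {pre[0]}")
        merged = list(pre)
        # zip(*...) walks the loci; each step yields that locus's allele copies.
        for copies in zip(*(r[n_pre:] for r in group)):
            merged.extend(copies)
        out.append(" ".join(merged))
    return out

two_rows = [
    "George 1 -9 145 66 0 92",
    "George 1 -9 -9 64 0 94",
]
one_row = to_one_row_per_individual(two_rows)
# one_row == ["George 1 -9 -9 145 -9 66 64 0 0 92 94"], as in the caption
```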

2 Format for the data file

The format for the genotype data is shown in Table 2 (and Table 1 shows an example). Essentially, the entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, and the loci are in columns. The user can make several choices about format, and most of the data (apart from the genotypes!) are optional.

For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, where each locus is in two consecutive columns. Unless you plan to use the linkage model (see below), the order of the alleles for a single individual does not matter. The pre-genotype data columns (see below) are recorded twice for each individual. (More generally, for n-ploid organisms, data for each individual are stored in n consecutive rows unless the ONEROWPERIND option is used.)

2.1 Components of the data file:

The elements of the input file are as listed below. If present, they must be in the following order; however, most are optional (as indicated) and may be deleted completely. The user specifies which data are present, either in the front end or (when running structure from the command line) in a separate file, mainparams. At the same time, the user also specifies the number of individuals and the number of loci.
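As an illustration of how these choices are recorded for a command-line run, the sketch below formats the settings from the caption of Table 1 as `#define NAME value` lines. The line format and the helper name are assumptions made for this example; compare the result with the mainparams file shipped with the program before relying on it.

```python
def mainparams_lines(settings):
    """Format settings as '#define NAME value' lines (assumed file format)."""
    return [f"#define {name} {value}" for name, value in settings.items()]

# Values taken from the caption of Table 1. ONEROWPERIND is 0 because the
# table stores each individual on two consecutive rows.
table1_settings = {
    "NUMINDS": 7,
    "NUMLOCI": 5,
    "MISSING": -9,
    "ONEROWPERIND": 0,
    "LABEL": 1,
    "POPDATA": 1,
    "POPFLAG": 0,
    "LOCDATA": 0,
    "PHENOTYPE": 0,
    "EXTRACOLS": 0,
    "MARKERNAMES": 1,
}
text = "\n".join(mainparams_lines(table1_settings))
```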

2.2 Rows

1. Marker Names (Optional; string) The first row in the file can contain a list of identifiers for each of the markers in the data set. This row contains L strings of integers or characters, where L is the number of loci.

2. Recessive Alleles (Data with dominant markers only; integer) Data sets of SNPs or microsatellites would generally not include this line. However, if the option RECESSIVEALLELES is set to 1, then the program requires this row to indicate which allele (if any) is recessive at each marker. See Section 4.1 for more information. The option is used for data such as AFLPs and for polyploids where genotypes may be ambiguous.

3. Inter-Marker Distances (Optional; real) The next row in the file is a set of inter-marker distances, for use with linked loci. These should be genetic distances (e.g., centiMorgans), or some proxy for this based, for example, on physical distances. The actual units of distance do not matter too much, provided that the marker distances are (roughly) proportional to recombination rate. The front end estimates an appropriate scaling from the data, but users of the command line version must set LOG10RMIN, LOG10RMAX and LOG10RSTART in the file extraparams.

The markers must be in map order within linkage groups. When consecutive markers are from different linkage groups (e.g., different chromosomes), this should be indicated by the value -1. The first marker is also assigned the value -1. All other distances are non-negative.

This row contains L real numbers.

4. Phase Information (Optional; diploid data only; real number in the range [0,1]). This is for use with the linkage model only. This is a single row of L probabilities that appears after the genotype data for each individual. If phase is known completely, or no phase information is available, these rows are unnecessary. They may be useful when there is partial phase information from family data or when haploid X chromosome data from males and diploid autosomal data are input together. There are two alternative representations for the phase information: (1) the two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, respectively. The phase line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0); (2) the phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1). The first entry should be filled in with 0.5 to fill out the line to L entries. For example, the following data input would represent the information from a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model:

102 156 165 101 143 105 104 101

100 148 163 101 143  -9  -9  -9

0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0

where -9 indicates "missing data", here missing due to the absence of a second X chromosome, the 0.5 indicates that the autosomal loci are unphased, and the 1.0s indicate that the X chromosome loci have been maternally inherited with probability 1.0, and hence are phased. The same information can be represented with the markovphase model. In this case the input file would read:
