Documentation for structure software: Version 2.3
Jonathan K. Pritchard a
Xiaoquan Wen a
Daniel Falush b,1,2,3
a Department of Human Genetics
University of Chicago
b Department of Statistics
University of Oxford
Software from
pritch.bsd.uchicago.edu/structure.html
April 21, 2009
1 Our other colleagues in the structure project are Peter Donnelly, Matthew Stephens and Melissa Hubisz.
2 The first version of this program was developed while the authors (JP, MS, PD) were in the Department of Statistics, University of Oxford.
3 Discussion and questions about structure should be addressed to the online forum. Please check this document and search the previous discussion before posting questions.
Contents
1 Introduction
1.1 Overview
1.2 What’s new in Version 2.3?
2 Format for the data file
2.1 Components of the data file
2.2 Rows
2.3 Individual/genotype data
2.4 Missing genotype data
2.5 Formatting errors
3 Modelling decisions for the user
3.1 Ancestry Models
3.2 Allele frequency models
3.3 How long to run the program
4 Missing data, null alleles and dominant markers
4.1 Dominant markers, null alleles, and polyploid genotypes
5 Estimation of K (the number of populations)
5.1 Steps in estimating K
5.2 Mild departures from the model can lead to overestimating K
5.3 Informal pointers for choosing K; is the structure real?
5.4 Isolation by distance data
6 Background LD and other miscellanea
6.1 Sequence data, tightly linked SNPs and haplotype data
6.2 Multimodality
6.3 Estimating admixture proportions when most individuals are admixed
7 Running structure from the command line
7.1 Program parameters
7.2 Parameters in file mainparams
7.3 Parameters in file extraparams
7.4 Command-line changes to parameter values
8 Front End
8.1 Download and installation
8.2 Overview
8.3 Building a project
8.4 Configuring a parameter set
8.5 Running simulations
8.6 Batch runs
8.7 Exporting parameter files from the front end
8.8 Importing results from the command-line program
8.9 Analyzing the results
9 Interpreting the text output
9.1 Output to screen during run
9.2 Printout of Q
9.3 Printout of Q when using prior population information
9.4 Printout of allele-frequency divergence
9.5 Printout of estimated allele frequencies (P)
9.6 Site by site output for linkage model
10 Other resources for use with structure
10.1 Plotting structure results
10.2 Importing bacterial MLST data into structure format
11 How to cite this program
12 Bibliography
1 Introduction
The program structure implements a model-based clustering method for inferring population structure using genotype data consisting of unlinked markers. The method was introduced in a paper by Pritchard, Stephens and Donnelly (2000a) and extended in sequels by Falush, Stephens and Pritchard (2003a, 2007). Applications of our method include demonstrating the presence of population structure, identifying distinct genetic populations, assigning individuals to populations, and identifying migrants and admixed individuals.
Briefly, we assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. It is assumed that within populations, the loci are at Hardy-Weinberg equilibrium, and linkage equilibrium. Loosely speaking, individuals are assigned to populations in such a way as to achieve this.
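The idea of probabilistic assignment can be illustrated with a minimal Python sketch. It assumes the simplest case (no admixture, allele frequencies already known, uniform prior over populations), whereas structure itself estimates frequencies and assignments jointly. The allele frequencies, locus names and function names below are hypothetical and are not part of the program.

```python
import math

# Hypothetical allele frequencies for K = 2 populations:
# freqs[k][locus][allele] = frequency of that allele in population k.
freqs = [
    {"loc_a": {106: 0.8, 110: 0.2}, "loc_b": {142: 0.7, 145: 0.3}},
    {"loc_a": {106: 0.1, 110: 0.9}, "loc_b": {142: 0.2, 145: 0.8}},
]


def log_likelihood(genotype, pop_freqs):
    """Log probability of a genotype given one population.

    Under Hardy-Weinberg and linkage equilibrium, every allele copy is
    treated as an independent draw from the population's frequencies.
    Constant factors (such as 2 for heterozygotes) are the same for all
    populations and cancel when the posterior is normalised.
    """
    total = 0.0
    for locus, alleles in genotype.items():
        for allele in alleles:
            total += math.log(pop_freqs[locus][allele])
    return total


def posterior(genotype, all_freqs):
    """Posterior membership probabilities under a uniform prior."""
    logs = [log_likelihood(genotype, f) for f in all_freqs]
    top = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logs]
    norm = sum(weights)
    return [w / norm for w in weights]


# One diploid individual: two allele copies at each of two loci.
individual = {"loc_a": (106, 106), "loc_b": (142, 145)}
probs = posterior(individual, freqs)
print(probs)
```

In this toy case the individual carries alleles that are common in the first population, so nearly all of the posterior weight falls on it.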
Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers including microsatellites, SNPs and RFLPs. The model assumes that markers are not in linkage disequilibrium (LD) within subpopulations, so we can’t handle markers that are extremely close together. Starting with version 2.0, we can now deal with weakly linked markers.
While the computational approaches implemented here are fairly powerful, some care is needed in running the program in order to ensure sensible answers. For example, it is not possible to determine suitable run-lengths theoretically, and this requires some experimentation on the part of the user. This document describes the use and interpretation of the software and supplements the published papers, which provide more formal descriptions and evaluations of the methods.
1.1 Overview
The software package structure consists of several parts. The computational part of the program was written in C. We distribute source code as well as executables for various platforms (currently Mac, Windows, Linux, Sun). The C executable reads a data file supplied by the user. There is also a Java front end that provides various helpful features for the user including simple processing of the output. You can also invoke structure from the command line instead of using the front end.
This document includes information about how to format the data file, how to choose appropriate models, and how to interpret the results. It also has details on using the two interfaces (command line and front end) and a summary of the various user-defined parameters.
1.2 What’s new in Version 2.3?
The 2.3 release (April 2009) introduces new models for improving structure inference for data sets where (1) the data are not informative enough for the usual structure models to provide accurate inference, but (2) the sampling locations are correlated with population membership. In this situation, by making explicit use of sampling location information, we give structure a boost, often allowing much improved performance (Hubisz et al., 2009). We hope to release further improvements in the coming months.
             loc a  loc b  loc c  loc d  loc e
George    1    -9    145     66      0     92
George    1    -9     -9     64      0     94
Paula     1   106    142     68      1     92
Paula     1   106    148     64      0     94
Matthew   2   110    145     -9      0     92
Matthew   2   110    148     66      1     -9
Bob       2   108    142     64      1     94
Bob       2    -9    142     -9      0     94
Anja      1   112    142     -9      1     -9
Anja      1   114    142     66      1     94
Peter     1    -9    145     66      0     -9
Peter     1   110    145     -9      1     -9
Carsten   2   108    145     62      0     -9
Carsten   2   110    145     64      1     92
Table 1: Sample data file. Here MARKERNAMES=1, LABEL=1, POPDATA=1, NUMINDS=7, NUMLOCI=5, and MISSING=-9. Also, POPFLAG=0, LOCDATA=0, PHENOTYPE=0, EXTRACOLS=0. The second column shows the geographic sampling location of individuals. We can also store the data with one row per individual (ONEROWPERIND=1), in which case the first row would read “George 1 -9 -9 145 -9 66 64 0 0 92 94”.
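The parameter values in the caption of Table 1 imply a specific file shape: one marker-name row plus 7 x 2 data rows, each with 2 + 5 columns. The Python sketch below checks a file against such settings before a run. The function check_shape, the use of underscores in the marker names, and the PLOIDY entry are assumptions made for this illustration; the checker is not part of structure.

```python
# Sample data from Table 1 (marker names written with underscores here,
# an assumption, so that each name is a single whitespace-free token).
SAMPLE = """\
loc_a loc_b loc_c loc_d loc_e
George 1 -9 145 66 0 92
George 1 -9 -9 64 0 94
Paula 1 106 142 68 1 92
Paula 1 106 148 64 0 94
Matthew 2 110 145 -9 0 92
Matthew 2 110 148 66 1 -9
Bob 2 108 142 64 1 94
Bob 2 -9 142 -9 0 94
Anja 1 112 142 -9 1 -9
Anja 1 114 142 66 1 94
Peter 1 -9 145 66 0 -9
Peter 1 110 145 -9 1 -9
Carsten 2 108 145 62 0 -9
Carsten 2 110 145 64 1 92
"""

# Settings from the Table 1 caption; PLOIDY = 2 is assumed (diploid data).
PARAMS = {"MARKERNAMES": 1, "LABEL": 1, "POPDATA": 1, "NUMINDS": 7,
          "NUMLOCI": 5, "ONEROWPERIND": 0, "PLOIDY": 2}


def check_shape(text, params):
    """Return a list of shape problems (an empty list means consistent)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    problems = []
    if params["MARKERNAMES"]:
        header, rows = rows[0], rows[1:]
        if len(header) != params["NUMLOCI"]:
            problems.append(f"marker row has {len(header)} names")
    # Rows per individual and genotype columns per locus depend on layout.
    rows_per_ind = 1 if params["ONEROWPERIND"] else params["PLOIDY"]
    cols_per_locus = params["PLOIDY"] if params["ONEROWPERIND"] else 1
    expected_rows = params["NUMINDS"] * rows_per_ind
    expected_cols = (params["LABEL"] + params["POPDATA"]
                     + params["NUMLOCI"] * cols_per_locus)
    if len(rows) != expected_rows:
        problems.append(f"found {len(rows)} data rows, expected {expected_rows}")
    for number, row in enumerate(rows, start=1):
        if len(row) != expected_cols:
            problems.append(f"data row {number} has {len(row)} columns, "
                            f"expected {expected_cols}")
    return problems


print(check_shape(SAMPLE, PARAMS))
```

A mismatch between the declared and actual number of loci or individuals shows up as one or more messages in the returned list.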
2 Format for the data file
The format for the genotype data is shown in Table 2 (and Table 1 shows an example). Essentially, the entire data set is arranged as a matrix in a single file, in which the data for individuals are in rows, and the loci are in columns. The user can make several choices about format, and most of the data (apart from the genotypes!) are optional.
For a diploid organism, data for each individual can be stored either as 2 consecutive rows, where each locus is in one column, or in one row, where each locus is in two consecutive columns. Unless you plan to use the linkage model (see below) the order of the alleles for a single individual does not matter. The pre-genotype data columns (see below) are recorded twice for each individual. (More generally, for n-ploid organisms, data for each individual are stored in n consecutive rows unless the ONEROWPERIND option is used.)
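The relationship between the two layouts can be sketched in Python. The helper to_one_row is hypothetical (it is not shipped with structure) and assumes that the pre-genotype columns are identical across the rows of one individual.

```python
def to_one_row(rows, n_pre=2, ploidy=2):
    """Merge `ploidy` consecutive rows per individual into one row.

    rows   : list of token lists, one per data row (no marker-name row)
    n_pre  : number of pre-genotype columns (here: label and population)
    ploidy : number of consecutive rows that belong to one individual
    """
    if len(rows) % ploidy != 0:
        raise ValueError("row count is not a multiple of the ploidy")
    merged = []
    for start in range(0, len(rows), ploidy):
        group = rows[start:start + ploidy]
        pre = group[0][:n_pre]
        if any(row[:n_pre] != pre for row in group):
            raise ValueError(f"pre-genotype columns differ near row {start + 1}")
        out = list(pre)
        # Walk locus by locus so the allele copies of each locus end up
        # in consecutive columns.
        for alleles in zip(*(row[n_pre:] for row in group)):
            out.extend(alleles)
        merged.append(out)
    return merged


two_row = [
    "George 1 -9 145 66 0 92".split(),
    "George 1 -9 -9 64 0 94".split(),
    "Paula 1 106 142 68 1 92".split(),
    "Paula 1 106 148 64 0 94".split(),
]
one_row = to_one_row(two_row)
for row in one_row:
    print(" ".join(row))
```

For the first individual of Table 1 this reproduces the one-row form quoted in the table caption.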
2.1 Components of the data file:
The elements of the input file are as listed below. If present, they must be in the following order; however, most are optional (as indicated) and may be deleted completely. The user specifies which data are present, either in the front end, or (when running structure from the command line), in a separate file, mainparams. At the same time, the user also specifies the number of individuals and the number of loci.
2.2 Rows
1. Marker Names (Optional; string) The first row in the file can contain a list of identifiers for each of the markers in the data set. This row contains L strings of integers or characters, where L is the number of loci.
2. Recessive Alleles (Data with dominant markers only; integer) Data sets of SNPs or microsatellites would generally not include this line. However, if the option RECESSIVEALLELES is set to 1, then the program requires this row to indicate which allele (if any) is recessive at each marker. See Section 4.1 for more information. The option is used for data such as AFLPs and for polyploids where genotypes may be ambiguous.
3. Inter-Marker Distances (Optional; real) The next row in the file is a set of inter-marker distances, for use with linked loci. These should be genetic distances (e.g., centiMorgans), or some proxy for this based, for example, on physical distances. The actual units of distance do not matter too much, provided that the marker distances are (roughly) proportional to recombination rate. The front end estimates an appropriate scaling from the data, but users of the command line version must set LOG10RMIN, LOG10RMAX and LOG10RSTART in the file extraparams.
The markers must be in map order within linkage groups. When consecutive markers are from different linkage groups (e.g., different chromosomes), this should be indicated by the value -1. The first marker is also assigned the value -1. All other distances are non-negative.
This row contains L real numbers.
4. Phase Information (Optional; diploid data only; real number in the range [0,1]). This is for use with the linkage model only. This is a single row of L probabilities that appears after the genotype data for each individual. If phase is known completely, or no phase information is available, these rows are unnecessary. They may be useful when there is partial phase information from family data or when haploid X chromosome data from males and diploid autosomal data are input together. There are two alternative representations for the phase information: (1) the two rows of data for an individual are assumed to correspond to the paternal and maternal contributions, respectively. The phase line indicates the probability that the ordering is correct at the current marker (set MARKOVPHASE=0); (2) the phase line indicates the probability that the phase of one allele relative to the previous allele is correct (set MARKOVPHASE=1). The first entry should be filled in with 0.5 to fill out the line to L entries. For example, the following data input would represent the information from a male with 5 unphased autosomal microsatellite loci followed by three X chromosome loci, using the maternal/paternal phase model:
102 156 165 101 143 105 104 101
100 148 163 101 143  -9  -9  -9
0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0
where -9 indicates “missing data”, here missing due to the absence of a second X chromosome, the 0.5 indicates that the autosomal loci are unphased, and the 1.0s indicate that the X chromosome loci have been maternally inherited with probability 1.0, and hence are phased. The same information can be represented with the markovphase model. In this case the input file would read: