Palaeontologia Electronica
palaeo-electronica
PAST: PALEONTOLOGICAL STATISTICS SOFTWARE PACKAGE FOR
EDUCATION AND DATA ANALYSIS
Øyvind Hammer, David A.T. Harper, and Paul D. Ryan
Øyvind Hammer. Paleontological Museum, University of Oslo, Sars gate 1, 0562 Oslo, Norway
David A. T. Harper. Geological Museum, Øster Voldgade 5-7, University of Copenhagen, DK-1350 Copenhagen K, Denmark
Paul D. Ryan. Department of Geology, National University of Ireland, Galway, Ireland
ABSTRACT
A comprehensive but simple-to-use software package for executing a range of standard numerical analyses and operations used in quantitative paleontology has been developed. The program, called PAST (PAleontological STatistics), runs on standard Windows computers and is available free of charge. PAST integrates spreadsheet-type data entry with univariate and multivariate statistics, curve fitting, time-series analysis, data plotting, and simple phylogenetic analysis. Many of the functions are specific to paleontology and ecology, and these functions are not found in standard, more extensive, statistical packages. PAST also includes fourteen case studies (data files and exercises) illustrating use of the program for paleontological problems, making it a complete educational package for courses in quantitative methods.
KEY WORDS: Software, data analysis, education
Copyright: Palaeontological Association, 22 June 2001
Submission: 28 February 2001 Acceptance: 13 May 2001
INTRODUCTION
Even a cursory glance at the recent paleontological literature should convince anyone that quantitative methods in paleontology have arrived at last. Nevertheless, many paleontologists still hesitate in applying such methods to their own data. One of the reasons for this has been the difficulty in acquiring and using appropriate data-analysis software. The ‘PALSTAT’ program was developed in the 1980s in order to minimize such obstacles and provide students with a coherent, easy-to-use package that supported a wide range of algorithms while allowing hands-on experience with quantitative methods. The first PALSTAT version was programmed for the BBC microcomputer (Harper and Ryan 1987), while later revisions were made for the PC (Ryan et al. 1995). Incorporating univariate and multivariate statistics and other plotting and analytical functions specific to paleontology and ecology, PALSTAT gained a wide user base among both paleontologists and biologists.
After some years of service, however, it was becoming clear that PALSTAT had to undergo major revision. The DOS-based user interface and an architecture designed for computers with minuscule memories (by modern standards) were becoming an obstacle for most users. Also, the field of quantitative paleontology has changed and expanded considerably in the last 15 years, requiring the implementation of many new algorithms. Therefore, in 1999 we decided to redesign the program totally, keeping the general concept but without concern for the original source code. The new program, called PAST (PAleontological STatistics), takes full advantage of the Windows operating system, with a modern, spreadsheet-based user interface and extensive graphics. Most PAST algorithms produce graphical output automatically, and the high-quality figures can be printed or pasted into other programs. The functionality has been extended substantially with inclusion of important algorithms in the standard PAST toolbox. Functions found in PAST that were not available in PALSTAT include (but are not limited to) parsimony analysis with cladogram plotting, detrended correspondence analysis, principal coordinates analysis, time-series analysis (spectral and autocorrelation), geometrical analysis (point distribution and Fourier shape analysis), rarefaction, modelling by nonlinear functions (e.g., logistic curve, sum-of-sines), and quantitative biostratigraphy using the unitary associations method. We believe that the functions we have implemented reflect the present practice of paleontological data analysis, with the exception of some functionality that we hope to include in future versions (e.g., morphometric analysis with landmark data and more methods for the validation and correction of diversity curves).
One of the main ideas behind PAST is to include many functions in a single program package while providing for a consistent user interface. This minimizes time spent on searching for, buying, and learning a new program each time a new method is approached. Similar projects are being undertaken in other fields (e.g., systematics and morphometry). One example is Wayne Maddison’s ‘Mesquite’ package (mesquite.biosci.arizona.edu/mesquite/mesquite.html).
An important aspect of PALSTAT was the inclusion of case studies, including data sets designed to illustrate possible uses of the algorithms. Working through the examples allowed the student to obtain a practical overview of the different methodologies in a very efficient way. Some of these case studies have been adjusted and included in PAST, and new case studies have been added in order to demonstrate the new features. The case studies are primarily designed as student exercises for courses in paleontological data analysis. The PAST program, documentation, and case studies are available free of charge at www./ ~ohammer/past.
PLOTTING AND BASIC STATISTICS
Graphical plotting functions (see www./~ohammer/past/plot.html) in PAST include different types of graph, histogram, and scatter plots. The program can also produce ternary (triangle) plots and survivorship curves.
Descriptive statistics (see www./~ohammer/past/univar.html) include minimum, maximum, and mean values, population variance, sample variance, population and sample standard deviations, median, skewness, and kurtosis.
For associations or paleocommunity data, several diversity statistics can be computed: number of taxa, number of individuals, dominance, Simpson index, Shannon index (entropy), Menhinick’s and Margalef’s richness indices, equitability, and Fisher’s α (Harper 1999).
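Several of these statistics follow directly from the per-taxon counts. The sketch below is illustrative only: it uses common textbook definitions (dominance as the sum of squared proportions, Simpson index as one minus dominance, equitability as Shannon entropy divided by the log of the number of taxa), which are assumptions here and may differ in detail from the conventions used in PAST. The function name is hypothetical, and Fisher’s α is omitted because it requires an iterative solution.

```python
import math

def diversity_indices(counts):
    """Common diversity statistics for a list of per-taxon abundance counts.

    Definitions are textbook conventions (an assumption, not taken from PAST).
    """
    counts = [c for c in counts if c > 0]      # ignore taxa that are absent
    taxa = len(counts)                         # number of taxa, S
    n = sum(counts)                            # number of individuals
    props = [c / n for c in counts]            # relative abundances p_i
    dominance = sum(p * p for p in props)      # sum of p_i^2
    shannon = -sum(p * math.log(p) for p in props)
    return {
        "taxa": taxa,
        "individuals": n,
        "dominance": dominance,
        "simpson": 1 - dominance,
        "shannon": shannon,
        "menhinick": taxa / math.sqrt(n),
        "margalef": (taxa - 1) / math.log(n) if n > 1 else 0.0,
        "equitability": shannon / math.log(taxa) if taxa > 1 else 0.0,
    }

result = diversity_indices([10, 10, 10, 10, 0])
```

For four equally abundant taxa the equitability reaches its maximum of 1 and the dominance its minimum of 1/S.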
Rarefaction (Krebs 1989) is a method for estimating the number of taxa in a small sample, when abundance data for a larger sample are given. With this method, the number of taxa in samples of different sizes can be compared. An example application of rarefaction in paleontology is given by Adrain et al. (2000).
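As a minimal sketch of the idea, the expected number of taxa in a subsample of n individuals can be computed with the widely used hypergeometric (Hurlbert) formula, E(S_n) = Σ_i [1 − C(N − N_i, n) / C(N, n)]. Whether PAST uses exactly this form is an assumption, and the function name is hypothetical.

```python
from math import comb

def rarefied_taxa(counts, n):
    """Expected number of taxa in a random subsample of n individuals.

    counts: abundance of each taxon in the full sample (N individuals in total).
    Uses the hypergeometric rarefaction formula (assumed, not taken from PAST).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the sample size")
    all_draws = comb(total, n)
    # For each taxon: 1 minus the probability that none of its individuals is drawn.
    return sum(1 - comb(total - c, n) / all_draws for c in counts if c > 0)

# Rarefaction curve for a sample of 30 individuals in 4 taxa.
curve = [rarefied_taxa([20, 6, 3, 1], n) for n in range(1, 31)]
```

Evaluating the function over a range of n gives the rarefaction curve, which rises from 1 taxon at n = 1 to the observed number of taxa at n = N.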
The program also includes standard statistical tests (see www./~ohammer/past/twots.html) for univariate data, including: tests for normality (chi-squared and Shapiro-Wilk), the F and t tests, one-way ANOVA, χ² for comparing binned samples, Mann-Whitney’s U test and Kolmogorov-Smirnov association test (non-parametric), and both Spearman’s ρ and Kendall’s τ non-parametric rank-order tests. Dice and Jaccard similarity indices are used for comparing associations limited to absence/presence data. The Raup-Crick randomization method for comparing associations (Raup and Crick 1979) is also implemented. Finally, the program can also compute correlation matrices and perform contingency-table analysis.
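The two presence/absence similarity indices are simple enough to sketch. The definitions below are the standard ones (Jaccard: shared taxa divided by all taxa present in either association; Dice: twice the shared taxa divided by the summed taxon counts); the function names and the example taxa are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two associations given as sets of taxon names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """Dice similarity: shared taxa are weighted twice relative to Jaccard."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Two hypothetical associations sharing two of four taxa.
site_1 = {"Atrypa", "Leptaena", "Dalmanites"}
site_2 = {"Leptaena", "Dalmanites", "Favosites"}
similarities = (jaccard(site_1, site_2), dice(site_1, site_2))
```

Because Dice gives more weight to joint occurrences, it is never smaller than Jaccard for the same pair of associations.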
MULTIVARIATE ANALYSIS
Paleontological data sets, whether based on fossil occurrences or morphology, often have high dimensionality. PAST includes several methods for multivariate data analysis (see www./ ~ohammer/past/multivar.html), including methods that are specific to paleontology and biology.
Principal components analysis (PCA) is a procedure for finding hypothetical variables (components) that account for as much of the variance in a multidimensional data set as possible (Davis 1986, Harper 1999). These new variables are linear combinations of the original variables. PCA is a standard method for reducing the dimensionality of morphometric and ecological data. The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix. The eigenvalues, giving a measure of the variance accounted for by the corresponding eigenvectors (components), are displayed together with the percentages of variance accounted for by each of these components. A scatter plot of the data projected onto the principal components is provided, along with the option of including the Minimal Spanning Tree, which is the shortest possible set of connected lines joining all points. This may be used as a visual aid in grouping close points (Harper 1999). The component loadings can also be plotted. Bruton and Owen (1988) describe a typical morphometrical application of PCA.
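The Minimal Spanning Tree mentioned above can be sketched with Prim’s algorithm, which grows the tree one point at a time by always adding the shortest edge that reaches a point not yet connected. This is an illustrative sketch under the assumption of Euclidean distances between projected points; it is not taken from the PAST source, and the function name is hypothetical.

```python
import math

def minimal_spanning_tree(points):
    """Return the edges (i, j) of a minimal spanning tree over 2-D or n-D points.

    Simple O(n^3) version of Prim's algorithm, adequate for small scatter plots.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = {0}                      # start from the first point
    edges = []
    while len(in_tree) < n:
        # Shortest edge from any connected point to any unconnected point.
        i, j = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges

scores = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
tree = minimal_spanning_tree(scores)
```

A tree over n points always has n − 1 edges; an unusually long edge, such as the one reaching the last point here, suggests a gap between groups.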
Principal coordinates analysis (PCO) is another ordination method, somewhat similar to PCA. The PCO routine finds the eigenvalues and eigenvectors of a matrix containing the distances between all data points, measured with the Gower distance or the Euclidean distance. The PCO algorithm used in PAST was taken from Davis (1986), which also includes a more detailed description of the method and example analysis.
Correspondence analysis (CA) is a further ordination method, somewhat similar to PCA, but for counted or discrete data. Correspondence analysis can compare associations containing counts of taxa or counted taxa across associations. Also, CA is more suitable if it is expected that species have unimodal responses to the underlying parameters, that is, they favor a certain range of the parameter and become rare for lower and higher values (this is in contrast to PCA, which assumes a linear response). The CA algorithm employed in PAST is taken from Davis (1986), which also includes a more detailed description of the method and example analysis. Ordination of both samples and taxa can be plotted in the same CA coordinate system, whose axes will normally be interpreted in terms of environmental parameters (e.g., water depth, type of substrate, temperature).
The Detrended Correspondence Analysis (DCA) module uses the same ‘reciprocal averaging’ algorithm as the program Decorana (Hill and Gauch 1980). It is specialized for use on “ecological” data sets with abundance data (taxa in rows, localities in columns), and it has become a standard method for studying gradients in such data. Detrending is a type of normalization procedure in two steps. The first step involves an attempt to “straighten out” points lying along an arch-like pattern (= Kendall’s Horseshoe). The second step involves “spreading out” the points to avoid artificial clustering at the edges of the plot.
Hierarchical clustering routines produce a dendrogram showing how and where data points can be clustered (Davis 1986, Harper 1999). Clustering is one of the most commonly used methods of multivariate data analysis in paleontology. Both R-mode clustering (groupings of taxa) and Q-mode clustering (grouping variables or associations) can be carried out within PAST by transposing the data matrix. Three different clustering algorithms are available: the unweighted pair-group average (UPGMA) algorithm, the single linkage (nearest neighbor) algorithm, and Ward’s method. The similarity-association matrix upon which the clusters are based can be computed using nine different indices: Euclidean distance, correlation (using Pearson’s r or Spearman’s ρ), Bray-Curtis, chord, and Morisita indices for abundance data, and Dice, Jaccard, and Raup-Crick indices for presence-absence data.
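To make the UPGMA procedure concrete, the following sketch repeatedly merges the two clusters with the smallest average pairwise distance between their members and records the height of each merge, which is the information a dendrogram displays. It is a deliberately naive illustration working from a precomputed distance matrix; the function name and the example distances are assumptions, and it is not the PAST implementation.

```python
from itertools import combinations

def upgma(dist):
    """Cluster items 0..n-1 from a symmetric distance matrix (list of lists).

    Returns a list of merges (cluster_a, cluster_b, height), where clusters are
    tuples of item indices and height is the average distance between members.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []

    def average_distance(a, b):
        # Unweighted pair-group average: mean of all member-to-member distances.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((a, b, average_distance(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
    return merges

# Distances between four samples lying at positions 0, 1, 5, and 7 on a line.
distances = [
    [0, 1, 5, 7],
    [1, 0, 4, 6],
    [5, 4, 0, 2],
    [7, 6, 2, 0],
]
dendrogram = upgma(distances)
```

Replacing the average in `average_distance` with the minimum member-to-member distance would turn the same loop into single linkage (nearest neighbor) clustering.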
Seriation of an absence-presence matrix can be performed using the algorithm described by Brower and Kyle (1988). For constrained seriation, columns should be ordered according to some external criterion (normally stratigraphic level) or positioned along a presumed faunal gradient. Seriation routines attempt to reorganize the data matrix such that the presences are concentrated along the diagonal. Also, in the constrained mode, the program runs a ‘Monte Carlo’ simulation to determine whether the original matrix is more informative than a random matrix. In the unconstrained mode both rows and columns are free to move: the method then amounts to a simple form of ordination.
The degree of separation between two hypothesized groups (e.g., species or morphs) can be investigated using discriminant analysis (Davis 1986). Given two sets of multivariate data, an axis is constructed that maximizes the differences between the sets. The two sets are then plotted along this axis using a histogram. The null hypothesis of group means equality is tested using Hotelling’s T² test.
CURVE FITTING AND TIME-SERIES ANALYSIS
Curve fitting (see www./~ohammer/past/fitting.html) in PAST includes a range of linear and non-linear functions.
Linear regression can be performed with two different algorithms: standard (least-squares) regression and the “Reduced Major Axis” method. Least-squares regression keeps the x values fixed, and it finds the line that minimizes the squared errors in the y values. Reduced Major Axis minimizes both the x and the y errors simultaneously. Both x and y values can also be log-transformed, in effect fitting the data to the “allometric” function $y = 10^{b} x^{a}$. An allometric slope value around 1.0 indicates that an “isometric” fit may be more applicable to the data than an allometric fit. Values for the regression slope and intercepts, their errors, a χ² correlation value, Pearson’s r coefficient, and the probability that the columns are not correlated are given.
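The difference between the two line-fitting methods can be illustrated with a short sketch. It assumes the usual closed-form estimates: the least-squares slope is the covariance of x and y divided by the variance of x, while the Reduced Major Axis slope is the ratio of the standard deviations of y and x, carrying the sign of the correlation. The function name is hypothetical, and the error estimates reported by PAST are not reproduced.

```python
import math

def fit_lines(x, y):
    """Fit y = slope * x + intercept by least squares and by Reduced Major Axis."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    ols_slope = sxy / sxx                                  # errors in y only
    rma_slope = math.copysign(math.sqrt(syy / sxx), sxy)   # errors in x and y
    return {
        "ols": (ols_slope, mean_y - ols_slope * mean_x),
        "rma": (rma_slope, mean_y - rma_slope * mean_x),
        "r": sxy / math.sqrt(sxx * syy),                   # Pearson's r
    }

fit = fit_lines([0, 1, 2, 3], [0, 1, 1, 2])
```

The two slopes differ by a factor of r, so they coincide only for perfectly correlated data. For an allometric fit, the same function can be applied to the base-10 logarithms of x and y, in which case the slope corresponds to a and the intercept to b.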
In addition, the sum of up to six sinusoids (not necessarily harmonically related) with frequencies specified by the user, but with unknown amplitudes and phases, can be fitted to bivariate data. This method can be useful for modeling periodicities in time series, such as annual growth cycles or climatic cycles, usually in combination with spectral analysis (see below). The algorithm is based on a least-squares criterion and singular value decomposition (Press et al. 1992). Frequencies can also be estimated by trial and error, by adjusting the frequency so that amplitude is maximized.
Further, PAST allows fitting of data to the logistic equation $y = a/(1 + b e^{-cx})$, using Levenberg-Marquardt nonlinear optimization (Press et al. 1992). The logistic equation can model growth with saturation, and it was used by Sepkoski (1984) to describe the proposed stabilization of marine diversity in the late Palaeozoic. Another option is fitting to the von Bertalanffy growth equation $y = a(1 - b e^{-cx})$. This equation is used for modeling growth of multi-celled animals (Brown and Rothery 1993).
Searching for periodicities in time series (data sampled as a function of time) has been an important and controversial subject in paleontology in the last few decades, and we have therefore implemented two methods for such analysis in the program: spectral analysis and autocorrelation. Spectral (harmonic) analysis of time series can be performed using the Lomb periodogram algorithm, which is more appropriate than the standard Fast Fourier Transform for paleontological data (which are often unevenly sampled; Press et al. 1992). Evenly-spaced data are of course also accepted. In addition to the plotting of the periodogram, the highest peak in the spectrum is presented with its frequency and power value, together with a probability that the peak could occur from random data. The data set can be optionally detrended (linear component removed) prior to analysis. Applications include detection of Milankovitch cycles in isotopic data (Muller and MacDonald 2000) and searching for periodicities in diversity curves (Raup and Sepkoski 1984). Autocorrelation (Davis 1986) can be carried out on evenly sampled temporal-stratigraphical data. A predominantly zero autocorrelation signifies random data; periodicities turn up as peaks.
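A minimal sketch of the Lomb periodogram for unevenly sampled data is given below. It follows the normalized form described by Press et al. (1992), in which a frequency-dependent time offset makes the result independent of the time origin. The function name, the synthetic test signal, and the frequency grid are illustrative assumptions, and the significance estimate for the highest peak is not included.

```python
import math

def lomb_power(times, values, freq):
    """Normalized Lomb periodogram power at one frequency (cycles per time unit)."""
    n = len(times)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    w = 2 * math.pi * freq
    # Time offset tau that decouples the sine and cosine terms.
    tau = math.atan2(
        sum(math.sin(2 * w * t) for t in times),
        sum(math.cos(2 * w * t) for t in times),
    ) / (2 * w)
    cos_terms = [math.cos(w * (t - tau)) for t in times]
    sin_terms = [math.sin(w * (t - tau)) for t in times]
    yc = sum((v - mean) * c for v, c in zip(values, cos_terms))
    ys = sum((v - mean) * s for v, s in zip(values, sin_terms))
    cc = sum(c * c for c in cos_terms)
    ss = sum(s * s for s in sin_terms)
    return (yc * yc / cc + ys * ys / ss) / (2 * variance)

# Synthetic, unevenly sampled series containing a cycle of frequency 0.1.
sample_times = [j + 0.3 * math.sin(j) for j in range(60)]
signal = [math.sin(2 * math.pi * 0.1 * t) for t in sample_times]
frequencies = [i / 100 for i in range(1, 51)]
spectrum = [lomb_power(sample_times, signal, f) for f in frequencies]
peak_frequency = frequencies[spectrum.index(max(spectrum))]
```

Scanning a grid of trial frequencies and plotting the power against frequency gives the periodogram; in this synthetic example the highest peak falls at the frequency of the embedded cycle.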
GEOMETRICAL ANALYSIS
PAST includes some functionality for geometrical analysis (see www./~ohammer/past/morpho.html), even if an extensive morphometrics module has not yet been implemented. We hope to implement more extensive functionality, such as landmark-based methods, in future versions of the program.
The program can plot rose diagrams (polar histograms) of directions. These can be used for plotting current-oriented specimens, orientations of trackways, orientations of morphological features (e.g., trilobite terrace lines), etc. The mean angle together with Rayleigh’s spread are given. Rayleigh’s spread is further tested against a random distribution using Rayleigh’s test for directional data (Davis 1986). A χ² test is also available, giving