Psychological Bulletin, 1989, Vol. 105, No. 1, 156-166
The Unicorn, The Normal Curve, and Other Improbable Creatures
Theodore Micceri¹
Department of Educational Leadership
University of South Florida
An investigation of the distributional characteristics of 440 large-sample achievement and psychometric measures found all to be significantly nonnormal at the alpha .01 significance level. Several classes of contamination were found, including tail weights from the uniform to the double exponential, exponential-level asymmetry, severe digit preferences, multimodalities, and modes external to the mean/median interval. Thus, the underlying tenets of normality-assuming statistics appear fallacious for these commonly used types of data. However, findings here also fail to support the types of distributions used in most prior robustness research suggesting the failure of such statistics under nonnormal conditions. A reevaluation of the statistical robustness literature appears appropriate in light of these findings.
During recent years a considerable literature devoted to robust statistics has appeared. This research reflects a growing concern among statisticians regarding the robustness, or insensitivity, of parametric statistics to violations of their underlying assumptions. Recent findings suggest that the most commonly used of these statistics exhibit varying degrees of nonrobustness to certain violations of the normality assumption. Although the importance of such findings is underscored by numerous empirical studies documenting nonnormality in a variety of fields, a startling lack of such evidence exists for achievement tests and psychometric measures. A naive assumption of normality appears to characterize research involving these discrete, bounded measures. In fact, some contend that given the developmental process used to produce such measures, "a bell shaped distribution is guaranteed" (Walberg, Strykowski, Rovai, & Hung, 1984, p. 107). This inquiry sought to end the tedious arguments regarding the prevalence of normal-like distributions by surveying a large number of real-world achievement and psychometric distributions to determine what distributional characteristics actually occur.
Widespread belief in normality evolved quite naturally within the dominant reductionist religio-philosophy of the 19th century. Early statistical researchers such as Gauss sought some measure to estimate the center of a sample. Hampel (1973) stated,
Gauss . . . introduced the normal distribution to suit the arithmetic mean . . . and . . . developed his statistical theories mainly under the criterion of mathematical simplicity and elegance. (p. 94)
1. The author holds a joint appointment with the Department of Educational Leadership, College of Education, University of South Florida, and with the Assistant Dean's Office, College of Engineering, Center for Interactive Technologies, Applications, and Research. More complete tables are available from the author for postage and handling costs.
Correspondence concerning this article should be addressed to Theodore Micceri, Department of Educational Leadership, University of South Florida, FAO 296, Tampa, Florida 33620.
Certain later scientists, seduced by such elegance, may have spent too much time seeking worldly manifestations of God:
I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the "Law of Frequency of Error." The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement amidst the wildest confusion. (Galton, 1889, p. 66)
Although Galton himself recognized the preceding to hold only for homogeneous populations (Stigler, 1986), such attributions to deity continue to appear in educational and psychological statistics texts:

It is a fortunate coincidence that the measurements of many variables in all disciplines have distributions that are good approximations of the normal distribution. Stated differently, "God loves the normal curve!" (Hopkins & Glass, 1978, p. 95)
Toward the end of the 19th century, biometricians such as Karl Pearson (1895) raised questions about the prevalence of normality among real-world distributions. Distrust of normality increased shortly thereafter when Gosset's (Student, 1908) development of the t test, with its strong assumptions, made statisticians of that time "almost over-conscious of universal non-normality" (Geary, 1947, p. 241). During the 1920s, however, an important change of attitude occurred
following on the brilliant work of R. A. Fisher who showed that, when universal normality could be assumed, inferences of the widest practical usefulness could be drawn from samples of any size. Prejudice in favour of normality returned in full force . . . and the importance of the underlying assumptions was almost forgotten. (Geary, 1947, p. 241)
The preceding illustrates both trends in attitudes toward normality and the influence of R. A. Fisher on 20th-century scientists. Today's literature suggests a trend toward distrust of normality; however, this attitude frequently bypasses psychometricians and educators. Interestingly, the characteristics of their measures provide little support for the expectation of normality because they consist of a number of discrete data points and because their distributions are almost exclusively multinomial in nature. For multinomial distributions, each possible score (sample point) is itself a variable, and correlations may exist among each variable score/sample point. Thus, an extremely large number of possible cumulative distribution functions (cdfs) exist for such distributions, defined by the probability of occurrence for each score/sample point (Hastings & Peacock, 1975, p. 90). The expectation that a single cdf (i.e., Gaussian) characterizes most score distributions for such measures appears unreasonable for several reasons. Nunnally (1978, p. 160) identifies an obvious one: "Strictly speaking, test scores are seldom normally distributed." The items of a test must correlate positively with one another for the measurement method to make sense. "Average correlations as high as .40 would tend to produce a distribution that was markedly flatter than the normal" (Nunnally, 1978, p. 160). Other factors that might contribute to a non-Gaussian error distribution in the population of interest include but are not limited to (a) the existence of undefined subpopulations within a target population having different abilities or attitudes, (b) ceiling or floor effects, (c) variability in the difficulty of items within a measure, and (d) treatment effects that change not only the location parameter and variability but also the shape of a distribution.
Of course, this issue is unimportant if statistics are truly robust; however, considerable research suggests that parametric statistics frequently exhibit either relative or absolute nonrobustness in the presence of certain nonnormal distributions. The arithmetic mean has not proven relatively robust in a variety of situations (Andrews et al., 1972; Ansell, 1973; Gastwirth & Rubin, 1975; Wegman & Carroll, 1977; Stigler, 1977; David & Shu, 1978; Hill & Dixon, 1982). The standard deviation, as an estimate of scale, proves relatively inefficient given only 18/100 of 1% contamination (Hampel, 1973). Others who found the standard deviation relatively nonrobust include Tukey and McLaughlin (1963), Wainer and Thissen (1976), and Hettmansperger and McKean (1978). Kowalski (1972) recommends against using the Pearson product-moment coefficient unless (X, Y) is "very nearly normal" because of both nonrobustness and interpretability. Wainer and Thissen (1976) contend that nothing would be lost by immediately switching to a robust alternative, r_t.
A large, complex literature on the robustness of parametric inferential procedures suggests that, with the exception of the one-mean t or z tests and the random-effects analysis of variance (ANOVA), parametric statistics exhibit robustness or conservatism with regard to alpha in a variety of nonnormal conditions given large and equal sample sizes. Disagreement exists regarding the meaning of large in this context (Bradley, 1980). Also, several reviews suggest that when ns are unequal or samples are small, this robustness disappears in varying situations (Blair, 1981; Ito, 1980; Tan, 1982). In addition, robustness of efficiency (power or beta) studies suggest that competitive tests such as the Wilcoxon rank-sum exhibit considerable power advantages while retaining equivalent robustness of alpha in a variety of situations (Blair, 1981; Tan, 1982).
Although far from conclusive, the preceding indicate that normality-assuming statistics may be relatively nonrobust in the presence of non-Gaussian distributions. In addition, any number of works asserting the nonnormality of specific distributions, and thereby the possible imprecision of statistical procedures dependent on this assumption, may be cited (Allport, 1934; Andrews et al., 1972; Bradley, 1977, 1982; Hampel, 1973; E. S. Pearson & Please, 1975; K. Pearson, 1895; Simon, 1955; Stigler, 1977; Tan, 1982; Tapia & Thompson, 1978; Tukey & McLaughlin, 1963; Wilson & Hilferty, 1929). Despite this, the normality assumption continues to permeate both textbooks and the research literature of the social and behavioral sciences.
The implications of the preceding discussion are difficult to assess because little of the noted robustness research deals with real-world data. The complexity and lack of availability of real-world data compel many researchers to simplify questions by retreating into either asymptotic theory or Monte Carlo investigations of interesting mathematical functions. The eminent statistical historian Stephen Stigler (1977), investigating 18th-century empirical distributions, contended, "the present study may be the first evaluation of modern robust estimators to rely on real data" (p. 1070). Those few researchers venturesome enough to deal with real data (Hill & Dixon, 1982; Stigler, 1977; Tapia & Thompson, 1978) report findings that may call much of the above-cited robustness literature into question: (a) Real data evidence different characteristics than do simulated data; (b) statistics exhibit different properties under real-world conditions than they do in simulated environments; and (c) causal elements for parametric nonrobustness tend to differ from those suggested by theoretical and simulated research.
In an attempt to provide an empirical base from which robustness studies may be related to the real world and about which statistical development may evolve, the current inquiry surveyed specific empirical distributions generated in applied settings to determine which, if any, distributional characteristics typify such measures. This research was limited to measures generally avoided in the past, that is, those based on human responses to questions either testing knowledge (ability/achievement) or inventorying perceptions and opinions (psychometric).
The obvious approach to classifying distributions, à la K. Pearson (1895), Simon (1955), Taillie, Patil, and Baldessari (1981), and Law and Vincent (1983), is to define functionals characterizing actual score distributions. Unfortunately, this approach confronts problems when faced with the intractable data of empiricism. Tapia and Thompson (1978), in their discussion of the Pearson system of curves, contend that even after going through the strenuous process of determining which of the six Pearson curves a distribution appears to fit, one cannot be sure either that the chosen curve is correct or that the distribution itself is actually a member of the Pearson family. They suggest that one might just as well estimate the density function itself. Such a task, although feasible, is both complex and uncertain. Problems of identifiability exist for mixed distributions (Blischke, 1978; Quandt & Ramsey, 1978; Taillie et al., 1981), in which the specification of different parameter values can result in identical mixed distributions, even for mathematically tractable two-parameter distributions such as the Gaussian. Kempthorne (1978) argues that "almost all" distributional problems are insoluble with a discrete sample space, notwithstanding "the fact that elementary texts are replete with finite space problems that are soluble" (p. 12).
No attempt is made here to solve the insoluble. Rather, this inquiry attempted, as suggested by Stigler (1977), to determine the degree and frequency with which various forms of contamination (e.g., heavy tails or extreme asymmetry) occur among real data. Even the comparatively simple process of classifying empirical distributions using only symmetry and tail weight has pitfalls. Elashoff and Elashoff (1978), discussing estimates of tail weight, note that "no single parameter can summarize the varied meanings of tail length" (p. 231). The same is true for symmetry or the lack of it (Gastwirth, 1971; Hill & Dixon, 1982). Therefore, multiple measures of both tail weight and asymmetry were used to classify distributions.
As robust measures of tail weight, Q statistics (ratios of outer means) and C statistics (ratios of outer percentile points) receive support. Hill and Dixon (1982), Elashoff and Elashoff (1978), Wegman and Carroll (1977), and Hogg (1974) discuss the Q statistics, and Wilson and Hilferty (1929), Mosteller and Tukey (1978), and Elashoff and Elashoff (1978) discuss the C statistics.
As a robust measure of asymmetry, Hill and Dixon (1982) recommend Hogg's (1974) Q2. However, Q2 depends on contamination in the tails of distributions and is not sensitive to asymmetry occurring only between the 75th and 95th percentiles. An alternative suggested by Gastwirth (1971) is a standardized value of the population mean/median interval. In the symmetric case, as sample size increases, the statistic should approach zero. In the asymmetric case, as sample size increases, the statistic will tend to converge toward a value indicating the degree of asymmetry in a distribution.
Method
Two problems in obtaining a reasonably representative sample of psychometric and achievement/ability measures are (a) lack of availability and (b) small sample sizes. Samples of 400 or greater were sought to provide reasonably stable estimates of distributional characteristics. Distributions, by necessity, were obtained on an availability basis. Requests were made of 15 major test publishers, the University of South Florida's institutional research department, the Florida Department of Education, and several Florida school districts for ability score distributions in excess of 400 cases. In addition, requests were sent to the authors of every article citing the use of an ability or psychometric measure on more than 400 individuals between the years 1982 and 1984 in Applied Psychology, Journal of Research in Personality, Journal of Personality, Journal of Personality Assessment, Multivariate Behavioral Research, Perceptual and Motor Skills, Applied Psychological Measurement, Journal of Experimental Education, Journal of Educational Psychology, Journal of Educational Research, and Personnel Psychology. A total of over 500 score distributions were obtained, but because many were different applications of the same measure, only 440 were submitted to analysis.
Four types of measures were sampled separately: general achievement/ability tests, criterion/mastery tests, psychometric measures, and, where available, gain scores (the difference between a pre- and post-measure).
For each distribution, three measures of symmetry/asymmetry were computed: (a) M/M intervals (Hill & Dixon, 1982), defined as the mean/median interval divided by a robust scale estimate (1.4807 multiplied by one half the interquartile range); (b) skewness; and (c) Hogg's (1974) Q2, where
Q2 = [U(.05) - M(.25)] / [M(.25) - L(.05)]
where U(alpha) [M(alpha), L(alpha)] is the mean of the upper (middle, lower) [(N + 1)alpha] observations. The inverse of this ratio defines Q2 for the lower tail.
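For concreteness, a minimal sketch in Python with NumPy of the three asymmetry measures might look like the following. All function names are ours, not the paper's, and reading M(.25) as the mean of the middle (N + 1)(.25) observations is our inference from the definition above.

```python
import numpy as np

def tail_means(x, alpha):
    """U(alpha), L(alpha): means of the upper and lower (N + 1)*alpha observations."""
    x = np.sort(np.asarray(x, dtype=float))
    k = max(1, int((len(x) + 1) * alpha))
    return x[-k:].mean(), x[:k].mean()

def middle_mean(x, alpha=0.25):
    """M(alpha): mean of the middle (N + 1)*alpha observations."""
    x = np.sort(np.asarray(x, dtype=float))
    k = max(1, int((len(x) + 1) * alpha))
    start = (len(x) - k) // 2
    return x[start:start + k].mean()

def hogg_q2(x):
    """Hogg's (1974) Q2 asymmetry ratio for the upper tail."""
    u, l = tail_means(x, 0.05)
    m = middle_mean(x)
    return (u - m) / (m - l)

def mm_interval(x):
    """Standardized mean/median interval: (mean - median) divided by the
    robust scale estimate 1.4807 * (one half the interquartile range)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    return (x.mean() - np.median(x)) / (1.4807 * (q75 - q25) / 2)

def skewness(x):
    """Conventional third-moment skewness."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3
```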
Two different types of tail weight measure were also computed: (a) Hogg's (1974) Q and Q1, where
Q = [U(.05) - L(.05)] / [U(.50) - L(.50)]

Q1 = [U(.20) - L(.20)] / [U(.50) - L(.50)]
and (b) C ratios of Elashoff and Elashoff (1978): C90, C95, and C97.5 (the ratio of the 90th, 95th, and 97.5th percentile points, respectively, to the 75th percentile point).1 The Q statistics are sensitive to relative density and the C statistics to distance (between percentiles). Kurtosis, although computed, was not used for classification because of interpretability problems.
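A companion sketch for the tail-weight measures, under the same caveats (Python/NumPy, names ours); the median-centering of the C ratios follows footnote 1:

```python
import numpy as np

def tail_means(x, alpha):
    """U(alpha), L(alpha): means of the upper and lower (N + 1)*alpha observations."""
    x = np.sort(np.asarray(x, dtype=float))
    k = max(1, int((len(x) + 1) * alpha))
    return x[-k:].mean(), x[:k].mean()

def hogg_q_q1(x):
    """Hogg's (1974) Q and Q1 tail-weight ratios."""
    u05, l05 = tail_means(x, 0.05)
    u20, l20 = tail_means(x, 0.20)
    u50, l50 = tail_means(x, 0.50)
    return (u05 - l05) / (u50 - l50), (u20 - l20) / (u50 - l50)

def c_ratios(x):
    """C90, C95, C97.5: ratios of the 90th, 95th, and 97.5th percentile points
    to the 75th, after subtracting the median and taking absolute values."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    p75 = np.percentile(x, 75) - med
    return tuple(abs((np.percentile(x, p) - med) / p75) for p in (90, 95, 97.5))
```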
Criterion values of contamination were determined for these measures using tabled values for symmetric distributions (Elashoff & Elashoff, 1978) and simulated values for asymmetric distributions. Table 1 shows five cut points defining six levels of tail weight (uniform to double exponential) and three cut points defining four levels of symmetry or asymmetry (relatively symmetric to exponential).
Table 1. Criterion Values for Measures of Tail Weight and Symmetry

                           Tail weight                       Symmetry/asymmetry
Distribution               C97.5   C95    C90    Q      Q1     Skewness  mn/mdn  Q2

Expected values
Uniform                    1.90    1.80   1.60   1.90   1.60   0.00      0.00    1.00
Gaussian                   2.90    2.40   1.90   2.58   1.75   0.00      0.00    1.00
Double exponential         4.30    3.30   2.30   3.30   1.93   2.00      0.37    4.70

Cut points
Uniform                    1.90    1.80   1.60   1.90   1.60   -         -       -
Below Gaussian             2.75    2.30   1.85   2.50   1.70   -         -       -
Moderate contamination     3.05    2.50   1.93   2.65   1.80   0.31      0.05    1.25
Extreme contamination      3.90    2.80   2.00   2.73   1.85   0.71      0.18    1.75
Double exponential         4.30    3.30   2.30   3.30   1.93   2.00      0.37    4.70
Cut points were set arbitrarily, and those defining moderate contamination of either tail weight or asymmetry were selected only to identify distributions as definitely non-Gaussian. The moderate contamination cut points (both symmetric and asymmetric) were set at 5% and 15% contamination on the basis of the support for the alpha-trimmed mean and trimmed t in the research literature. Moderate contamination (5%, 2 sd) represents at least twice the expected observations more than 2 standard deviations from the mean, and extreme contamination (15%, 3 sd) represents more than 100 times the expected observations over 3 standard deviations from the mean. Distributions were placed in that category defined by their highest valued measure.
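One plausible reading of the Table 1 cut points as a classification rule is sketched below for the C97.5 column only. The bin boundaries come from the table; treating each cut as an upper bound on its category is our interpretation, not a rule stated in the text.

```python
# Tail-weight categories implied by the five C97.5 cut points in Table 1.
# Values above the last cut are classified as double exponential.
C975_CUTS = [
    (1.90, "uniform"),
    (2.75, "below Gaussian"),
    (3.05, "about Gaussian"),
    (3.90, "moderate contamination"),
    (4.30, "extreme contamination"),
]

def classify_tail_weight(c975):
    """Assign a tail-weight category to an observed C97.5 value."""
    for bound, label in C975_CUTS:
        if c975 <= bound:
            return label
    return "double exponential"
```

In the full procedure each distribution would be scored on every measure and placed in the category defined by its highest valued measure, as described above.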
Two thousand replications of each classification statistic were computed to investigate sampling error for samples of size 500 and 1,000 for simulated Gaussian, moderate, extreme, and exponential contaminations (Table 1) using International Mathematical and Statistical Library subprograms GGUBS, GGNML, and GGEXN. Only slight differences occurred between sample sizes 500 and 1,000. Each statistic was at expectation for the Gaussian (50% above and 50% below cut). Results for asymmetric conditions indicate that cut points for moderate contamination underestimate nonnormality, with 70.4% (skewness), 81.2% (Q2), and 72.2% (M/M) of the simulated statistics falling below cut values at sample size 1,000. For extreme asymmetric contamination, simulated values closely fit expectations. However, for the exponential distribution, skewness cut points underestimate contamination (62% below cut), whereas those for Q2 and M/M overestimate contamination (35% and 43%, respectively, below cut) for sample size 1,000. Among tail weight measures, the most variable estimate (C97.5) showed considerable precision for the most extreme distribution (exponential), placing 45% of its simulated values below expected for sample size 1,000. This suggests that one might expect some misclassifications among distributions near the cut points for moderate and exponential asymmetry, with relative precision at other cut values.

1. Because score distributions did not have a mean of zero, in order to compute percentile ratios it was necessary to subtract the median from each of the relevant percentile points and use the absolute values of the ratios.
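This sampling-error check can be approximated as below. The IMSL subprograms GGUBS, GGNML, and GGEXN are uniform, normal, and exponential generators, replaced here by their NumPy counterparts; the seed and helper names are ours, and the skewness function is as sketched earlier.

```python
import numpy as np

rng = np.random.default_rng(1989)  # arbitrary seed

def skewness(x):
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

def below_cut_rate(stat, sampler, cut, n=1000, reps=2000):
    """Proportion of replications in which the statistic falls below the cut."""
    return np.mean([stat(sampler(n)) < cut for _ in range(reps)])

# Example: skewness of exponential samples against the 2.00 exponential-level
# cut; the text reports roughly 62% of simulated values below this cut at
# sample size 1,000.
rate = below_cut_rate(skewness, lambda n: rng.exponential(size=n), cut=2.00)
```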
Figure 1 shows a light-tailed, moderately asymmetric distribution as categorized by the preceding criteria.

Multimodality and digit preferences also present identifiability problems for distributions other than the strict Gaussian. Therefore, arbitrary but conservative methods were used to define these forms of contamination. Two techniques, one objective and one subjective, were used to identify modality. First, histograms of all distributions were reviewed, and those clearly exhibiting more than a single mode were classified as such. Second, during computer analysis, all sample points occurring with a frequency at least 80% of that of the true mode (up to a maximum of five) were identified, and the absolute distance between adjacent modes was computed. Distributions with distances between adjacent modes greater than two thirds (.667) of their standard deviation were defined as bimodal. If more than one distance was this great, the distribution was defined as multimodal. In general, the two techniques coincided during application.
Figure 1. A light-tailed, moderately asymmetric distribution (n = 3,152).
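A minimal sketch of the objective modality screen just described: the 80% and two-thirds-SD thresholds come from the text, whereas details such as which five candidates to keep are our assumptions.

```python
from collections import Counter
import numpy as np

def modality(scores, rel_height=0.80, sep_fraction=2/3, max_modes=5):
    """Classify a discrete score distribution as unimodal, bimodal, or multimodal."""
    scores = np.asarray(scores, dtype=float)
    freq = Counter(scores.tolist())
    peak = max(freq.values())
    # Sample points with frequency at least 80% of the true mode's, capped at
    # five candidates (we keep the five most frequent; the text does not say).
    candidates = sorted(
        [v for v, c in freq.items() if c >= rel_height * peak],
        key=lambda v: freq[v], reverse=True)[:max_modes]
    candidates.sort()
    sd = scores.std()
    # Count distances between adjacent candidate modes exceeding (2/3) * SD.
    wide_gaps = sum(1 for a, b in zip(candidates, candidates[1:])
                    if b - a > sep_fraction * sd)
    if wide_gaps == 0:
        return "unimodal"
    return "bimodal" if wide_gaps == 1 else "multimodal"
```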
Digits were defined as preferred if they occurred at least 20 times and if adjacent digits on both sides had fewer than 70% or greater than 130% as many cases. A digit preference value was computed by multiplying the number of digits showing preference by the inverse of the maximum percentage of preference for each distribution. A digit preference value exceeding 20 (at least four preferred digits with a maximum of 50% preference) was defined as lumpy. In addition, perceived lumpiness was identified. Figure 2 depicts a psychometric distribution that required a perceptual technique for classification as either lumpy or multimodal. This distribution consists of at least two and perhaps three fairly distinct subpopulations.
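A hedged sketch of the digit-preference screen follows. The text does not pin down "percentage of preference," so, as an assumption, we read it as the larger adjacent frequency relative to the preferred point's own frequency; the thresholds (20 occurrences, 70%/130%, value > 20) are from the text.

```python
from collections import Counter

def digit_preference_value(scores):
    """Number of preferred score points times the inverse of the maximum
    preference percentage; values exceeding 20 were labeled lumpy."""
    freq = Counter(int(s) for s in scores)
    n_preferred, max_pct = 0, 0.0
    for point, count in freq.items():
        if count < 20:
            continue
        left, right = freq.get(point - 1, 0), freq.get(point + 1, 0)
        # Preferred: both adjacent points have fewer than 70% or more than
        # 130% as many cases as this point.
        if all(nb < 0.70 * count or nb > 1.30 * count for nb in (left, right)):
            n_preferred += 1
            # Assumed reading: preference percentage = larger neighbor / count.
            max_pct = max(max_pct, max(left, right) / count)
    if n_preferred == 0 or max_pct == 0.0:
        return 0.0  # no preferred digits, or only isolated spikes
    return n_preferred / max_pct
```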