Abstract
It has been argued that the application in DIF research of the continuity correction to the Mantel-Haenszel (MH) chi-square test is to improve the estimation of the hypergeometric probability even though this might increase the difference between the nominal error rate and the size of the test. In this article it is argued that the correction must be justified by the context in which observations are sampled, or by the extent of our knowledge about how observations were sampled. Use of the correction is based on an inaccurate assumption and may, but does not necessarily, lead to conservative estimates of type 1 error levels, with the practical effect that fewer items would be scrutinized for potential bias in DIF investigations. However, many approximate tests for DIF, like the MH test, match examinees on observed total score, which is not sufficient to establish comparability for some item response models. In this case, the correction is one of several factors whose complex interaction can affect the performance of the MH test.
Should the Continuity Correction be Applied to the
Mantel-Haenszel Chi-Square Statistic In DIF Research?
Gregory Camilli
University of Colorado
Key Words: differential item functioning, DIF, Mantel-Haenszel chi square, continuity correction, simulation, test size, type 1 errors
The Continuity Correction and the Mantel-Haenszel Chi-Square Statistic
The Mantel-Haenszel (MH) chi-square statistic (Mantel & Haenszel, 1959) was proposed for research in differential item functioning (DIF) by Holland and Thayer (1988). This test statistic, as presented by Mantel and Haenszel (1959) and Cox (1958), is based on K (K > 1) 2x2 tables, each following a noncentral hypergeometric distribution with log-odds parameter β (more details on this distribution can be found in Camilli [1995]). In DIF research, β can be taken to represent a uniform log-odds ratio, and in this case the MH chi square is based on a normal approximation statistic for testing the hypothesis that the common odds ratio is equal to unity (Cox, 1970) under the assumption of uniform DIF. This is equivalent to the test of H_0: β = 0. As proposed by Holland and Thayer, the MH chi-square test is
MH-CHISQ = \frac{\left( \left| \sum_j A_j - \sum_j E(A_j) \right| - \frac{1}{2} \right)^2}{\sum_j \mathrm{Var}(A_j)} ,   (1)
where A_j is the observed frequency in the upper-left (by convention) cell of a 2x2 table, and E(A_j) is its expected value under the null hypothesis. In the above equation, a "correction" factor of 1/2 is used. According to Holland and Thayer (1988):
The effect of the continuity correction is to improve the calculation of the observed significance levels using the chi-square table rather than to make the size of the test equal to the nominal value. Hence simulation studies routinely find that the actual size of a test based on MH-CHISQ is smaller than the nominal value. However, the observed significance level of a large value of MH-CHISQ is better approximated by referring MH-CHISQ to the chi-square tables than by referring [the uncorrected statistic] to the tables. The continuity correction is simply to improve the approximation of a discrete distribution (i.e., MH-CHISQ) by a continuous distribution (i.e., 1 degree-of-freedom chi-square). (p. 135)
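As a concrete illustration of equation (1), the minimal sketch below computes the statistic with and without the correction term. The function name, the table layout, and the clamping of the corrected deviation at zero are my choices for the illustration, not part of the original presentation.

import numpy as np

def mh_chisq(tables, correction=True):
    """Mantel-Haenszel chi-square over K 2x2 tables (equation 1).

    tables: sequence of K tables, each [[a, b], [c, d]], where a is the
    upper-left cell (A_j in the text). Set correction=False to obtain
    the uncorrected statistic.
    """
    t = np.asarray(tables, dtype=float)
    a = t[:, 0, 0]
    r1 = t[:, 0].sum(axis=1)        # row totals: a + b
    r2 = t[:, 1].sum(axis=1)        # c + d
    c1 = t[:, :, 0].sum(axis=1)     # column totals: a + c
    c2 = t[:, :, 1].sum(axis=1)     # b + d
    n = r1 + r2                     # table totals

    e = r1 * c1 / n                                 # E(A_j) under H_0: beta = 0
    v = r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))      # hypergeometric Var(A_j)

    dev = abs(a.sum() - e.sum())
    if correction:
        dev = max(dev - 0.5, 0.0)   # subtract 1/2; clamp at 0 as a safeguard
    return dev ** 2 / v.sum()       # refer to chi-square with 1 df

Referring the returned value to the 1 degree-of-freedom chi-square table yields the observed significance level; running the sketch with correction=False shows directly how the 1/2 shifts the statistic.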
In this paper the application of the correction in real and simulation studies is examined. It is argued that the correction is sensibly used in some instances; however, the unqualified use of the correction for continuity is inconsistent with related conclusions in the statistical literature regarding the continuity correction (e.g., Kroll, 1989; Richardson, 1994). It is not the intent of this paper to rehash this debate. Rather, the purpose is to point out the irony that the "correction" to the Mantel-Haenszel chi-square is inappropriate in some instances, but results in more accurate tests of DIF in others. To make this argument, I will first sketch out a brief history of the problem and then relate the findings to DIF research. Several examples of real and simulation studies will be given in which the correction for continuity has affected type 1 error levels and power.
Historical Background
Neyman and Pearson (Neyman, 1950) defined the sample space for quantitative measurements as all potential outcomes before the data are collected. The sample event space is accordingly generated by hypothetical replications of an experiment, and a test probability is a long-run relative frequency (under the null hypothesis) over an infinite set of replications drawn by random sampling from a population. In their approach to hypothesis testing, the nominal significance level is labeled α, which defines a "tail" region of the null sample space. In real data studies, however, it is not always possible to delimit a population, much less to imagine a sample as a random draw from that population. In this instance, the argument can be made to condition probabilities on key features of the obtained data set; that is, one does not attempt to imagine other potential marginal configurations that could have occurred, but didn't. For 2x2 tables, Fisher (1957) argued that inferences should be made relative to the smallest "recognizable subset," which is obtained by observing the actual values of the marginal frequencies after the data are collected (see Yates, 1984).
Because the sample space preferred by Fisher typically results in a smaller critical region than that of Neyman and Pearson, this theoretical difference has important implications for applied researchers. The differences are well known and well documented, and a number of theorists have made progress in bridging the gap. However, it is important to note that there was one situation in which Fisher and Neyman and Pearson may have been in greater agreement. In acceptance sampling (whether the objects are manufactured products or test questions), the notion of an event space based on replication was plausible to Fisher. In this case, replications of data sets are not only possible, they are realized in a production process if constant standards are applied. Thus, there is an empirical basis for treating a probability as a long-run relative frequency based on resampling from a population, and conditioning is not necessary for constraining statistical inferences.
The Case of 2x2 Tables
In testing for association in 2x2 tables, one point of debate concerns the marginal frequencies of the individual 2x2 tables that comprise the data from which the chi square is calculated. If one treats these as fixed quantities, then for a single 2x2 table the discrete hypergeometric distribution is obtained (case 1). If one set of marginals is free to vary, a product binomial distribution is obtained (case 2), that is, the product of two independent binomial probabilities. Typically, there are many more points in the sample space for the latter. For example, consider the single 2x2 table with marginals
n1   n2  |  4
n3   n4  |  4
---------+---
 4    4  |  8
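Before the counts are worked through below, the short sketch that follows reproduces the three sample-space sizes for this table. Reading "points" as elementary assignments of the 8 observations is my inference from the figures cited (for example, 256 = 2^8), not something the text states explicitly.

from math import comb

# Case 1: both sets of marginals fixed at 4 (hypergeometric). n1 can be
# 0..4, and the elementary arrangements number
#   sum_k C(4, k) * C(4, 4 - k) = C(8, 4) = 70.
case1 = sum(comb(4, k) * comb(4, 4 - k) for k in range(5))

# Case 2: one set of marginals free to vary (product binomial). Each of
# the 8 observations falls into one of 2 columns: 2**8 = 256 points.
case2 = 2 ** 8

# Case 3: both sets of marginals free to vary (multinomial). Each
# observation falls into one of the 4 cells: 4**8 = 65,536 points.
case3 = 4 ** 8

print(case1, case2, case3)   # 70 256 65536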
If marginals are not free to vary, then n1 can take the values 0, 1, 2, 3, 4; counting the underlying arrangements of the 8 observations, there is a total of 70 points in the sample space. However, suppose the column marginals are free to vary, with the first column total taking the possible values 0 through 8. This results in a sample space of 2^8 = 256 points. If both sets of marginals are free to vary, then the multinomial model is obtained, with 4^8 = 65,536 sample points (case 3). Much telescoped, three main conclusions