Educational Measurement: Issues and Practice
Spring 2010, Vol. 29, No. 1, pp. 3–13
Does an Argument-Based Approach to Validity Make a Difference?
Carol A. Chapelle, Iowa State University, Mary K. Enright, Educational Testing Service, Joan Jamieson, Northern Arizona University
Drawing on experience between 2000 and 2007 in developing a validity argument for the high-stakes Test of English as a Foreign Language™ (TOEFL®), this paper evaluates the differences between the argument-based approach to validity as presented by Kane (2006) and that described in the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. Based on an analysis of four points of comparison—framing the intended score interpretation, outlining the essential research, structuring research results into a validity argument, and challenging the validity argument—we conclude that an argument-based approach to validity introduces some new and useful concepts and practices.
Keywords: claim, construct, high-stakes, inference, interpretive argument, Standards, TOEFL, validity argument

Author note: Carol A. Chapelle is a Professor, Department of English, Iowa State University, 203 Ross Hall, Ames, IA 50011; carolc@iastate.edu. Mary K. Enright is Research Director, Center for Validity Research, Educational Testing Service, Rosedale Road MS 10-R, Princeton, NJ 08541. Joan Jamieson is a Professor, English Department, Northern Arizona University, Box 6032, Flagstaff, AZ 86011.
Attempts to improve professional practice in validation have resulted in seemingly new perspectives appearing in the fourth edition of Educational Measurement. In the introduction to the volume, Brennan (2006) contrasts the presentation on validity given by Kane (2006) with that in the previous edition: in 1989 the chapter was an extensive scholarly treatment rather than one “that provides much specific guidance to those who would undertake validation studies” (p. 2). In contrast, the chapter in the 2006 edition “aims at making validation a more accessible enterprise for educational measurement practitioners” (p. 3). Practitioners reading the chapter will notice a change from what appears in both the third edition of Educational Measurement (Linn, 1989) and the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. The introduction of new concepts and frameworks in the 2006 chapter raises the question of whether or not Kane really offers new insights into validation and achieves the aim of making validation more accessible. We address this question in this paper by drawing on our experience between 2000 and 2007 in developing a validity argument for the Test of English as a Foreign Language™ (TOEFL®), a test used for high-stakes decisions about admissions to English-medium universities, which requires substantial validity evidence supporting score interpretation and use.
A revision of the TOEFL was undertaken by Educational Testing Service (ETS) from 1990 through 2005. ETS has standards for validity evidence that are in line with those of the AERA/APA/NCME Standards for Educational and Psychological Testing (1999), and therefore throughout the TOEFL revision, research was conducted that would yield the appropriate evidence for a validity argument in line with the perspectives presented in the Standards. By the late 1990s, a considerable amount of such research needed to be synthesized into a validity argument, and so we turned to the Standards for guidance. This process yielded considerable insight into shortcomings in the guidance provided by the Standards for synthesizing research results into a validity argument. Because the project began with an attempt to use the Standards and ended up instead drawing on the perspectives of Kane (1992, 2001, 2002, 2006) and Kane, Crooks, and Cohen (1999), it revealed contrasts in the two approaches that may be informative in future attempts to update the chapter on validity in the Standards. The validity argument resulting from this process is available elsewhere (Chapelle, Enright, & Jamieson, 2008), but this paper describes the contrasts as four points of comparison between the Standards and Kane’s approach, as summarized in Table 1.
Framing the Intended Score Interpretation
The Standards frame the intended interpretation of the test as a construct which can be defined from many different perspectives. Multiple perspectives were brought to bear on the discussion of construct definition in the TOEFL project as the developers attempted to define a construct of academic English language proficiency based on work in applied linguistics. A strong construct theory—which would place academic English proficiency within a nomological network that could be used to generate clear hypotheses about test performance relative to other constructs and behaviors—did not result from this process, and therefore the construct itself was not a good basis for subsequent research. However, Kane’s organizing concept of an “interpretive argument,” which does not rely on a construct, proved to be useful.
Table 1. Key Aspects in the Process of Validation in the Standards (1999) and in Educational Measurement (Kane, 2006)

Four Aspects Characterizing Approaches to Validity | Standards (1999) | Kane (2006)
Framing the intended score interpretation | A construct | An interpretive argument
Outlining the essential research | Propositions consistent with the intended interpretation | Inferences and their assumptions
Structuring research results into a validity argument | Listing types of evidence | Series of inferences linking grounds with conclusions
Challenging the validity argument | Counterevidence for propositions | Refuting the argument
A Construct
In the Standards, the term “construct” refers broadly to “the concept or characteristic that a test is designed to measure” (1999, p. 5), and “proposed interpretation” is used interchangeably with “construct” (e.g., 1999, p. 9). This approach is consistent with views in educational measurement in the early 1990s that a theoretical construct should provide the basis for score interpretation for a large-scale, high-stakes test (e.g., Messick, 1994). Like experts in other areas, language assessment experts have attempted to capture the dynamic and context-mediated nature of the construct measured by a test such as the TOEFL. Discussions of how to do so (e.g., Stansfield, 1986) ultimately led to the launch of a project to develop a revised TOEFL that would be based on modern conceptions of language proficiency. Designers of the new TOEFL agreed that theoretical rationales underlying score interpretation needed to come from a construct of language proficiency and that therefore this would serve as a basis for test design.
The idea that a language proficiency construct should underlie test development and validation has strong support in language testing. For example, Alderson, Clapham, and Wall (1995) tie the construct to both the test specifications and validation: “For validation purposes, the test specifications need to make the theoretical framework which underlies the test explicit, and to spell out relationships among its constructs, as well as the relationship between the theory and the purpose for which the test is designed” (p. 17). Bachman and Palmer (1996) include definition of the construct in the fourth step of language test development: “...a theoretical definition of the construct ... provides the basis for considering and investigating the construct validity of the interpretations we make of test scores. This theoretical definition also provides a basis for the development ... of test tasks” (p. 89). Weir (2005) ties the construct to validation: “The more fully we are able to describe the construct we are attempting to measure at the a priori stage the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test .... We can never escape from the need to define what is being measured, just as we are obliged to investigate a test in operation” (p. 18).
Despite agreement on the need to define the construct as a basis for test development, no agreement exists concerning a single best way to define constructs of language proficiency to serve as a defensible basis for score interpretation (e.g., Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Oller, 1979; McNamara, 1996). Nevertheless, most would agree that limiting a construct of language proficiency to a trait such as knowledge of vocabulary or listening is too narrow for the interpretations that test users want to make for university admissions decisions. Instead, test users are typically interested in examinees’ ability to use a complex of knowledge and processes to achieve particular goals. Therefore, strategies or processes of language use have been included in constructs of language proficiency, called communicative competence (Canale & Swain, 1980) or communicative language ability (Bachman, 1990). Applied linguists would also agree that language proficiency needs to be defined in terms of contexts of performance because second language learners can be proficient in some contexts but lack proficiency in others (Cummins, 1983). A conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across all contexts fails to account for the variation in performance observed across the different contexts of language use (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). As a result, language proficiency constructs of interest are difficult to define in a precise way (e.g., as construct representation; Embretson, 1983). Bachman (2007) provides an analysis of the issues and the various perspectives on construct definition that have been presented over the past fifty years in applied linguistics.
This historical perspective reveals the challenge the TOEFL project faced in attempting to use a construct definition developed by applied linguists as a basis of the validity argument. But this challenge is not unique to constructs in applied linguistics. Borsboom (2006) points out that “psychological theories are often simply too vague to motivate psychometric models” (p. 437), and without a specified psychometric model underlying score interpretation, what kind of evidence should be sought for the validity argument? When the construct underlying test score interpretation is so complex, how can a validity argument be formulated? After struggling with these questions, we welcomed the different perspective offered by Kane for approaching the problem of score interpretation.
An Interpretive Argument
Kane’s approach does not require a construct per se but rather an explicitly stated interpretation called an interpretive argument (Kane, 1992, p. 527; 2001; Kane et al., 1999). He describes the interpretive argument as consistent with the general principles accepted for construct validity that appear in the Standards: “Validation requires a clear statement of the proposed interpretations and uses” (Kane, 2006, p. 23).
FIGURE 1. Illustration of an interpretive argument comprising an inference about ability based upon grounds.
Rather than relying on formal theories, however, the interpretive argument “specifies the proposed interpretations and uses of test results by laying out the network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the performances” (Kane, 2006, p. 23). “The validity argument provides an evaluation of the interpretive argument” (Kane, 2006, p. 23). In the simplest terms, then, a validity argument is an interpretive argument in which backing has been provided for the assumptions. In view of the complexity of the construct theory underlying the TOEFL interpretation, we were open to exploring a means of establishing a basis for score interpretation that did not rely solely on the concepts of tasks and abilities from psychology or on applied linguistics’ discussion of language ability constructs. Instead, the building blocks of interpretive arguments are the types of inferences identified by Kane. Test developers and researchers wishing to support score interpretation need to identify the inferences upon which score interpretation is to be based.
To illustrate the meaning of inference, a simplified interpretive argument for speaking performance in an English language classroom is shown in Figure 1. It begins with grounds such as the observation that a student’s presentation to a class on an assigned topic was characterized by hesitations and mispronunciations. “Grounds” is the term used by Toulmin, Rieke, and Janik (1984) to denote the basis for making a claim; “data” was used by Toulmin (2003) and has been used by others to refer to the same functional unit of the argument. The claim one might make on the basis of that performance is that the student’s speaking abilities are inadequate for study in an English-medium university. The arrow extending from the grounds to the claim represents an inference. The inference allows for a conclusion, which, in the example, is the claim. The point is that the observation itself cannot mean that the student is unprepared. Instead, an interpretive argument specifies the interpretation to be drawn from the grounds to a claim by an inference. Such inferences play a critical role in current approaches toward developing interpretive and validity arguments (Kane, 1992, 2001; Mislevy, Steinberg, & Almond, 2003), which are based on Toulmin’s (2003) description of informal or practical arguments, such as those used in nonmathematical fields like law.

Interpretive arguments for test scores are constructed through the use of particular types of inferences, such as the evaluation inference shown in Figure 2. In this example, the grounds consist of the observations made of a student’s speaking performance on a task in class. The evaluation inference is made in awarding a score of “2” to the student for that performance. The interpretive argument states explicitly that such an inference is the basis for awarding a score of “2.” This statement about the inference provides the basis for planning and interpreting validity evidence. Validity evidence is needed to support such inferences, which, in this example, could come from evidence showing the rationales for developing the scoring rubric and the consistency with which it was applied.

FIGURE 2. Example grounds and conclusion for an evaluation inference. (Adapted from Chapelle, Enright, and Jamieson, 2008, p. 11. Reprinted courtesy of Taylor and Francis.)
This simple, hypothetical example illustrates the basic approach to an interpretive argument that states the basis of an interpretation without defining a construct. Rather than terms referring to examinee characteristics, the tools of the interpretive argument are inferences that are typically made in the process of measurement. When we pursued this approach in the TOEFL project, we did not entirely eliminate the construct, but rather it became one part of an overall chain of inferences. The set of inferences, which we were able to specify in terms that had direct implications for research, became the organizing units underlying score interpretation. If test developers can work with a set of such inferences rather than solely with the complex constructs that can be defined in many different ways, the basis for score interpretation becomes more manageable.
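The structure of these argument units can be made concrete in code. The following minimal sketch (ours, not Kane’s or the TOEFL program’s; all names and evidence strings are purely illustrative) models a single Toulmin-style inference in Python: grounds are linked to a claim by an inference licensed by a warrant, and the warrant rests on assumptions that each require backing.

from dataclasses import dataclass, field

@dataclass
class Inference:
    """A Toulmin-style step linking grounds to a claim via a warrant."""
    name: str
    grounds: str   # what was observed
    claim: str     # the conclusion drawn from the grounds
    warrant: str   # generally held principle licensing the step
    # assumption text -> list of backing (evidence) gathered for it
    assumptions: dict = field(default_factory=dict)

    def add_backing(self, assumption: str, evidence: str) -> None:
        self.assumptions.setdefault(assumption, []).append(evidence)

    def is_supported(self) -> bool:
        # An inference holds only when every assumption has backing.
        return bool(self.assumptions) and all(self.assumptions.values())

# The evaluation inference of Figure 2, restated in these terms.
evaluation = Inference(
    name="evaluation",
    grounds="observed speaking performance on a classroom task",
    claim="the performance merits a score of 2",
    warrant="Observations of performance on the speaking task are evaluated "
            "to provide a score reflective of the relevant language abilities.",
    assumptions={
        "the scoring rubric is appropriate for evidencing the ability": [],
        "the rubric is applied consistently": [],
    },
)
evaluation.add_backing(
    "the scoring rubric is appropriate for evidencing the ability",
    "documented rationale for rubric development",
)
evaluation.add_backing(
    "the rubric is applied consistently",
    "rater-consistency study",
)
print(evaluation.is_supported())  # True once every assumption has backing

In these terms, a validity argument is an interpretive argument for which is_supported holds for every inference, which is the sense in which backing must be provided for the assumptions.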
Outlining the Essential Research
In a general sense, the construct or interpretive argument provides a starting point for planning research to be used in a validation argument, but more specifically, within the Standards framework, how does one move from a construct definition of something like “reading comprehension” or “speaking ability” to the design of a study whose results will show that the scores should be interpreted as intended? The Standards advise researchers to begin to consider validation research by generating propositions that would be expected to be true if test scores did in fact reflect the intended construct. In the TOEFL project we attempted to follow this advice, ending up with a list of propositions. Taking Kane’s approach, in contrast, propositions, called warrants, were also generated, but the important difference lies in how those propositions are generated and what they are likely to consist of.
Propositions Consistent with the Intended Interpretation
The Standards direct test developers and researchers to gather “kinds of evidence” that are needed to evaluate the “intended interpretation” of test scores. This guidance, of course, is recognizable as a way of summarizing the dominant views on test validation throughout the 1990s, which assumed that multiple types of evidence should support score interpretation (e.g., Cronbach, 1988; Messick, 1989). The Standards point out that “many lines of evidence can contribute to an understanding of the construct meaning of test scores” (p. 5) and suggest that those lines of evidence can consist of the familiar
categories of evidence based on test content, response processes, internal structure, relations to other variables, and consequences, as outlined by Messick (1989). Each of these lines of evidence suggests methodologies, and there are plenty of examples of these in language testing and elsewhere. However, as Shepard (1993) pointed out, the idea that “many lines of evidence can contribute” offers a large set of options rather than guidance to an efficient and effective path for validation. According to the Standards, “The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for a particular purpose of testing” (AERA/APA/NCME, 1999, p. 9). For example, if one wishes to make the proposition that the test score distinguishes among examinees at different English ability levels, the validation research must provide data indicating that this is actually the case. The proposition guides the researcher to produce supporting evidence. The propositions are to serve as hypotheses about score interpretations, which would provide guidance about the types of validity evidence required. In the Standards, six examples of propositions are given for a mathematics achievement test used to test readiness for an advanced course. They include statements such as “that certain skills are prerequisite for the advanced course” and “that test scores are not unduly influenced by ancillary variables” (p. 9).
Drawing on these examples, we developed the following propositions:

1. Certain language skills defined as listening, reading, speaking, and writing both independently and in combination are necessary (but not sufficient) for students to succeed in advanced academic settings.
2. The content domain of the tasks on the TOEFL requires the English language skills students need to succeed in English-speaking North American university settings.
3. Each of the skills—listening, reading, speaking, and writing—is composed of a set of subskills.
4. Test tasks comprising each skill score exhibit internal consistency.
5. Each of the four skills is distinct enough from each other to be measured independently, but the skills are related by some core competencies.
6. Test performance is not affected by test-taking processes irrelevant to the constructs of interest.
7. Test scores are arrived at through judgments of appropriate aspects of learners’ performance.
8. Test performance is not affected by examinees’ familiarity with computer use.
9. Test performance is not affected inappropriately by background knowledge of the topics represented on the test.
10. The test assesses second language abilities independent of general cognitive abilities.
11. Criterion measures can validly assess the linguistic aspects of academic success.
12. Test scores are positively related to criterion measures of success.
13. Use of the test will result in positive washback in ESL/EFL instruction, such as increased emphasis on speaking and writing and focus on academic language.
We stopped at number 13, recognizing that the list could go on and on unless we gained a better sense of what a proposition should consist of and how many one would like to have for a good validation argument. Moreover, in the absence of such guidelines, we found that the propositions we were generating were influenced by the validation research that had been completed, and were therefore unlikely to help identify areas where more research was needed. On the one hand, contextualizing research was precisely what needed to be done, but on the other hand, this process seemed to start from the perspective of completed research rather than from the perspective of score meaning. In short, our examination of the Standards and the materials that had led up to them demonstrated the need for more explicit guidance on how to formulate an intended interpretation and the propositions that are supposed to point to the types of evidence that would ultimately contribute to the TOEFL validity argument.
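The difficulty we encountered can be restated in the same sketch-like terms as before (again ours, with illustrative entries): a Standards-style plan is a flat collection of propositions awaiting evidence, and nothing in that structure signals when the collection is complete or which entries are missing.

# A Standards-style validation plan: a flat mapping from propositions to
# whatever evidence has been recorded for them. The structure supplies no
# stopping rule, so entries can be appended indefinitely (we stopped at 13).
propositions = {
    "Test tasks comprising each skill exhibit internal consistency": None,
    "Test performance is not affected by familiarity with computer use": None,
    "Test scores are positively related to criterion measures of success": None,
    # ...the list can keep growing with no obvious endpoint
}

def unsupported(plan):
    """Return the propositions still lacking recorded evidence."""
    return [p for p, evidence in plan.items() if not evidence]

propositions["Test tasks comprising each skill exhibit internal consistency"] = \
    "reliability estimates for each skill measure"
print(unsupported(propositions))  # two propositions still await evidence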
Inferences and Their Assumptions
Kane’s approach to identifying framing statements is to connect them to the inferences in the interpretive argument through the use of two types of statements: warrants and assumptions. Taking the example in Figure 2, the evaluation inference would be supported by a warrant such as “Observations of performance on the speaking task are evaluated to provide a score reflective of the relevant language abilities.” Here, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of speaking ability who would have trouble at an American university. Such a warrant is a statement which rests on assumptions that need to be supported in order for the inference to be made. A warrant is a law, a generally held principle, rule of thumb, or established procedure. Assumptions would be, for example, that the rubric for scoring the responses was appropriate for providing the relevant evidence of ability. Assumptions prompt research that focuses on particular issues. In this case, the research would need to provide evidence for the accuracy and relevance of the rating of the student’s performance.
Table 2. Summary of the Inferences and Warrants in the TOEFL Validity Argument with Their Underlying Assumptions

Domain description
Warrant: Observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
Assumptions: 1. Critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified. 2. Assessment tasks that require important skills and are representative of the academic domain can be simulated.

Evaluation
Warrant: Observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities.
Assumptions: 1. Rubrics for scoring responses are appropriate for providing evidence of targeted language abilities. 2. Task administration conditions are appropriate for providing evidence of targeted language abilities. 3. The statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Generalization
Warrant: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters.
Assumptions: 1. A sufficient number of tasks are included on the test to provide stable estimates of test takers’ performances. 2. Configuration of tasks on measures is appropriate for intended interpretation. 3. Appropriate scaling and equating procedures for test scores are used. 4. Task and test specifications are well defined so that parallel tasks and test forms are created.

Explanation
Warrant: Expected scores are attributed to a construct of academic language proficiency.
Assumptions: 1. The linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations. 2. Task difficulty is systematically influenced by task characteristics. 3. Performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically. 4. The internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components. 5. Test performance varies according to amount and quality of experience in learning English.

Extrapolation
Warrant: The construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education.
Assumption: Performance on the test is related to other criteria of language proficiency in the academic context.

Utilization
Warrant: Estimates of the quality of performance in the English-medium institutions of higher education obtained from the TOEFL are useful for making decisions about admissions and appropriate curricula for test takers.
Assumptions: 1. The meaning of test scores is clearly interpretable by admissions officers, test takers, and teachers. 2. The test will have a positive influence on how English is taught.

Adapted from Chapelle, Enright, and Jamieson, 2008, pp. 19–21. Reprinted courtesy of Taylor and Francis.
As shown in Table 2, we identified six inferences, each with a warrant and assumptions, that form the basis for the TOEFL interpretive argument. Each of these inferences is used to move from grounds to a claim; each claim becomes grounds for a subsequent claim. For example, a generalization inference connects the grounds of an observed score which reflects the relevant aspects of performance with a claim that the observed score reflects the expected score across tasks, occasions, and raters. Rather than state all of the grounds and claims, which are linked in a formulaic way to types of inferences, Table 2 focuses on the warrants and assumptions which need to be generated by the researcher to guide the validity research. Discussion of these inferences as the central building blocks for the interpretive argument appears in Kane et al. (1999), and the specific statements used as grounds and claims in the TOEFL validity argument appear in Chapelle (2008). The intended score interpretation is based on a domain description inference, which has a warrant that observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
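Continuing the sketch begun earlier (ours, with illustrative strings), the chain structure of Table 2 can be expressed directly: the claim licensed by each of the six inferences becomes the grounds for the next.

# The TOEFL interpretive argument as a chain: the claim of one inference
# serves as the grounds for the next, from target domain to decisions.
nodes = [
    "performance in the target domain of academic English use",
    "observations of performance on TOEFL tasks",
    "observed score reflecting targeted language abilities",
    "expected score across tasks, forms, and raters",
    "construct of academic language proficiency",
    "expected quality of performance in English-medium institutions",
    "admissions and curricular decisions",
]
inferences = ["domain description", "evaluation", "generalization",
              "explanation", "extrapolation", "utilization"]

for inference, (grounds, claim) in zip(inferences, zip(nodes, nodes[1:])):
    print(f"{grounds}\n  --[{inference}]--> {claim}")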