Educational Measurement: Issues and Practice
Spring 2010, Vol. 29, No. 1, pp. 3–13
Does an Argument-Based Approach to Validity Make a Difference?
Carol A. Chapelle, Iowa State University, Mary K. Enright, Educational Testing Service, Joan Jamieson, Northern Arizona University
Drawing on experience between 2000 and 2007 in developing a validity argument for the high-stakes Test of English as a Foreign Language™ (TOEFL®), this paper evaluates the differences between the argument-based approach to validity as presented by Kane (2006) and that described in the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. Based on an analysis of four points of comparison—framing the intended score interpretation, outlining the essential research, structuring research results into a validity argument, and challenging the validity argument—we conclude that an argument-based approach to validity introduces some new and useful concepts and practices.
Keywords: claim, construct, high-stakes, inference, interpretive argument, Standards, TOEFL, validity argument

Author note: Carol A. Chapelle is a Professor, Department of English, Iowa State University, 203 Ross Hall, Ames, IA 50011; carolc@iastate.edu. Mary K. Enright is Research Director, Center for Validity Research, Educational Testing Service, Rosedale Road MS 10-R, Princeton, NJ 08541. Joan Jamieson is a Professor, English Department, Northern Arizona University, Box 6032, Flagstaff, AZ 86011.
Attempts to improve professional practice in validation have resulted in seemingly new perspectives appearing in the fourth edition of Educational Measurement. In the introduction to the volume, Brennan (2006) contrasts the presentation on validity given by Kane (2006) with that in the previous edition: in 1989 the chapter was an extensive scholarly treatment rather than one “that provides much specific guidance to those who would undertake validation studies” (p. 2). In contrast, the chapter in the 2006 edition “aims at making validation a more accessible enterprise for educational measurement practitioners” (p. 3). Practitioners reading the chapter will notice a change from what appears in both the third edition of Educational Measurement (Linn, 1989) and the 1999 AERA/APA/NCME Standards for Educational and Psychological Testing. The introduction of new concepts and frameworks in the 2006 chapter raises the question of whether or not Kane really offers new insights into validation and achieves the aim of making validation more accessible. We address this question in this paper by drawing on our experience between 2000 and 2007 in developing a validity argument for the Test of English as a Foreign Language™ (TOEFL®), a test used for high-stakes decisions about admissions to English-medium universities, which requires substantial validity evidence supporting score interpretation and use.
A revision of the TOEFL was undertaken by Educational Testing Service (ETS) from 1990 through 2005. ETS has standards for validity evidence that are in line with those of the AERA/APA/NCME Standards for Educational and Psychological Testing (1999), and therefore throughout the TOEFL revision, research was conducted that would yield the appropriate evidence for a validity argument in line with the perspectives presented in the Standards. By the late 1990s, a considerable amount of such research needed to be synthesized into a validity argument, and so we turned to the Standards for guidance. This process yielded considerable insight into shortcomings in the guidance provided by the Standards for synthesizing research results into a validity argument. Because the project began with an attempt to use the Standards and ended up instead drawing on the perspectives of Kane (1992, 2001, 2002, 2006) and Kane, Crooks, and Cohen (1999), it revealed contrasts in the two approaches that may be informative in future attempts to update the chapter on validity in the Standards. The validity argument resulting from this process is available elsewhere (Chapelle, Enright, & Jamieson, 2008), but this paper describes the contrasts as four points of comparison between the Standards and Kane’s approach, as summarized in Table 1.
Framing the Intended Score Interpretation
The Standards frame the intended interpretation of the test as a construct which can be defined from many different perspectives. Multiple perspectives were brought to bear on the discussion of construct definition in the TOEFL project as the developers attempted to define a construct of academic English language proficiency based on work in applied linguistics. A strong construct theory—which would place academic English proficiency within a nomological network that could be used to generate clear hypotheses about test performance relative to other constructs and behaviors—did not result from this process, and therefore the construct itself was not a good basis for subsequent research. However, Kane’s organizing concept of an “interpretive argument,” which does not rely on a construct, proved to be useful.
Table 1. Key Aspects in the Process of Validation in the Standards (1999) and in Educational Measurement (Kane, 2006)

Four Aspects Characterizing Approaches to Validity | Standards (1999) | Kane (2006)
Framing the intended score interpretation | A construct | An interpretive argument
Outlining the essential research | Propositions consistent with the intended interpretation | Inferences and their assumptions
Structuring research results into a validity argument | Listing types of evidence | Series of inferences linking grounds with conclusions
Challenging the validity argument | Counterevidence for propositions | Refuting the argument
A Construct
In the Standards, the term “construct” refers broadly to “the concept or characteristic that a test is designed to measure” (1999, p. 5), and “proposed interpretation” is used interchangeably with “construct” (e.g., 1999, p. 9). This approach is consistent with views in educational measurement in the early 1990s that a theoretical construct should provide the basis for score interpretation for a large-scale, high-stakes test (e.g., Messick, 1994). Like experts in other areas, language assessment experts have attempted to capture the dynamic and context-mediated nature of the construct measured by a test such as the TOEFL. Discussions of how to do so (e.g., Stansfield, 1986) ultimately led to the launch of a project to develop a revised TOEFL that would be based on modern conceptions of language proficiency. Designers of the new TOEFL agreed that theoretical rationales underlying score interpretation needed to come from a construct of language proficiency and that therefore this would serve as a basis for test design.
The idea that a language proficiency construct should underlie test development and validation has strong support in language testing. For example, Alderson, Clapham, and Wall (1995) tie the construct to both the test specifications and validation: “For validation purposes, the test specifications need to make the theoretical framework which underlies the test explicit, and to spell out relationships among its constructs, as well as the relationship between the theory and the purpose for which the test is designed” (p. 17). Bachman and Palmer (1996) include definition of the construct in the fourth step of language test development: “...a theoretical definition of the construct ... provides the basis for considering and investigating the construct validity of the interpretations we make of test scores. This theoretical definition also provides a basis for the development ... of test tasks” (p. 89). Weir (2005) ties the construct to validation: “The more fully we are able to describe the construct we are attempting to measure at the a priori stage the more meaningful might be the statistical procedures contributing to construct validation that can subsequently be applied to the results of the test .... We can never escape from the need to define what is being measured, just as we are obliged to investigate a test in operation” (p. 18).
Despite agreement on the need to define the construct as a basis for test development, no agreement exists concerning a single best way to define constructs of language proficiency to serve as a defensible basis for score interpretation (e.g., Bachman, 1990; Bachman & Palmer, 1996; Chapelle, 1998; Chalhoub-Deville, 1997, 2001; Oller, 1979; McNamara, 1996). Nevertheless, most would agree that limiting a construct of language proficiency to a trait such as knowledge of vocabulary or listening is too narrow for the interpretations that test users want to make for university admissions decisions. Instead, test users are typically interested in examinees’ ability to use a complex of knowledge and processes to achieve particular goals. Therefore, strategies or processes of language use have been included in constructs of language proficiency, called communicative competence (Canale & Swain, 1980) or communicative language ability (Bachman, 1990). Applied linguists would also agree that language proficiency needs to be defined in terms of contexts of performance because second language learners can be proficient in some contexts but lack proficiency in others (Cummins, 1983). A conceptualization of language proficiency that recognizes one trait (or even a complex of abilities) as responsible for performance across all contexts fails to account for the variation in performance observed across the different contexts of language use (Bachman & Palmer, 1996; Chalhoub-Deville, 1997; Chapelle, 1998; McNamara, 1996; Norris, Brown, Hudson, & Bonk, 2002; Skehan, 1998). As a result, language proficiency constructs of interest are difficult to define in a precise way (e.g., as construct representation; Embretson, 1983). Bachman (2007) provides an analysis of the issues and the various perspectives on construct definition that have been presented over the past fifty years in applied linguistics.
This historical perspective reveals the challenge the TOEFL project faced in attempting to use a construct definition developed by applied linguists as a basis of the validity argument. But this challenge is not unique to constructs in applied linguistics. Borsboom (2006) points out that “psychological theories are often simply too vague to motivate psychometric models” (p. 437), and without a specified psychometric model underlying score interpretation, what kind of evidence should be sought for the validity argument? When the construct underlying test score interpretation is so complex, how can a validity argument be formulated? After struggling with these questions, we welcomed the different perspective offered by Kane for approaching the problem of score interpretation.
An Interpretive Argument
Kane’s approach does not require a construct per se but rather an explicitly stated interpretation called an interpretive argument (Kane, 1992, p. 527; 2001; Kane et al., 1999). He describes the interpretive argument as consistent with the general principles accepted for construct validity that appear in the Standards: “Validation requires a clear statement of the proposed interpretations and uses” (Kane, 2006, p. 23).
FIGURE 1. Illustration of an interpretive argument comprising an inference about ability based upon grounds.
Rather than relying on formal theories, however, the interpretive argument “specifies the proposed interpretations and uses of test results by laying out the network of inferences and assumptions leading from the observed performances to the conclusions and decisions based on the performances” (Kane, 2006, p. 23). “The validity argument provides an evaluation of the interpretive argument” (Kane, 2006, p. 23). In the simplest terms, then, a validity argument is an interpretive argument in which backing has been provided for the assumptions. In view of the complexity of the construct theory underlying the TOEFL interpretation, we were open to exploring a means of establishing a basis for score interpretation that did not rely solely on the concepts of tasks and abilities from psychology or on applied linguistics’ discussion of language ability constructs. Instead, the building blocks of interpretive arguments are the types of inferences identified by Kane. Test developers and researchers wishing to support score interpretation need to identify the inferences upon which score interpretation is to be based.
To illustrate the meaning of inference, a simplified interpretive argument for speaking performance in an English language classroom is shown in Figure 1. It begins with grounds such as the observation that a student’s presentation to a class on an assigned topic was characterized by hesitations and mispronunciations. “Grounds” is the term used by Toulmin, Rieke, and Janik (1984) to denote the basis for making a claim; “data” was used by Toulmin (2003) and has been used by others to refer to the same functional unit of the argument. The claim one might make on the basis of that performance is that the student’s speaking abilities are inadequate for study in an English-medium university. The arrow extending from the grounds to the claim represents an inference. The inference allows for a conclusion, which, in the example, is the claim. The point is that the observation itself cannot mean that the student is unprepared. Instead, an interpretive argument specifies the interpretation to be drawn from the grounds to a claim by an inference. Such inferences play a critical role in current approaches toward developing interpretive and validity arguments (Kane, 1992, 2001; Mislevy, Steinberg, & Almond, 2003), which are based on Toulmin’s (2003) description of informal or practical arguments, such as those used in nonmathematical fields like law.

Interpretive arguments for test scores are constructed through the use of particular types of inferences, such as the evaluation inference shown in Figure 2. In this example, the grounds consist of the observations made of a student’s speaking performance on a task in class. The evaluation inference is made in awarding a score of “2” to the student for that performance. The interpretive argument states explicitly that such an inference is the basis for awarding a score of “2.” This statement about the inference provides the basis for planning and interpreting validity evidence. Validity evidence is needed to support such inferences, which, in this example, could come from evidence showing the rationales for developing the scoring rubric and the consistency with which it was applied.

FIGURE 2. Example grounds and conclusion for an evaluation inference. (Adapted from Chapelle, Enright, and Jamieson, 2008, p. 11. Reprinted courtesy of Taylor and Francis.)
This simple, hypothetical example illustrates the basic approach to an interpretive argument that states the basis of an interpretation without defining a construct. Rather than terms referring to examinee characteristics, the tools of the interpretive argument are inferences that are typically made in the process of measurement. When we pursued this approach in the TOEFL project, we did not entirely eliminate the construct, but rather it became one part of an overall chain of inferences. The set of inferences, which we were able to specify in terms that had direct implications for research, became the organizing units underlying score interpretation. If test developers can work with a set of such inferences rather than solely with the complex constructs that can be defined in many different ways, the basis for score interpretation becomes more manageable.
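The structure of these argument units can be made concrete in code. The following minimal sketch (ours, not Kane’s or the TOEFL program’s; all names and evidence strings are purely illustrative) models a single Toulmin-style inference in Python: grounds are linked to a claim by an inference licensed by a warrant, and the warrant rests on assumptions that each require backing.

from dataclasses import dataclass, field

@dataclass
class Inference:
    """A Toulmin-style step linking grounds to a claim via a warrant."""
    name: str
    grounds: str   # what was observed
    claim: str     # the conclusion drawn from the grounds
    warrant: str   # generally held principle licensing the step
    # assumption text -> list of backing (evidence) gathered for it
    assumptions: dict = field(default_factory=dict)

    def add_backing(self, assumption: str, evidence: str) -> None:
        self.assumptions.setdefault(assumption, []).append(evidence)

    def is_supported(self) -> bool:
        # An inference holds only when every assumption has backing.
        return bool(self.assumptions) and all(self.assumptions.values())

# The evaluation inference of Figure 2, restated in these terms.
evaluation = Inference(
    name="evaluation",
    grounds="observed speaking performance on a classroom task",
    claim="the performance merits a score of 2",
    warrant="Observations of performance on the speaking task are evaluated "
            "to provide a score reflective of the relevant language abilities.",
    assumptions={
        "the scoring rubric is appropriate for evidencing the ability": [],
        "the rubric is applied consistently": [],
    },
)
evaluation.add_backing(
    "the scoring rubric is appropriate for evidencing the ability",
    "documented rationale for rubric development",
)
evaluation.add_backing(
    "the rubric is applied consistently",
    "rater-consistency study",
)
print(evaluation.is_supported())  # True once every assumption has backing

In these terms, a validity argument is an interpretive argument for which is_supported holds for every inference, which is the sense in which backing must be provided for the assumptions.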
Outlining the Essential Research
In a general sense, the construct or interpretive argument provides a starting point for planning research to be used in a validation argument, but more specifically, within the Standards framework, how does one move from a construct definition of something like “reading comprehension” or “speaking ability” to the design of a study whose results will show that the scores should be interpreted as intended? The Standards advise researchers to begin to consider validation research by generating propositions that would be expected to be true if test scores did in fact reflect the intended construct. In the TOEFL project we attempted to follow this advice, ending up with a list of propositions. Taking Kane’s approach, in contrast, propositions, called warrants, were also generated, but the important difference lies in how those propositions are generated and what they are likely to consist of.
Propositions Consistent with the Intended Interpretation
The Standards direct test developers and researchers to gather “kinds of evidence” that are needed to evaluate the “intended interpretation” of test scores. This guidance, of course, is recognizable as a way of summarizing the dominant views on test validation throughout the 1990s, which assumed that multiple types of evidence should support score interpretation (e.g., Cronbach, 1988; Messick, 1989). The Standards point out that “many lines of evidence can contribute to an understanding of the construct meaning of test scores” (p. 5) and suggest that those lines of evidence can consist of the familiar
categories of evidence based on test content, response processes, internal structure, relations to other variables, and consequences, as outlined by Messick (1989). Each of these lines of evidence suggests methodologies, and there are plenty of examples of these in language testing and elsewhere. However, as Shepard (1993) pointed out, the idea that “many lines of evidence can contribute” offers a large set of options rather than guidance to an efficient and effective path for validation. According to the Standards, “The decision about what types of evidence are important for validation in each instance can be clarified by developing a set of propositions that support the proposed interpretation for a particular purpose of testing” (AERA/APA/NCME, 1999, p. 9). For example, if one wishes to make the proposition that the test score distinguishes among examinees at different English ability levels, the validation research must provide data indicating that this is actually the case. The proposition guides the researcher to produce supporting evidence. The propositions are to serve as hypotheses about score interpretations, which would provide guidance about the types of validity evidence required. In the Standards, six examples of propositions are given for a mathematics achievement test used to test readiness for an advanced course. They include statements such as “that certain skills are prerequisite for the advanced course” and “that test scores are not unduly influenced by ancillary variables” (p. 9).
Drawing on these examples, we developed the following propositions:

1. Certain language skills defined as listening, reading, speaking, and writing both independently and in combination are necessary (but not sufficient) for students to succeed in advanced academic settings.
2. The content domain of the tasks on the TOEFL requires the English language skills students need to succeed in English-speaking North American university settings.
3. Each of the skills—listening, reading, speaking, and writing—is composed of a set of subskills.
4. Test tasks comprising each skill score exhibit internal consistency.
5. Each of the four skills is distinct enough from each other to be measured independently, but the skills are related by some core competencies.
6. Test performance is not affected by test-taking processes irrelevant to the constructs of interest.
7. Test scores are arrived at through judgments of appropriate aspects of learners’ performance.
8. Test performance is not affected by examinees’ familiarity with computer use.
9. Test performance is not affected inappropriately by background knowledge of the topics represented on the test.
10. The test assesses second language abilities independent of general cognitive abilities.
11. Criterion measures can validly assess the linguistic aspects of academic success.
12. Test scores are positively related to criterion measures of success.
13. Use of the test will result in positive washback in ESL/EFL instruction, such as increased emphasis on speaking and writing and focus on academic language.
We stopped at number 13, recognizing that the list could go on and on unless we gained a better sense of what a proposition should consist of and how many one would like to have for a good validation argument. Moreover, in the absence of such guidelines, we found that the propositions we were generating were influenced by the validation research that had been completed, and were therefore unlikely to help identify areas where more research was needed. On the one hand, contextualizing research was precisely what needed to be done, but on the other hand, this process seemed to start from the perspective of completed research rather than from the perspective of score meaning. In short, our examination of the Standards and the materials that had led up to them demonstrated the need for more explicit guidance on how to formulate an intended interpretation and the propositions that are supposed to point to the types of evidence that would ultimately contribute to the TOEFL validity argument.
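The difficulty we encountered can be restated in the same sketch-like terms as before (again ours, with illustrative entries): a Standards-style plan is a flat collection of propositions awaiting evidence, and nothing in that structure signals when the collection is complete or which entries are missing.

# A Standards-style validation plan: a flat mapping from propositions to
# whatever evidence has been recorded for them. The structure supplies no
# stopping rule, so entries can be appended indefinitely (we stopped at 13).
propositions = {
    "Test tasks comprising each skill exhibit internal consistency": None,
    "Test performance is not affected by familiarity with computer use": None,
    "Test scores are positively related to criterion measures of success": None,
    # ...the list can keep growing with no obvious endpoint
}

def unsupported(plan):
    """Return the propositions still lacking recorded evidence."""
    return [p for p, evidence in plan.items() if not evidence]

propositions["Test tasks comprising each skill exhibit internal consistency"] = \
    "reliability estimates for each skill measure"
print(unsupported(propositions))  # two propositions still await evidence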
Inferences and Their Assumptions
Kane’s approach to identifying framing statements is to connect them to the inferences in the interpretive argument through the use of two types of statements: warrants and assumptions. Taking the example in Figure 2, the evaluation inference would be supported by a warrant such as “Observations of performance on the speaking task are evaluated to provide a score reflective of the relevant language abilities.” Here, the warrant is the generally held principle that hesitations and mispronunciations are characteristics of students with low levels of speaking ability who would have trouble at an American university. Such a warrant is a statement which rests on assumptions that need to be supported in order for the inference to be made. A warrant is a law, a generally held principle, rule of thumb, or established procedure. Assumptions would be, for example, that the rubric for scoring the responses was appropriate for providing the relevant evidence of ability. Assumptions prompt research that focuses on particular issues. In this case, the research would need to provide evidence for the accuracy and relevance of the rating of the student’s performance.
Table 2. Summary of the Inferences and Warrants in the TOEFL Validity Argument with Their Underlying Assumptions

Domain description
Warrant: Observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
Assumptions: 1. Critical English language skills, knowledge, and processes needed for study in English-medium colleges and universities can be identified. 2. Assessment tasks that require important skills and are representative of the academic domain can be simulated.

Evaluation
Warrant: Observations of performance on TOEFL tasks are evaluated to provide observed scores reflective of targeted language abilities.
Assumptions: 1. Rubrics for scoring responses are appropriate for providing evidence of targeted language abilities. 2. Task administration conditions are appropriate for providing evidence of targeted language abilities. 3. The statistical characteristics of items, measures, and test forms are appropriate for norm-referenced decisions.

Generalization
Warrant: Observed scores are estimates of expected scores over the relevant parallel versions of tasks and test forms and across raters.
Assumptions: 1. A sufficient number of tasks are included on the test to provide stable estimates of test takers’ performances. 2. Configuration of tasks on measures is appropriate for intended interpretation. 3. Appropriate scaling and equating procedures for test scores are used. 4. Task and test specifications are well defined so that parallel tasks and test forms are created.

Explanation
Warrant: Expected scores are attributed to a construct of academic language proficiency.
Assumptions: 1. The linguistic knowledge, processes, and strategies required to successfully complete tasks vary across tasks in keeping with theoretical expectations. 2. Task difficulty is systematically influenced by task characteristics. 3. Performance on new test measures relates to performance on other test-based measures of language proficiency as expected theoretically. 4. The internal structure of the test scores is consistent with a theoretical view of language proficiency as a number of highly interrelated components. 5. Test performance varies according to amount and quality of experience in learning English.

Extrapolation
Warrant: The construct of academic language proficiency as assessed by TOEFL accounts for the quality of linguistic performance in English-medium institutions of higher education.
Assumption: Performance on the test is related to other criteria of language proficiency in the academic context.

Utilization
Warrant: Estimates of the quality of performance in the English-medium institutions of higher education obtained from the TOEFL are useful for making decisions about admissions and appropriate curricula for test takers.
Assumptions: 1. The meaning of test scores is clearly interpretable by admissions officers, test takers, and teachers. 2. The test will have a positive influence on how English is taught.

Adapted from Chapelle, Enright, and Jamieson, 2008, pp. 19–21. Reprinted courtesy of Taylor and Francis.
As shown in Table 2, we identified six inferences, each with a warrant and assumptions, that form the basis for the TOEFL interpretive argument. Each of these inferences is used to move from grounds to a claim; each claim becomes grounds for a subsequent claim. For example, a generalization inference connects the grounds of an observed score which reflects the relevant aspects of performance with a claim that the observed score reflects the expected score across tasks, occasions, and raters. Rather than state all of the grounds and claims, which are linked in a formulaic way to types of inferences, Table 2 focuses on the warrants and assumptions which need to be generated by the researcher to guide the validity research. Discussion of these inferences as the central building blocks for the interpretive argument appears in Kane et al. (1999), and the specific statements used as grounds and claims in the TOEFL validity argument appear in Chapelle (2008). The intended score interpretation is based on a domain description inference, which has a warrant that observations of performance on the TOEFL reveal relevant knowledge, skills, and abilities in situations representative of those in the target domain of language use in the English-medium institutions of higher education.
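Continuing the sketch begun earlier (ours, with illustrative strings), the chain structure of Table 2 can be expressed directly: the claim licensed by each of the six inferences becomes the grounds for the next.

# The TOEFL interpretive argument as a chain: the claim of one inference
# serves as the grounds for the next, from target domain to decisions.
nodes = [
    "performance in the target domain of academic English use",
    "observations of performance on TOEFL tasks",
    "observed score reflecting targeted language abilities",
    "expected score across tasks, forms, and raters",
    "construct of academic language proficiency",
    "expected quality of performance in English-medium institutions",
    "admissions and curricular decisions",
]
inferences = ["domain description", "evaluation", "generalization",
              "explanation", "extrapolation", "utilization"]

for inference, (grounds, claim) in zip(inferences, zip(nodes, nodes[1:])):
    print(f"{grounds}\n  --[{inference}]--> {claim}")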