Dissertation Details
Beyond CTT and IRT: using an interactional measurement model to investigate the decision making process of EPT essay raters
Wang, Xin
Keywords: English as a second language (ESL) writing test; rater decision making; performance assessment; test validity; test reliability
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/49646/Xin_Wang.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
PDF
【 Abstract 】

The current study, a doctoral dissertation, investigates the gap between the nature of ESL performance tests and the score-based analysis tools used in the field of language testing. Its purpose is to propose a new testing model and a new experimental instrument for examining test validity and reliability through raters' decision-making processes in an ESL writing performance test.

A writing test, as a language performance assessment, is a multifaceted entity involving the interaction of various stakeholders, among whom essay raters have a great impact on essay scores through their subjective scoring decisions and hence influence test validity and reliability (Huot, 1990; Lumley, 2002). This understanding creates a demand for methodological tools that quantify the rater decision-making process and the interaction between raters and other stakeholders in a language test. Previous studies within the frameworks of Classical Test Theory (CTT) and Item Response Theory (IRT) have focused mainly on the final rating outcome, retrospective survey data, and/or raters' think-aloud protocols. Owing to the limitations of experimental tools, very few studies, if any, have directly examined the moment-to-moment process by which essay raters reach their scoring decisions, or the rater-text interaction itself.

The present study proposes a behavioral model for writing performance tests that investigates raters' scoring behavior and reading comprehension together with the final essay score. Although the focus of the study is writing assessment, the methodology is applicable to performance-based testing in general. The framework treats a language test as an interaction among the test developer, test taker, test rater, and other stakeholders. In a writing performance test, the interaction between test developer and test taker is realized directly through the test prompt and indirectly through the test score, while the interaction between test taker and test rater is reflected in the written response. The model defines and explores rater reliability and test validity via the interaction between the text (essays written by test takers) and the essay rater. Instead of approaching the success of this interaction indirectly through the final score, the new model directly measures and examines raters' behavior during essay reading and score decision making. Reflecting the "interactional" nature of a performance test, the new model is named the Interactional Testing Model (ITM).

To examine online evidence of rater decision making, a computer-based rating interface was designed for this study to automatically collect time-by-location information on raters' reading patterns, text comprehension, and other scoring events. The interface measured three groups of variables representing essay features and raters' dynamic scoring process: 1) Reading pattern: raters' reading rate, their go-back rate within and across paragraphs, and the time-by-location information of their sentence selections. 2) Reading comprehension and scoring behaviors: the time-by-location information of raters' verbatim annotations and comments, their essay score assignments, and their answers to survey questions. 3) Essay features.
The essays used in the experiment were processed and analyzed with Python and SAS with respect to the following variables: a) word frequency, b) essay length, c) total number of subject-verb mismatches as an indicator of syntactic anomaly, d) total number of clauses and sentence length as indicators of syntactic complexity, e) total number and location of inconsistent anaphoric referents as indicators of discourse incoherence, and f) density and word frequency of sentence connectors as indicators of discourse coherence (see the illustrative sketches following the abstract). The relation between these variables and raters' decision making was investigated both qualitatively and quantitatively.

Results from the current study address the following themes:

1) Rater reliability: Rater differences occurred not only in score assignment but also in text reading and scoring focus. The inter-rater reliability results coincided with findings from raters' reading times and reading patterns: raters with a high reading rate and a low reading-digression rate were less reliable.

2) Test validity: Rater attention was distributed unevenly across an essay and concentrated on features associated with "Idea Development". Raters' sentence annotations and scoring comments also demonstrated a common focus on this scoring dimension.

3) Rater decision making: Most raters demonstrated a linear reading pattern during text reading and essay grading. A rater-text interaction was observed: raters' reading times and essay scores were strongly correlated with certain essay features. A difference between trained and untrained raters was also observed; untrained raters tended to over-emphasize the importance of "grammar and lexical choice".

As a descriptive framework for the study of rating, the new measurement model has both practical and theoretical significance. On the practical side, it may inform the development of the following research domains: 1) Rating validity and rater reliability: in addition to examining raters' final score assignments, the ITM provides a quality-control tool for ensuring that a rater follows the rating rubric and assigns test scores in a consistent manner. 2) Electronic essay grading: results from this study may inform the design and validation of automated rating engines in writing assessment. On the theoretical side, as a supplement to IRT and CTT, the model may enable researchers to go beyond simple post hoc analysis of test scores and gain a deeper understanding of raters' decision-making processes in the context of a writing test.
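Note on the essay-feature variables: the abstract states that features were extracted with Python and SAS but does not reproduce the scripts. The following is only a minimal Python sketch, under assumed simplifications, of how the simpler surface features (essay length, within-essay word frequency, mean sentence length, connector density) might be computed; the connector list, tokenization rules, and the function name essay_features are illustrative assumptions, and features such as subject-verb mismatch or inconsistent anaphora would additionally require syntactic and coreference analysis not shown here.

import re
from collections import Counter

# Hypothetical single-word connector list, for illustration only; multiword
# connectors (e.g., "in addition") would need phrase matching instead.
CONNECTORS = {"however", "therefore", "moreover", "furthermore", "thus",
              "consequently", "nevertheless", "besides", "hence"}

def essay_features(text, freq_table=None):
    # Split into sentences on end punctuation and into lowercase word tokens.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)                                   # b) essay length
    counts = Counter(words)
    # a) word frequency: mean frequency per token, taken from an external
    # corpus table if supplied, otherwise from the essay itself (assumption).
    table = freq_table if freq_table is not None else counts
    mean_freq = sum(table.get(w, 0) for w in words) / max(n_words, 1)
    # d) mean sentence length as a rough proxy for syntactic complexity.
    mean_sent_len = n_words / max(len(sentences), 1)
    # f) density of sentence connectors per word token.
    connector_density = sum(counts[c] for c in CONNECTORS) / max(n_words, 1)
    return {"essay_length": n_words,
            "mean_word_frequency": mean_freq,
            "mean_sentence_length": mean_sent_len,
            "connector_density": connector_density}

if __name__ == "__main__":
    sample = ("The writer states a claim. However, the support is thin. "
              "Therefore, the paragraph feels underdeveloped.")
    print(essay_features(sample))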
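Note on the reliability and rater-text correlation findings: these are reported only as outcomes. As a hedged illustration of the kind of score-based check they imply, the sketch below computes Pearson's r between two raters' scores and between scores and an essay feature; the data are invented, and the dissertation's actual analyses (run in SAS) may have used different statistics, such as kappa or generalizability coefficients.

from statistics import mean

def pearson_r(x, y):
    # Pearson correlation coefficient for two equal-length numeric sequences.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x) ** 0.5
    ssy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (ssx * ssy)

# Invented scores from two raters on the same five essays (1-6 scale).
rater_a = [4, 5, 3, 6, 2]
rater_b = [4, 4, 3, 5, 2]
print("inter-rater r:", round(pearson_r(rater_a, rater_b), 3))

# Invented essay lengths for the same essays, illustrating the kind of
# score-feature correlation reported for the rater-text interaction.
essay_length = [310, 355, 248, 402, 190]
print("score-length r:", round(pearson_r(rater_a, essay_length), 3))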

【 Preview 】
Attachment List
Files Size Format View
Beyond CTT and IRT: using an interactional measurement model to investigate the decision making process of EPT essay raters 1979KB PDF download