Investigating the Suitability of Implementing the e-rater® Scoring Engine in a Large-Scale English Language Testing Program

Authors

Mo Zhang

F. Jay Breyer

Florian Lorenz

Beata Beigman Klebanov

Heather Buzick

Keelan Evanini

Ruth Greenwood

Shelby J. Haberman

Abstract

Since its 1947 founding, ETS has conducted and disseminated scientific research to support its products and services, and to advance the measurement and education fields. In keeping with these goals, ETS is committed to making its research freely available to the professional community and to the general public. Published accounts of ETS research, including papers in the ETS Research Report series, undergo a formal peer-review process by ETS staff to ensure that they meet established scientific and professional standards. All such ETS-conducted peer reviews are in addition to any reviews that outside organizations may provide as part of their own publication processes. Peer review notwithstanding, the positions expressed in the ETS Research Report series and other published accounts of ETS research are those of the authors and not necessarily those of the Officers and Trustees of Educational Testing Service.

In this research, we investigated the suitability of implementing e-rater® automated essay scoring in a high-stakes, large-scale English language testing program. We examined the effectiveness of generic scoring and 2 variants of prompt-based scoring approaches. Effectiveness was evaluated on a number of dimensions, including agreement between the automated and human scores and relations with criterion variables. Results showed that the sample size was generally not sufficient for prompt-specific scoring. For the generic scoring model, automated scores agreed with human raters as strongly as, or more strongly than, human raters agreed with one another for more than 97% of the prompts. Substituting e-rater for the second human rater had no practically important impact on test takers' scores at either the item or the total test score level. However, neither the automated scoring models nor the human raters performed invariantly across all prompts or across different test countries/territories. Further investigation indicated homogeneity in the examinee population, possibly nested within test countries/territories, as one potential cause of this lack of invariance. Among other limitations, findings may not be generalizable beyond the examinee population investigated in this study.

Similar resources

A Study of Raters’ Behavior in Scoring L2 Speaking Performance: Using Rater Discussion as a Training Tool

The studies conducted so far on the effectiveness of resolution methods including the discussion method in resolving discrepancies in rating have yielded mixed results. What is left unnoticed in the literature is the potential of discussion to be used as a training tool rather than a resolution method. The present study addresses this research gap by exploring the data coming from rating behavi...

Score Generalizability of Writing Assessment: the Effect of Rater’s Gender

The score reliability of language performance tests has attracted increasing interest. Classical Test Theory cannot examine multiple sources of measurement error. Generalizability theory extends Classical Test Theory to provide a practical framework to identify and estimate multiple factors contributing to the total variance of measurement. Generalizability theory by using analysis of variance ...

Investigating the Effect of Self-, Peer-, and Teacher Assessment in Second Language Writing over Time: A Multifaceted Rasch Approach

This study investigated the accuracy of scores assigned by self-, peer-, and teacher assessors over time. Thirty-three English majors who were taking a paragraph development course at Vali-e-Asr University of Rafsanjan, and two instructors who had been teaching essay writing for at least two years at university, participated in the study. After receiving instructions on paragraph development, part...

Automated Essay Scoring For Nonnative English Speakers

The e-rater system is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between human readers, and between independent human readers and e-rater, is approximately 92%. There is much interest in the larger writing community in examining the system’s performance on nonnative speaker essays. This paper focuses on results of a stud...

Fundamental Reform Document of Education and ELT Program: The Investigation of Language Teachers’ Perspectives

The purpose of the current attitudinal study is to investigate the attitudes and opinions of language teachers toward the implemented ELT program resulting from the Fundamental Reform Document of Education in the Iranian Ministry of Education. Three items were investigated: Teacher’s Practice, Teacher Training Courses, and Materials. Following the rigorous and systematic proce...

Publication date: 2014
