ERIC - Search Results

Publication Date

In 2025	1
Since 2024	2
Since 2021 (last 5 years)	5
Since 2016 (last 10 years)	13
Since 2006 (last 20 years)	21

Descriptor

Error of Measurement	33
Evaluators	33
Interrater Reliability	11
Scores	11
Scoring	10
Item Response Theory	8
Models	8
Reliability	7
Comparative Analysis	6
English (Second Language)	6
Foreign Countries	6
Rating Scales	6
Accuracy	5
Data Analysis	5
Evaluation Methods	5
Generalizability Theory	5
Language Tests	5
Second Language Learning	5
Classification	4
Correlation	4
Performance Based Assessment	4
Undergraduate Students	4
Writing Evaluation	4
Cutting Scores	3
Educational Assessment	3
More ▼

Publication Type

Journal Articles	26
Reports - Research	22
Reports - Evaluative	7
Reports - Descriptive	2
Speeches/Meeting Papers	2
Dissertations/Theses -…	1
ERIC Digests in Full Text	1
ERIC Publications	1
Tests/Questionnaires	1

Education Level

Higher Education	4
Postsecondary Education	4
Elementary Education	2
Middle Schools	2
Elementary Secondary Education	1
Grade 7	1
Junior High Schools	1
Secondary Education	1

Audience

Researchers

Location

Turkey	2
Australia	1
China	1
Europe	1
Thailand	1

Laws, Policies, & Programs

Assessments and Surveys

Test of English as a Foreign…	2
Alabama High School…	1
Flesch Kincaid Grade Level…	1
International English…	1
Test of English for…	1

What Works Clearinghouse Rating

Showing 1 to 15 of 33 results Save | Export

IRT Observed-Score Equating for Rater-Mediated Assessments Using a Hierarchical Rater Model

Peer reviewed

Direct link

Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025

While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…

Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity

Improving the Precision of Classroom Observation Scores Using a Multi-Rater and Multi-Timepoint Item Response Theory Model

Peer reviewed

Direct link

Kelly Edwards; James Soland – Educational Assessment, 2024

Classroom observational protocols, in which raters observe and score the quality of teachers' instructional practices, are often used to evaluate teachers for consequential purposes despite evidence that scores from such protocols are frequently driven by factors, such as rater and temporal effects, that have little to do with teacher quality. In…

Descriptors: Classroom Observation Techniques, Teacher Evaluation, Accuracy, Scores

Effects of Using Double Ratings as Item Scores on IRT Proficiency Estimation

Peer reviewed

Direct link

Song, Yoon Ah; Lee, Won-Chan – Applied Measurement in Education, 2022

This article presents the performance of item response theory (IRT) models when double ratings are used as item scores over single ratings when rater effects are present. Study 1 examined the influence of the number of ratings on the accuracy of proficiency estimation in the generalized partial credit model (GPCM). Study 2 compared the accuracy of…

Descriptors: Item Response Theory, Item Analysis, Scores, Accuracy

Examining Differential Rater Functioning Using a Between-Subgroup Outfit Approach

Peer reviewed

Direct link

Wind, Stefanie A.; Sebok-Syer, Stefanie S. – Journal of Educational Measurement, 2019

When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of…

Descriptors: Measurement, Models, Evaluators, Simulation

Examining the Precision of Cut Scores within a Generalizability Theory Framework: A Closer Look at the Item Effect

Peer reviewed

Direct link

Clauser, Brian E.; Kane, Michael; Clauser, Jerome C. – Journal of Educational Measurement, 2020

An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item…

Descriptors: Cutting Scores, Generalization, Decision Making, Standard Setting

Examining the Differential Rater Functioning in the Process of Assessing Writing Skills of Middle School 7th Grade Students

Peer reviewed
PDF on ERIC

Download full text

Erman Aslanoglu, Aslihan; Sata, Mehmet – Participatory Educational Research, 2021

When students present writing tasks that require higher order thinking skills to work, one of the most important problems is scoring these writing tasks objectively. The fact that raters give scores below or above their performance based on several environmental factors affects the consistency of the measurements. Inconsistencies in scoring…

Descriptors: Interrater Reliability, Evaluators, Error of Measurement, Writing Evaluation

Investigating the Impact of Rater Training on Rater Errors in the Process of Assessing Writing Skill

Peer reviewed
PDF on ERIC

Download full text

Sata, Mehmet; Karakaya, Ismail – International Journal of Assessment Tools in Education, 2022

In the process of measuring and assessing high-level cognitive skills, interference of rater errors in measurements brings about a constant concern and low objectivity. The main purpose of this study was to investigate the impact of rater training on rater errors in the process of assessing individual performance. The study was conducted with a…

Descriptors: Evaluators, Training, Comparative Analysis, Academic Language

Kappa and Rater Accuracy: Paradigms and Parameters

Peer reviewed

Direct link

Conger, Anthony J. – Educational and Psychological Measurement, 2017

Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…

Descriptors: Interrater Reliability, Evaluators, Accuracy, Statistical Analysis

Mapping the CU-TEP to the Common European Framework of Reference (CEFTR)

Peer reviewed
PDF on ERIC

Download full text

Wudthayagorn, Jirada – LEARN Journal: Language Education and Acquisition Research Network, 2018

The purpose of this study was to map the Chulalongkorn University Test of English Proficiency, or the CU-TEP, to the Common European Framework of Reference (CEFR) by employing a standard setting methodology. Thirteen experts judged 120 items of the CU-TEP using the Yes/No Angoff technique. The experts decided whether or not a borderline student at…

Descriptors: Guidelines, Rating Scales, English (Second Language), Language Tests

Working with Sparse Data in Rated Language Tests: Generalizability Theory Applications

Peer reviewed

Direct link

Lin, Chih-Kai – Language Testing, 2017

Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…

Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy

High-Dimensional Explanatory Random Item Effects Models for Rater-Mediated Assessments

Peer reviewed
PDF on ERIC

Download full text

Kelcey, Ben; Wang, Shanshan; Cox, Kyle – Society for Research on Educational Effectiveness, 2016

Valid and reliable measurement of unobserved latent variables is essential to understanding and improving education. A common and persistent approach to assessing latent constructs in education is the use of rater inferential judgment. The purpose of this study is to develop high-dimensional explanatory random item effects models designed for…

Descriptors: Test Items, Models, Evaluators, Longitudinal Studies

Investigating Correspondence between Language Proficiency Standards and Academic Content Standards: A Generalizability Theory Study

Peer reviewed

Direct link

Lin, Chih-Kai; Zhang, Jinming – Language Testing, 2014

Research on the relationship between English language proficiency standards and academic content standards serves to provide information about the extent to which English language learners (ELLs) are expected to encounter academic language use that facilitates their content learning, such as in mathematics and science. Standards-to-standards…

Descriptors: Language Proficiency, Academic Standards, Generalizability Theory, English Language Learners

Investigating the Application of Automated Writing Evaluation to Chinese Undergraduate English Majors: A Case Study of "WriteToLearn"

Peer reviewed
PDF on ERIC

Download full text

Liu, Sha; Kunnan, Antony John – CALICO Journal, 2016

This study investigated the application of "WriteToLearn" on Chinese undergraduate English majors' essays in terms of its scoring ability and the accuracy of its error feedback. Participants were 163 second-year English majors from a university located in Sichuan province who wrote 326 essays from two writing prompts. Each paper was…

Descriptors: Foreign Countries, Undergraduate Students, English (Second Language), Second Language Learning

Understanding the Growth of ESL Paragraph Writing Skills and Its Relationships with Linguistic Features

Peer reviewed

Direct link

Aryadoust, Vahid – Educational Psychology, 2016

This study sought to examine the development of paragraph writing skills of 116 English as a second language university students over the course of 12 weeks and the relationship between the linguistic features of students' written texts as measured by Coh-Metrix--a computational system for estimating textual features such as cohesion and…

Descriptors: English (Second Language), Second Language Learning, Writing Skills, College Students

Effects of Acoustic Perception of Gender on Nonsampling Errors in Telephone Surveys

Direct link

Kenney McCulloch, Susan – ProQuest LLC, 2012

Many telephone surveys require interviewers to observe and record respondents' gender based solely on respondents' voice. Researchers may rely on these observations to: (1) screen for study eligibility; (2) determine skip patterns; (3) foster interviewer tailoring strategies; (4) contribute to nonresponse assessment and adjustments; (5)…

Descriptors: Telephone Surveys, Gender Differences, Acoustics, Observation

Previous Page | Next Page »

Pages: 1 | 2 | 3

Journal of Educational…	4
Educational and Psychological…	3
Language Testing	2
Applied Measurement in…	1
Applied Psychological…	1
CALICO Journal	1
Canadian Journal of Program…	1
ETS Research Report Series	1
Educational Assessment	1
Educational Measurement:…	1
Educational Psychology	1
Evaluation and the Health…	1
International Journal of…	1
Journal of Clinical Child and…	1
Journal of Counseling…	1
Journal of Educational and…	1
LEARN Journal: Language…	1
Participatory Educational…	1
ProQuest LLC	1
Psychological Methods	1
Psychometrika	1
Society for Research on…	1
More ▼

Lin, Chih-Kai	2
Sata, Mehmet	2
Aryadoust, Vahid	1
Batchelder, William H.	1
Carl Westine	1
Chafouleas, Sandra M.	1
Christ, Theodore J.	1
Clauser, Brian E.	1
Clauser, Jerome C.	1
Conger, Anthony J.	1
Cox, Kyle	1
Dekle, Dawn J.	1
Erman Aslanoglu, Aslihan	1
Evans, Brian	1
Gross, Leon J.	1
Halpin, Glennelle	1
Hoskens, Machteld	1
Hoyt, William T.	1
Ito, Kyoko	1
James Soland	1
Kane, Michael	1
Karakaya, Ismail	1
Karcher, Michael J.	1
Kelcey, Ben	1
More ▼