Publication Date
In 2025 | 2
Since 2024 | 2
Since 2021 (last 5 years) | 3
Since 2016 (last 10 years) | 7
Since 2006 (last 20 years) | 8
Descriptor
Evaluators | 21
Item Response Theory | 6
Cutting Scores | 5
Evaluation Methods | 5
Error of Measurement | 4
Licensing Examinations… | 4
Models | 4
Scoring | 4
Standard Setting | 4
Standard Setting (Scoring) | 4
Comparative Analysis | 3
Source
Journal of Educational Measurement | 21
Author
Norcini, John J. | 3
Plake, Barbara S. | 3
Wind, Stefanie A. | 3
Clauser, Brian E. | 2
Impara, James C. | 2
Alex J. Mechaber | 1
Blok, H. | 1
Brian E. Clauser | 1
Buckendahl, Chad W. | 1
Busch, John Christian | 1
Carl Westine | 1
Publication Type
Journal Articles | 21
Reports - Research | 17
Reports - Evaluative | 4
Information Analyses | 1
Speeches/Meeting Papers | 1
Assessments and Surveys
National Teacher Examinations | 1
Peter Baldwin; Victoria Yaneva; Kai North; Le An Ha; Yiyun Zhou; Alex J. Mechaber; Brian E. Clauser – Journal of Educational Measurement, 2025
Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of…
Descriptors: Artificial Intelligence, Scoring, Computational Linguistics, Accuracy
Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Casabianca, Jodi M.; Donoghue, John R.; Shin, Hyo Jeong; Chao, Szu-Fu; Choi, Ikkyu – Journal of Educational Measurement, 2023
Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. In order to fit such models, the ratings data must be sufficiently connected to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios,…
Descriptors: Item Response Theory, Alternative Assessment, Evaluators, Research Problems
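The connectivity requirement described in the Casabianca et al. abstract can be checked directly from the scoring design before any model is estimated. A minimal sketch, assuming only a list of (rater, response) assignments and using the Python standard library; the data and function name are illustrative. It treats raters and responses as nodes in a bipartite graph, adds an edge for each assigned rating, and tests whether the graph forms a single connected component.

    from collections import defaultdict, deque

    def is_connected_design(assignments):
        """True if every rater and response can be linked to every other one
        through shared ratings, so all rater effects can sit on one scale."""
        graph = defaultdict(set)
        for rater, response in assignments:
            r_node, s_node = ("rater", rater), ("resp", response)
            graph[r_node].add(s_node)
            graph[s_node].add(r_node)
        if not graph:
            return True
        start = next(iter(graph))
        seen, queue = {start}, deque([start])
        while queue:                       # breadth-first search over the bipartite graph
            for nbr in graph[queue.popleft()] - seen:
                seen.add(nbr)
                queue.append(nbr)
        return len(seen) == len(graph)

    # Two disjoint rater pools scoring disjoint responses: not connected
    design = [("R1", "e1"), ("R1", "e2"), ("R2", "e2"), ("R3", "e3"), ("R3", "e4")]
    print(is_connected_design(design))                    # False
    print(is_connected_design(design + [("R2", "e3")]))   # True once a linking rating is added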
Wind, Stefanie A.; Sebok-Syer, Stefanie S. – Journal of Educational Measurement, 2019
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of…
Descriptors: Measurement, Models, Evaluators, Simulation
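The rater fit statistics discussed in this entry are usually infit and outfit mean squares built from standardized residuals. A hedged sketch, assuming the model-expected rating and its variance for each rater-by-response encounter have already been obtained from a fitted measurement model; the numbers below are invented.

    import numpy as np

    def rater_fit(observed, expected, variance):
        """Infit and outfit mean squares for a single rater.
        expected / variance are the model-implied mean and variance of each rating."""
        observed, expected, variance = map(np.asarray, (observed, expected, variance))
        resid = observed - expected
        z_sq = resid ** 2 / variance                   # squared standardized residuals
        outfit = z_sq.mean()                           # unweighted: flags outlying ratings
        infit = (resid ** 2).sum() / variance.sum()    # information-weighted: flags inlying misfit
        return infit, outfit

    obs = [3, 2, 4, 1, 3]
    exp = [2.6, 2.1, 3.4, 1.8, 2.9]
    var = [0.8, 0.7, 0.9, 0.6, 0.8]
    print(rater_fit(obs, exp, var))   # values near 1.0 suggest ratings consistent with the model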
Clauser, Brian E.; Kane, Michael; Clauser, Jerome C. – Journal of Educational Measurement, 2020
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item…
Descriptors: Cutting Scores, Generalization, Decision Making, Standard Setting
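As a concrete version of the judge-related error component described in the Clauser, Kane, and Clauser abstract, the sketch below computes an Angoff cut score from a judges-by-items matrix of estimated proportions and a standard error based only on variability across judges. This is a simplification (the article treats items, and possibly panels, as additional sources of error), and the ratings are invented.

    import numpy as np

    # Hypothetical Angoff judgments: rows = judges, columns = items; each cell is the judged
    # probability that a minimally competent examinee answers the item correctly.
    ratings = np.array([
        [0.60, 0.45, 0.80, 0.55],
        [0.70, 0.50, 0.75, 0.60],
        [0.55, 0.40, 0.85, 0.50],
    ])

    judge_cuts = ratings.sum(axis=1)                 # each judge's implied raw cut score
    cut_score = judge_cuts.mean()                    # panel cut score: mean over judges
    se_judges = judge_cuts.std(ddof=1) / np.sqrt(len(judge_cuts))   # judge-sampling error only

    print(f"cut score = {cut_score:.2f} raw points, SE(judges) = {se_judges:.2f}")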
Wind, Stefanie A.; Jones, Eli – Journal of Educational Measurement, 2019
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater-mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of…
Descriptors: Rating Scales, Models, Evaluators, Data Collection
Peabody, Michael R.; Wind, Stefanie A. – Journal of Educational Measurement, 2019
Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard-setting panels should have the proper qualifications to make the judgments asked…
Descriptors: Standard Setting, Decision Making, Performance Based Assessment, Evaluators
Wang, Wen-Chung; Su, Chi-Ming; Qiu, Xue-Lan – Journal of Educational Measurement, 2014
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning…
Descriptors: Item Response Theory, Interrater Reliability, Models, Correlation

Plake, Barbara S.; Impara, James C.; Irwin, Patrick M. – Journal of Educational Measurement, 2000
Examined intra- and inter-rater consistency of item performance estimated from an Angoff standard setting over 2 years, with 29 panelists one year, and 30 the next. Results provide evidence that item performance estimates were consistent within and across panels within and across years. Factors that might have influenced this high degree of…
Descriptors: Evaluators, Prediction, Reliability, Standard Setting

Engelhard, George, Jr. – Journal of Educational Measurement, 1996
A new method for evaluating rater accuracy within the context of performance assessments is described. It uses an extended Rasch measurement model, FACETS, which is illustrated with 373 benchmark papers from the Georgia High School Graduation Writing Test rated by 20 operational raters and an expert panel. (SLD)
Descriptors: Essay Tests, Evaluation Methods, Evaluators, Performance Based Assessment
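For readers unfamiliar with the FACETS formulation, the extended Rasch (many-facet) model referenced here is typically written as an adjacent-category logit with a separate rater facet. A standard rating-scale version, using generic notation rather than the article's own symbols:

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k

where \theta_n is the ability of examinee n, \delta_i the difficulty of domain or item i, \lambda_j the severity of rater j, and \tau_k the threshold for moving from rating category k-1 to category k.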
DeCarlo, Lawrence T. – Journal of Educational Measurement, 2005
An approach to essay grading based on signal detection theory (SDT) is presented. SDT offers a basis for understanding rater behavior with respect to the scoring of construct responses, in that it provides a theory of psychological processes underlying the raters' behavior. The approach also provides measures of the precision of the raters and the…
Descriptors: Validity, Simulation, Grading, Item Response Theory
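In the signal detection framing DeCarlo describes, a rater is modeled as perceiving a noisy version of an essay's latent quality and applying ordered criteria to that perception, so precision corresponds to a discrimination parameter and the criteria capture severity. A small simulation sketch of that idea; the parameter values and names are illustrative and not taken from the article.

    import numpy as np

    rng = np.random.default_rng(0)

    def sdt_rating(true_class, d, criteria, rng):
        """One rating under a simple SDT rater model: the perceived quality is the true
        class scaled by discrimination d plus standard-normal noise; the rating is the
        interval of the ordered criteria into which the perception falls."""
        perception = d * true_class + rng.standard_normal()
        return int(np.searchsorted(criteria, perception))

    criteria = [0.8, 2.4]                      # two criteria -> ratings 0, 1, 2
    truth = [0, 1, 2] * 200
    precise = [sdt_rating(t, 2.0, criteria, rng) for t in truth]
    imprecise = [sdt_rating(t, 0.7, criteria, rng) for t in truth]
    print("agreement, high-precision rater:", np.mean(np.array(precise) == truth))
    print("agreement, low-precision rater:", np.mean(np.array(imprecise) == truth))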

Daly, John A.; Dickson-Markman, Fran – Journal of Educational Measurement, 1982
The effect of the quality of preceding essays on judgments of the quality of a subsequent essay was investigated. Inservice teachers as judges failed to produce consistently and unambiguously biased judgments. Results suggest the presence of a positive bias and the absence of a negative bias. (Author/PN)
Descriptors: Bias, Cognitive Processes, Context Clues, Context Effect

Buckendahl, Chad W.; Smith, Russell W.; Impara, James C.; Plake, Barbara S. – Journal of Educational Measurement, 2002
Compared simplified variations on the Angoff and Bookmark methods for setting cut scores on educational assessments with data from a grade 7 mathematics test (23 panelists in all). Although the Angoff method is more widely used, results show that the Bookmark method has some promising features. (SLD)
Descriptors: Cutting Scores, Educational Assessment, Evaluators, Junior High School Students
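The Bookmark method compared in this study maps a panelist's bookmark placement in a difficulty-ordered item booklet to a point on the ability scale, conventionally at a response probability (RP) of 0.67. Under a Rasch model that mapping has a closed form, theta_cut = b + ln(RP / (1 - RP)), where b is the bookmarked item's difficulty. A brief sketch with invented difficulties; the function and values are hypothetical, not the study's data.

    import math

    RP = 0.67   # response probability criterion commonly used in Bookmark studies

    def bookmark_cut(item_difficulties, bookmark_index, rp=RP):
        """Theta at which the bookmarked item is answered correctly with probability rp,
        assuming a Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
        b = sorted(item_difficulties)[bookmark_index]    # booklet pages ordered by difficulty
        return b + math.log(rp / (1 - rp))

    difficulties = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]       # invented Rasch difficulties
    print(bookmark_cut(difficulties, bookmark_index=3))   # panelist bookmarks the fourth page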

Clauser, Brian E.; Clyman, Stephen G.; Swanson, David B. – Journal of Educational Measurement, 1999
Two studies focused on aspects of the rating process in performance assessment. The first, which involved 15 raters and about 400 medical students, made the "committee" facet of raters working in groups explicit, and the second, which involved about 200 medical students and four raters, made the "rating-occasion" facet…
Descriptors: Error Patterns, Evaluation Methods, Evaluators, Higher Education

Blok, H. – Journal of Educational Measurement, 1985
Raters judged essays on two occasions making it possible to address the question of whether multiple ratings, however obtained, represent the same true scores. Multiple ratings of a given rater did represent the same true scores, but ratings of different raters did not. Reliability, validity, and invalidity coefficients were computed. (Author/DWH)
Descriptors: Analysis of Variance, Elementary Education, Essay Tests, Evaluators
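A natural companion to the coefficients Blok reports is the Spearman-Brown projection, which estimates how reliable an average of k ratings would be if the ratings were parallel; Blok's finding that different raters' ratings did not share a true score is precisely the condition under which this projection overstates the gain. A brief sketch; the reliability value is illustrative, not taken from the article.

    def spearman_brown(single_rating_reliability, n_ratings):
        """Projected reliability of the mean of n_ratings parallel ratings."""
        r = single_rating_reliability
        return n_ratings * r / (1 + (n_ratings - 1) * r)

    # If a single rating of an essay has reliability 0.55, averaging three raters gives:
    print(round(spearman_brown(0.55, 3), 2))   # about 0.79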