Publication Date
In 2025 | 2
Since 2024 | 2
Since 2021 (last 5 years) | 3
Since 2016 (last 10 years) | 7
Since 2006 (last 20 years) | 8
Descriptor
Evaluators | 21
Item Response Theory | 6
Cutting Scores | 5
Evaluation Methods | 5
Error of Measurement | 4
Licensing Examinations… | 4
Models | 4
Scoring | 4
Standard Setting | 4
Standard Setting (Scoring) | 4
Comparative Analysis | 3
Source
Journal of Educational Measurement | 21
Author
Norcini, John J. | 3
Plake, Barbara S. | 3
Wind, Stefanie A. | 3
Clauser, Brian E. | 2
Impara, James C. | 2
Alex J. Mechaber | 1
Blok, H. | 1
Brian E. Clauser | 1
Buckendahl, Chad W. | 1
Busch, John Christian | 1
Carl Westine | 1
Publication Type
Journal Articles | 21
Reports - Research | 17
Reports - Evaluative | 4
Information Analyses | 1
Speeches/Meeting Papers | 1
Assessments and Surveys
National Teacher Examinations | 1
Peter Baldwin; Victoria Yaneva; Kai North; Le An Ha; Yiyun Zhou; Alex J. Mechaber; Brian E. Clauser – Journal of Educational Measurement, 2025
Recent developments in the use of large-language models have led to substantial improvements in the accuracy of content-based automated scoring of free-text responses. The reported accuracy levels suggest that automated systems could have widespread applicability in assessment. However, before they are used in operational testing, other aspects of…
Descriptors: Artificial Intelligence, Scoring, Computational Linguistics, Accuracy
Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Casabianca, Jodi M.; Donoghue, John R.; Shin, Hyo Jeong; Chao, Szu-Fu; Choi, Ikkyu – Journal of Educational Measurement, 2023
Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. In order to fit such models, the ratings data must be sufficiently connected to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios,…
Descriptors: Item Response Theory, Alternative Assessment, Evaluators, Research Problems
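The connectivity requirement described in the Casabianca et al. abstract can be checked directly from the scoring design before any model is estimated. A minimal sketch, assuming only a list of (rater, response) assignments and using the Python standard library; the data and function name are illustrative. It treats raters and responses as nodes in a bipartite graph, adds an edge for each assigned rating, and tests whether the graph forms a single connected component.

    from collections import defaultdict, deque

    def is_connected_design(assignments):
        """True if every rater and response can be linked to every other one
        through shared ratings, so all rater effects can sit on one scale."""
        graph = defaultdict(set)
        for rater, response in assignments:
            r_node, s_node = ("rater", rater), ("resp", response)
            graph[r_node].add(s_node)
            graph[s_node].add(r_node)
        if not graph:
            return True
        start = next(iter(graph))
        seen, queue = {start}, deque([start])
        while queue:                       # breadth-first search over the bipartite graph
            for nbr in graph[queue.popleft()] - seen:
                seen.add(nbr)
                queue.append(nbr)
        return len(seen) == len(graph)

    # Two disjoint rater pools scoring disjoint responses: not connected
    design = [("R1", "e1"), ("R1", "e2"), ("R2", "e2"), ("R3", "e3"), ("R3", "e4")]
    print(is_connected_design(design))                    # False
    print(is_connected_design(design + [("R2", "e3")]))   # True once a linking rating is added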
Wind, Stefanie A.; Sebok-Syer, Stefanie S. – Journal of Educational Measurement, 2019
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of…
Descriptors: Measurement, Models, Evaluators, Simulation
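The rater fit statistics discussed in this entry are usually infit and outfit mean squares built from standardized residuals. A hedged sketch, assuming the model-expected rating and its variance for each rater-by-response encounter have already been obtained from a fitted measurement model; the numbers below are invented.

    import numpy as np

    def rater_fit(observed, expected, variance):
        """Infit and outfit mean squares for a single rater.
        expected / variance are the model-implied mean and variance of each rating."""
        observed, expected, variance = map(np.asarray, (observed, expected, variance))
        resid = observed - expected
        z_sq = resid ** 2 / variance                   # squared standardized residuals
        outfit = z_sq.mean()                           # unweighted: flags outlying ratings
        infit = (resid ** 2).sum() / variance.sum()    # information-weighted: flags inlying misfit
        return infit, outfit

    obs = [3, 2, 4, 1, 3]
    exp = [2.6, 2.1, 3.4, 1.8, 2.9]
    var = [0.8, 0.7, 0.9, 0.6, 0.8]
    print(rater_fit(obs, exp, var))   # values near 1.0 suggest ratings consistent with the model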
Clauser, Brian E.; Kane, Michael; Clauser, Jerome C. – Journal of Educational Measurement, 2020
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item…
Descriptors: Cutting Scores, Generalization, Decision Making, Standard Setting
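As a concrete version of the judge-related error component described in the Clauser, Kane, and Clauser abstract, the sketch below computes an Angoff cut score from a judges-by-items matrix of estimated proportions and a standard error based only on variability across judges. This is a simplification (the article treats items, and possibly panels, as additional sources of error), and the ratings are invented.

    import numpy as np

    # Hypothetical Angoff judgments: rows = judges, columns = items; each cell is the judged
    # probability that a minimally competent examinee answers the item correctly.
    ratings = np.array([
        [0.60, 0.45, 0.80, 0.55],
        [0.70, 0.50, 0.75, 0.60],
        [0.55, 0.40, 0.85, 0.50],
    ])

    judge_cuts = ratings.sum(axis=1)                 # each judge's implied raw cut score
    cut_score = judge_cuts.mean()                    # panel cut score: mean over judges
    se_judges = judge_cuts.std(ddof=1) / np.sqrt(len(judge_cuts))   # judge-sampling error only

    print(f"cut score = {cut_score:.2f} raw points, SE(judges) = {se_judges:.2f}")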
Wind, Stefanie A.; Jones, Eli – Journal of Educational Measurement, 2019
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater-mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of…
Descriptors: Rating Scales, Models, Evaluators, Data Collection
Peabody, Michael R.; Wind, Stefanie A. – Journal of Educational Measurement, 2019
Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard-setting panels should have the proper qualifications to make the judgments asked…
Descriptors: Standard Setting, Decision Making, Performance Based Assessment, Evaluators
Wang, Wen-Chung; Su, Chi-Ming; Qiu, Xue-Lan – Journal of Educational Measurement, 2014
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning…
Descriptors: Item Response Theory, Interrater Reliability, Models, Correlation

Plake, Barbara S.; Impara, James C.; Irwin, Patrick M. – Journal of Educational Measurement, 2000
Examined intra- and inter-rater consistency of item performance estimated from an Angoff standard setting over 2 years, with 29 panelists one year, and 30 the next. Results provide evidence that item performance estimates were consistent within and across panels within and across years. Factors that might have influenced this high degree of…
Descriptors: Evaluators, Prediction, Reliability, Standard Setting

Engelhard, George, Jr. – Journal of Educational Measurement, 1996
A new method for evaluating rater accuracy within the context of performance assessments is described. It uses an extended Rasch measurement model, FACETS, which is illustrated with 373 benchmark papers from the Georgia High School Graduation Writing Test rated by 20 operational raters and an expert panel. (SLD)
Descriptors: Essay Tests, Evaluation Methods, Evaluators, Performance Based Assessment
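For readers unfamiliar with the FACETS formulation, the extended Rasch (many-facet) model referenced here is typically written as an adjacent-category logit with a separate rater facet. A standard rating-scale version, using generic notation rather than the article's own symbols:

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k

where \theta_n is the ability of examinee n, \delta_i the difficulty of domain or item i, \lambda_j the severity of rater j, and \tau_k the threshold for moving from rating category k-1 to category k.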
DeCarlo, Lawrence T. – Journal of Educational Measurement, 2005
An approach to essay grading based on signal detection theory (SDT) is presented. SDT offers a basis for understanding rater behavior with respect to the scoring of construct responses, in that it provides a theory of psychological processes underlying the raters' behavior. The approach also provides measures of the precision of the raters and the…
Descriptors: Validity, Simulation, Grading, Item Response Theory
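In the signal detection framing DeCarlo describes, a rater is modeled as perceiving a noisy version of an essay's latent quality and applying ordered criteria to that perception, so precision corresponds to a discrimination parameter and the criteria capture severity. A small simulation sketch of that idea; the parameter values and names are illustrative and not taken from the article.

    import numpy as np

    rng = np.random.default_rng(0)

    def sdt_rating(true_class, d, criteria, rng):
        """One rating under a simple SDT rater model: the perceived quality is the true
        class scaled by discrimination d plus standard-normal noise; the rating is the
        interval of the ordered criteria into which the perception falls."""
        perception = d * true_class + rng.standard_normal()
        return int(np.searchsorted(criteria, perception))

    criteria = [0.8, 2.4]                      # two criteria -> ratings 0, 1, 2
    truth = [0, 1, 2] * 200
    precise = [sdt_rating(t, 2.0, criteria, rng) for t in truth]
    imprecise = [sdt_rating(t, 0.7, criteria, rng) for t in truth]
    print("agreement, high-precision rater:", np.mean(np.array(precise) == truth))
    print("agreement, low-precision rater:", np.mean(np.array(imprecise) == truth))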

Daly, John A.; Dickson-Markman, Fran – Journal of Educational Measurement, 1982
The effect of the quality of preceding essays on judgments of the quality of a subsequent essay was investigated. Inservice teachers as judges failed to produce consistently and unambiguously biased judgments. Results suggest the presence of a positive bias and the absence of a negative bias. (Author/PN)
Descriptors: Bias, Cognitive Processes, Context Clues, Context Effect

Buckendahl, Chad W.; Smith, Russell W.; Impara, James C.; Plake, Barbara S. – Journal of Educational Measurement, 2002
Compared simplified variations on the Angoff and Bookmark methods for setting cut scores on educational assessments with data from a grade 7 mathematics test (23 panelists in all). Although the Angoff method is more widely used, results show that the Bookmark method has some promising features. (SLD)
Descriptors: Cutting Scores, Educational Assessment, Evaluators, Junior High School Students
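The Bookmark method compared in this study maps a panelist's bookmark placement in a difficulty-ordered item booklet to a point on the ability scale, conventionally at a response probability (RP) of 0.67. Under a Rasch model that mapping has a closed form, theta_cut = b + ln(RP / (1 - RP)), where b is the bookmarked item's difficulty. A brief sketch with invented difficulties; the function and values are hypothetical, not the study's data.

    import math

    RP = 0.67   # response probability criterion commonly used in Bookmark studies

    def bookmark_cut(item_difficulties, bookmark_index, rp=RP):
        """Theta at which the bookmarked item is answered correctly with probability rp,
        assuming a Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))."""
        b = sorted(item_difficulties)[bookmark_index]    # booklet pages ordered by difficulty
        return b + math.log(rp / (1 - rp))

    difficulties = [-1.2, -0.4, 0.1, 0.6, 1.3, 2.0]       # invented Rasch difficulties
    print(bookmark_cut(difficulties, bookmark_index=3))   # panelist bookmarks the fourth page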

Clauser, Brian E.; Clyman, Stephen G.; Swanson, David B. – Journal of Educational Measurement, 1999
Two studies focused on aspects of the rating process in performance assessment. The first, which involved 15 raters and about 400 medical students, made the "committee" facet of raters working in groups explicit, and the second, which involved about 200 medical students and four raters, made the "rating-occasion" facet…
Descriptors: Error Patterns, Evaluation Methods, Evaluators, Higher Education

Blok, H. – Journal of Educational Measurement, 1985
Raters judged essays on two occasions making it possible to address the question of whether multiple ratings, however obtained, represent the same true scores. Multiple ratings of a given rater did represent the same true scores, but ratings of different raters did not. Reliability, validity, and invalidity coefficients were computed. (Author/DWH)
Descriptors: Analysis of Variance, Elementary Education, Essay Tests, Evaluators
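A natural companion to the coefficients Blok reports is the Spearman-Brown projection, which estimates how reliable an average of k ratings would be if the ratings were parallel; Blok's finding that different raters' ratings did not share a true score is precisely the condition under which this projection overstates the gain. A brief sketch; the reliability value is illustrative, not taken from the article.

    def spearman_brown(single_rating_reliability, n_ratings):
        """Projected reliability of the mean of n_ratings parallel ratings."""
        r = single_rating_reliability
        return n_ratings * r / (1 + (n_ratings - 1) * r)

    # If a single rating of an essay has reliability 0.55, averaging three raters gives:
    print(round(spearman_brown(0.55, 3), 2))   # about 0.79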