ERIC Number: EJ1491361
Record Type: Journal
Publication Date: 2025-Dec
Pages: 24
Abstractor: As Provided
ISBN: N/A
ISSN: ISSN-0022-0655
EISSN: EISSN-1745-3984
Available Date: 2025-10-12
Measuring the Accuracy of True Score Predictions for AI Scoring Evaluation
Daniel F. McCaffrey1; Jodi M. Casabianca1; Matthew S. Johnson1
Journal of Educational Measurement, v62 n4 p763-786 2025
Use of artificial intelligence (AI) to score responses is growing in popularity and likely to increase. Evidence of the validity of scores relies on quadratic weighted kappa (QWK) to demonstrate agreement between AI scores and human ratings. QWK is a measure of agreement that accounts for chance agreement and for the ordinality of the data by giving greater weight to larger disagreements. It has known shortcomings, including sensitivity to the reliability of the human ratings. The proportional reduction in mean squared error (PRMSE) measures agreement between predictions and their target while accounting for measurement error in the target; that is, it measures the accuracy of the automated scoring model with respect to predicting the human true scores rather than the observed ratings. Extensive simulation study results show that PRMSE is robust to many factors to which QWK is sensitive, such as human rater reliability, skew in the data, and the number of score points. Analysis of operational test data demonstrates that QWK and PRMSE can lead to different conclusions about AI scores. We investigate sample size requirements for accurate estimation of PRMSE in the context of AI scoring, although the results could apply more generally to measures with distributions similar to those tested in our study.
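The two statistics contrasted in the abstract can be sketched as follows. This is an illustrative implementation only, not the authors' estimators: QWK is computed from the observed and chance-expected score matrices with quadratic weights, and PRMSE is shown in its idealized form, where the true scores are assumed known (as in a simulation); the operational estimator in the paper instead infers true-score variance from multiple human ratings.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_cats):
    """QWK: chance-corrected agreement between two integer score
    vectors, with quadratic penalties for larger disagreements."""
    O = np.zeros((n_cats, n_cats))
    for x, y in zip(a, b):
        O[x, y] += 1.0
    O /= O.sum()                                  # observed joint proportions
    E = np.outer(O.sum(axis=1), O.sum(axis=0))    # expected under chance
    i, j = np.indices((n_cats, n_cats))
    W = (i - j) ** 2                              # quadratic disagreement weights
    return 1.0 - (W * O).sum() / (W * E).sum()

def prmse(pred, true):
    """Proportional reduction in MSE against a known true-score target:
    1 means perfect prediction; 0 means no better than predicting the mean."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return 1.0 - np.mean((pred - true) ** 2) / np.var(true)
```

Because PRMSE normalizes squared error by true-score variance rather than by observed-rating variance, noisier human ratings do not mechanically depress it the way they depress QWK, which is the robustness property the abstract highlights.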
Descriptors: Accuracy, True Scores, Prediction, Artificial Intelligence, Scoring, Automation, Error of Measurement, Statistical Distributions
Wiley. Available from: John Wiley & Sons, Inc. 111 River Street, Hoboken, NJ 07030. Tel: 800-835-6770; e-mail: cs-journals@wiley.com; Web site: https://www.wiley.com/en-us
Publication Type: Journal Articles; Reports - Evaluative
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Grant or Contract Numbers: N/A
Author Affiliations: 1Educational Testing Service

Peer reviewed
