Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction