Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system: multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction
Wind, Stefanie A. – Journal of Educational Measurement, 2019
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On…
Descriptors: Nonparametric Statistics, Test Validity, Test Reliability, Item Response Theory
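The nonparametric rating-quality indices mentioned in the abstract above include kappa coefficients. As a generic point of reference (not the article's own analysis), Cohen's kappa for two raters can be computed from observed and chance-expected agreement:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who assign each object to one of several categories."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of exact agreements
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / n ** 2
    return (observed - expected) / (1 - expected)

# Two raters scoring ten essays on a 1-3 scale (illustrative data)
a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
```

With these ratings, observed agreement is 0.8 and chance agreement 0.34, giving a kappa of about 0.70.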
Liu, Bowen; Kennedy, Patrick C.; Seipel, Ben; Carlson, Sarah E.; Biancarosa, Gina; Davison, Mark L. – Journal of Educational Measurement, 2019
This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low-scoring students based on patterns of mistakes,…
Descriptors: Formative Evaluation, Reading Comprehension, Story Reading, Test Construction
Wang, Shiyu; Lin, Haiyan; Chang, Hua-Hua; Douglas, Jeff – Journal of Educational Measurement, 2016
Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large-scale computer-based sequential testing. Though most designs of CAT and MST exhibit strength and weakness in recent large-scale implementations, there is no simple answer to the question of which design is better because different…
Descriptors: Computer Assisted Testing, Adaptive Testing, Test Format, Sequential Approach
Dwyer, Andrew C. – Journal of Educational Measurement, 2016
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common-item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common-item equating methodology to standard setting ratings to account for…
Descriptors: Cutting Scores, Equivalency Tests, Test Format, Academic Standards
Kahraman, Nilufer; Thompson, Tony – Journal of Educational Measurement, 2011
A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article…
Descriptors: Test Length, Test Items, Alignment (Education), Models
Incremental Reliability and Validity of Multiple-Choice Tests with an Answer-Until-Correct Procedure

Hanna, Gerald S. – Journal of Educational Measurement, 1975
An alternative to the conventional right-wrong scoring method used on multiple-choice tests was presented. In the experiment, the examinee continued to respond to a multiple-choice item until feedback signified a correct answer. Findings showed that experimental scores were more reliable but less valid than inferred conventional scores.…
Descriptors: Feedback, Higher Education, Multiple Choice Tests, Scoring
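The answer-until-correct procedure described above can be sketched with a simple partial-credit rule: the fewer attempts an examinee needs before the feedback signals a correct answer, the more points the item awards. The specific credit scheme below is an illustrative assumption, not necessarily the one used in Hanna's study:

```python
def auc_item_score(num_options, attempts_used):
    """Score one answer-until-correct item: full credit for a first-try
    success, one point less for each additional attempt (illustrative rule)."""
    if not 1 <= attempts_used <= num_options:
        raise ValueError("attempts must lie between 1 and the number of options")
    return num_options - attempts_used

def auc_test_score(items):
    """Total score over a test; items is a list of
    (num_options, attempts_used) pairs, one per item."""
    return sum(auc_item_score(k, t) for k, t in items)
```

Under this rule a four-option item earns 3 points when answered on the first try, 2 on the second, and 0 when every option must be tried.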

Raffeld, Paul – Journal of Educational Measurement, 1975
Results support the contention that a Guttman-weighted objective test can have psychometric properties that are superior to those of its unweighted counterpart, as long as omissions do not exist or are assigned a value equal to the mean of the k item alternative weights. (Author/BJG)
Descriptors: Multiple Choice Tests, Predictive Validity, Test Reliability, Test Validity
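The omission condition stated in the abstract (an omitted item is assigned the mean of its k alternative weights) can be sketched directly. The weights and responses here are made up for illustration:

```python
def guttman_weighted_score(responses, weights):
    """Score a test with per-alternative (Guttman-style) weights.

    responses: for each item, the index of the chosen alternative,
               or None for an omission.
    weights:   for each item, the list of its k alternative weights.
    An omission is assigned the mean of the item's k alternative weights,
    per the condition described in the abstract above.
    """
    total = 0.0
    for choice, w in zip(responses, weights):
        if choice is None:
            total += sum(w) / len(w)  # omission: mean of the k weights
        else:
            total += w[choice]
    return total

# Two four-alternative items: one answered (alternative 3), one omitted
example_weights = [[0, 1, 2, 3], [0, 2, 1, 3]]
example_responses = [3, None]
```

The omitted second item contributes (0 + 2 + 1 + 3) / 4 = 1.5 points, so the example total is 4.5.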
The Relationship Between Number of Response Categories and Reliability of Likert-Type Questionnaires

Masters, James R. – Journal of Educational Measurement, 1974
Descriptors: Attitudes, Questionnaires, Rating Scales, Response Style (Tests)

Woodson, M. I. Chas. E. – Journal of Educational Measurement, 1974
Descriptors: Criterion Referenced Tests, Item Analysis, Test Construction, Test Reliability

Koehler, Roger A. – Journal of Educational Measurement, 1974
The purposes of the study were to develop a measure of overconfidence on probabilistic tests, to assess the measurement characteristics of such a measure, and to investigate the relationship of overconfidence on tests to knowledge and to risk-taking propensity. (Author/BB)
Descriptors: Confidence Testing, Measurement Techniques, Multiple Choice Tests, Risk

Grier, J. Brown – Journal of Educational Measurement, 1975
The expected reliability of a multiple choice test is maximized by the use of three-alternative items. (Author)
Descriptors: Achievement Tests, Multiple Choice Tests, Test Construction, Test Reliability

Algina, James; Noe, Michael J. – Journal of Educational Measurement, 1978
A computer simulation study was conducted to investigate Subkoviak's index of reliability for criterion-referenced tests, called the coefficient of agreement. Results indicate that the index can be adequately estimated. (JKS)
Descriptors: Criterion Referenced Tests, Mastery Tests, Measurement, Test Reliability
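The coefficient of agreement that Subkoviak's index estimates is the proportion of examinees who receive the same mastery classification on two parallel administrations of a criterion-referenced test. The sketch below computes that quantity directly from two sets of scores; it illustrates the target quantity, not Subkoviak's single-administration estimator of it:

```python
def coefficient_of_agreement(scores_form1, scores_form2, cutoff):
    """Proportion of examinees classified consistently (both at/above or
    both below the mastery cutoff) across two parallel test forms."""
    assert len(scores_form1) == len(scores_form2)
    consistent = sum(
        (x >= cutoff) == (y >= cutoff)
        for x, y in zip(scores_form1, scores_form2)
    )
    return consistent / len(scores_form1)

# Four examinees, two parallel forms, mastery cutoff of 6 (illustrative)
form1 = [8, 5, 9, 3]
form2 = [7, 6, 9, 2]
```

Here the second examinee is classified nonmaster on form 1 but master on form 2, so three of the four classifications agree and the coefficient is 0.75.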