Publication Date
In 2025: 3
Since 2024: 5
Since 2021 (last 5 years): 5
Since 2016 (last 10 years): 6
Since 2006 (last 20 years): 8
Descriptor
Test Reliability: 40
Test Validity: 19
Testing: 15
Testing Problems: 13
Test Construction: 12
Multiple Choice Tests: 9
Test Interpretation: 8
Test Items: 6
Error of Measurement: 5
Higher Education: 5
Response Style (Tests): 5
Source
Journal of Educational…: 40
Author
Fitzpatrick, Anne R.: 2
Hakstian, A. Ralph: 2
Kansup, Wanlop: 2
Subkoviak, Michael J.: 2
Angoff, William H.: 1
Askegaard, Lewis D.: 1
Bashaw, W. L.: 1
Breland, Hunter M.: 1
Budescu, David: 1
Mutak, Augustin: 1
Wu, Amery D.: 1
Publication Type
Journal Articles: 24
Reports - Research: 14
Reports - Evaluative: 4
Information Analyses: 2
Book/Product Reviews: 1
Guides - Non-Classroom: 1
Opinion Papers: 1
Speeches/Meeting Papers: 1
Tests/Questionnaires: 1
Education Level
Higher Education: 2
Postsecondary Education: 2
Secondary Education: 1
Audience
Practitioners: 2
Researchers: 1
Assessments and Surveys
Program for International…: 1
System of Multicultural…: 1
Test of Standard Written…: 1

Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system: multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
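
The authors' pipeline is not described beyond the snippet above, but the core idea of embedding-based essay scoring can be sketched. A minimal, hedged illustration follows, assuming the publicly available LaBSE model from the sentence-transformers package and a simple ridge regressor onto human scores; the model choice, features, and data are assumptions, not the study's implementation.

# Minimal sketch of embedding-based essay scoring (not the authors' system).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

essays = ["Ein kurzer Aufsatz ...", "Un breve saggio ...", "Kratka esej ..."]  # toy multilingual examples
human_scores = [3.0, 4.0, 2.0]                                                # toy holistic scores

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # or a multilingual BERT variant
X = encoder.encode(essays)                                    # one fixed-length embedding per essay

model = Ridge(alpha=1.0).fit(X, human_scores)                 # simple regressor onto human scores
print(model.predict(encoder.encode(["Neuer Aufsatz ..."])))   # predicted score for a new essay
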
Augustin Mutak; Robert Krause; Esther Ulitzsch; Sören Much; Jochen Ranger; Steffi Pohl – Journal of Educational Measurement, 2024
Understanding the intraindividual relation between an individual's speed and ability in testing scenarios is essential to assure a fair assessment. Different approaches exist for estimating this relationship, which rely either on specific study designs or on specific assumptions. This paper aims to add to the toolbox of approaches for estimating…
Descriptors: Testing, Academic Ability, Time on Task, Correlation
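
The paper's estimator is not given in the snippet above. As a rough, hedged illustration of what an intraindividual speed-accuracy relation means, one naive check is the within-person correlation between item response time and item correctness; the simulated data and the assumed direction of the effect below are toy assumptions only.

# Naive within-person speed-accuracy check (illustration only; not the paper's estimator).
import numpy as np

rng = np.random.default_rng(0)
n_items = 40
log_rt = rng.normal(3.0, 0.5, n_items)           # simulated log response times for one examinee
p_correct = 1 / (1 + np.exp(-(log_rt - 3.0)))    # toy assumption: slower responses, higher accuracy
correct = rng.binomial(1, p_correct)

r = np.corrcoef(log_rt, correct)[0, 1]           # intraindividual speed-accuracy correlation
print(f"within-person speed-accuracy correlation: {r:.2f}")
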
Hwanggyu Lim; Danqi Zhu; Edison M. Choe; Kyung T. Han – Journal of Educational Measurement, 2024
This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of…
Descriptors: Item Response Theory, Test Bias, Test Reliability, Test Construction
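
The GRDIF statistics themselves are not reproduced in the snippet above. As a familiar stand-in only, the sketch below implements the classical Mantel-Haenszel DIF check for one item and two groups; it is a baseline illustration of DIF detection, not the residual-based framework the study describes.

# Stand-in illustration: Mantel-Haenszel DIF for one item, two groups (NOT the GRDIF statistic).
import numpy as np

def mantel_haenszel_dif(item_correct, total_score, group):
    """item_correct, group: 0/1 numpy arrays; group 0 = reference, 1 = focal."""
    num, den = 0.0, 0.0
    for s in np.unique(total_score):                          # stratify by total test score
        m = total_score == s
        a = np.sum((group[m] == 0) & (item_correct[m] == 1))  # reference, correct
        b = np.sum((group[m] == 0) & (item_correct[m] == 0))  # reference, incorrect
        c = np.sum((group[m] == 1) & (item_correct[m] == 1))  # focal, correct
        d = np.sum((group[m] == 1) & (item_correct[m] == 0))  # focal, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    alpha = num / den                                         # common odds ratio estimate
    return -2.35 * np.log(alpha)                              # ETS delta scale; |delta| >= 1.5 is typically flagged

# usage: mantel_haenszel_dif(responses[:, j], responses.sum(axis=1), group_labels)
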
Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025
In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…
Descriptors: Automation, Grading, Computer Assisted Testing, Scoring
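
The specific explainability techniques and consistency measures examined are not named in the snippet above. As a generic, hedged illustration of what a consistency check can look like, the sketch below rank-correlates the token attributions that two runs (or two methods) assign to the same student response; the tokens and attribution values are invented toy data.

# Generic attribution-consistency check (illustration only; not the study's procedure).
from scipy.stats import spearmanr

tokens        = ["the", "mitochondria", "produce", "ATP", "for", "the", "cell"]
attribution_a = [0.01, 0.85, 0.40, 0.90, 0.02, 0.01, 0.35]   # e.g., run 1 / method 1
attribution_b = [0.03, 0.80, 0.55, 0.88, 0.01, 0.02, 0.20]   # e.g., run 2 / method 2

rho, _ = spearmanr(attribution_a, attribution_b)             # rank agreement of the two attribution vectors
print(f"attribution consistency (Spearman rho): {rho:.2f}")
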
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction
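
A minimal sketch of the multilabel idea, not the authors' architecture: a single network with one sigmoid output per trait, trained against binary trait indicators. Here scikit-learn's MLPClassifier is used because it accepts multilabel targets; the dimensions and random data are toy assumptions.

# Toy multilabel scoring sketch (illustration only; not the MNN reported above).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n_examinees, n_features, n_traits = 500, 60, 15
X = rng.normal(size=(n_examinees, n_features))                # e.g., response-derived features
Y = (rng.random((n_examinees, n_traits)) < 0.5).astype(int)   # binary indicator per trait

# One network with a sigmoid output per trait, rather than 15 separate models.
mnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, Y)
trait_probs = mnn.predict_proba(X[:3])                        # per-trait probabilities for 3 examinees
print(np.round(trait_probs, 2))
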
Wang, Shiyu; Lin, Haiyan; Chang, Hua-Hua; Douglas, Jeff – Journal of Educational Measurement, 2016
Computerized adaptive testing (CAT) and multistage testing (MST) have become two of the most popular modes in large-scale computer-based sequential testing. Though most designs of CAT and MST exhibit strengths and weaknesses in recent large-scale implementations, there is no simple answer to the question of which design is better because different…
Descriptors: Computer Assisted Testing, Adaptive Testing, Test Format, Sequential Approach
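
For readers unfamiliar with the mechanics being compared, the core of an item-level CAT under a 2PL model is to administer, at each step, the unused item with maximum Fisher information at the current ability estimate. The sketch below is a bare-bones illustration with a simulated item bank and a grid-based posterior update; real CAT and MST designs add exposure control, content constraints, and module assembly, none of which are shown here.

# Bare-bones CAT loop under a 2PL model (illustration only).
import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.8, 2.0, 200)            # discriminations of a toy item bank
b = rng.normal(0.0, 1.0, 200)             # difficulties
grid = np.linspace(-4, 4, 81)             # ability grid for a simple posterior update
post = np.exp(-0.5 * grid**2)             # N(0,1) prior
used = []
true_theta = 0.7

for _ in range(20):                       # 20-item adaptive test
    theta_hat = grid[np.argmax(post)]     # current MAP ability estimate
    p = 1 / (1 + np.exp(-a * (theta_hat - b)))
    info = a**2 * p * (1 - p)             # 2PL Fisher information at theta_hat
    info[used] = -np.inf                  # do not reuse items
    j = int(np.argmax(info))
    used.append(j)
    p_true = 1 / (1 + np.exp(-a[j] * (true_theta - b[j])))
    u = rng.binomial(1, p_true)           # simulated response
    p_grid = 1 / (1 + np.exp(-a[j] * (grid - b[j])))
    post *= p_grid if u == 1 else (1 - p_grid)   # update posterior over the grid

print("estimated ability:", grid[np.argmax(post)])
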
Wang, Wenyi; Song, Lihong; Chen, Ping; Meng, Yaru; Ding, Shuliang – Journal of Educational Measurement, 2015
Classification consistency and accuracy are viewed as important indicators for evaluating the reliability and validity of classification results in cognitive diagnostic assessment (CDA). Pattern-level classification consistency and accuracy indices were introduced by Cui, Gierl, and Chang. However, the indices at the attribute level have not yet…
Descriptors: Classification, Reliability, Accuracy, Cognitive Tests
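
For context on these indices, one common simplification at the attribute level works directly from each examinee's posterior probability of mastering the attribute: accuracy is the expected agreement between the assigned classification and the true status, and consistency is the expected agreement between two independent classifications. The formulas and data below are an illustrative approximation in the spirit of Rudner-type indices, not the derivation in the paper.

# Illustrative attribute-level accuracy and consistency from posterior mastery probabilities.
import numpy as np

p = np.array([0.95, 0.80, 0.55, 0.10, 0.30])    # P(mastery) for one attribute, five examinees

# Accuracy: probability the mastery call (cut at 0.5) matches the true status.
accuracy = np.mean(np.maximum(p, 1 - p))

# Consistency: probability two independent administrations yield the same call.
consistency = np.mean(p**2 + (1 - p)**2)

print(f"attribute-level accuracy ~ {accuracy:.2f}, consistency ~ {consistency:.2f}")
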
Jin, Kuan-Yu; Wang, Wen-Chung – Journal of Educational Measurement, 2014
Sometimes, test-takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to…
Descriptors: Student Evaluation, Item Response Theory, Models, Simulation
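
One widely used, simpler proxy for the low-effort behavior described above, and not the IRT model the entry proposes, is response-time-based rapid-guessing detection: responses faster than a time threshold are flagged as non-effortful, and "response-time effort" is the proportion of unflagged responses. A toy sketch with an assumed common threshold:

# Toy rapid-guessing flagging via a response-time threshold (a proxy, not the paper's model).
import numpy as np

rt = np.array([38.0, 41.5, 2.1, 55.0, 1.8, 3.0, 47.2, 60.3])   # seconds per item, one examinee
threshold = 5.0                                                # assumed common threshold (seconds)

rapid = rt < threshold                        # flagged as rapid guesses (likely low effort)
response_time_effort = 1 - rapid.mean()       # proportion of effortful responses
print(rapid.astype(int), f"RTE = {response_time_effort:.2f}")
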

Lennon, Roger T. – Journal of Educational Measurement, 1975
Reviews the 1974 Standards, an update that serves as a guide to test development and publishing and to the training of persons for these endeavors. (DEP)
Descriptors: Educational Testing, Psychological Testing, Scoring, Standards

Carlson, Jerry S.; Dillon, Ronna – Journal of Educational Measurement, 1979
The Matrices and Order of Appearance subtests of a Piagetian test battery were administered to a sample of second-grade children on two occasions under two test conditions: standardized testing and a dialogue between child and examiner. Differences for test condition and time of testing were found. (JKS)
Descriptors: Academic Achievement, Developmental Psychology, Developmental Stages, Individual Testing

Wen, Shih-Sung – Journal of Educational Measurement, 1975
The relationship between students' scores on a verbal meaning test and their degrees of confidence in item responses was investigated. Subjects were Black undergraduate students who were administered a verbal meaning test under a confidence testing procedure. (Author/BJG)
Descriptors: Blacks, Confidence Testing, Higher Education, Language Skills

Sykes, Robert C.; Ito, Kyoko; Fitzpatrick, Anne R.; Ercikan, Kadriye – Journal of Educational Measurement, 1997
The five chapters of this report provide resources that deal with validity, generalizability, comparability, performance standards, and issues of fairness, equity, and bias in performance assessments. The book is written for experienced educational measurement practitioners, although an extensive familiarity with performance assessment is not required.…
Descriptors: Educational Assessment, Measurement Techniques, Performance Based Assessment, Standards

Angoff, William H.; Schrader, William B. – Journal of Educational Measurement, 1984
The reported data provide a basis for evaluating the formula-scoring versus rights-scoring issue and for assessing the effects of directions on the reliability and parallelism of scores for sophisticated examinees taking professionally developed tests. Results support the invariance hypothesis rather than the differential effects hypothesis.…
Descriptors: College Entrance Examinations, Guessing (Tests), Higher Education, Hypothesis Testing
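
For context on the two scoring rules being compared: rights scoring counts only the number of correct answers, while conventional formula scoring subtracts a fraction of the wrong answers to correct for guessing. A minimal illustration with toy numbers (not the study's data):

# Rights scoring vs. conventional formula scoring for a multiple-choice test.
def rights_score(n_right):
    return n_right

def formula_score(n_right, n_wrong, k):
    """k = number of options per item; omitted items are not penalized."""
    return n_right - n_wrong / (k - 1)

# Example: 60 right, 20 wrong, 10 omitted on a 90-item, 5-option test.
print(rights_score(60))            # 60
print(formula_score(60, 20, 5))    # 55.0
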

Subkoviak, Michael J. – Journal of Educational Measurement, 1988
Current methods for obtaining reliability indices for mastery tests can be laborious. This paper offers practitioners tables from which agreement and kappa coefficients can be read directly and provides criteria for acceptable values of agreement and kappa coefficients. (TJH)
Descriptors: Mastery Tests, Statistical Analysis, Test Reliability, Testing
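
Definitionally, the agreement coefficient is the proportion of examinees given the same mastery classification on two administrations, and kappa adjusts that proportion for chance agreement; Subkoviak's tables approximate these from a single administration. The sketch below shows only the two-administration definitions with invented toy data, not the single-administration method the paper tabulates.

# Agreement (p_o) and Cohen's kappa for mastery/non-mastery decisions on two administrations.
import numpy as np

def agreement_and_kappa(class1, class2):
    class1, class2 = np.asarray(class1), np.asarray(class2)
    p_o = np.mean(class1 == class2)                 # observed agreement
    p1, p2 = np.mean(class1), np.mean(class2)       # proportions classified as masters
    p_c = p1 * p2 + (1 - p1) * (1 - p2)             # chance agreement
    kappa = (p_o - p_c) / (1 - p_c)
    return p_o, kappa

# toy mastery decisions (1 = master) from two parallel test forms
form_a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
form_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
print(agreement_and_kappa(form_a, form_b))          # (0.8, ~0.58)
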

Glass, Gene V. – Journal of Educational Measurement, 1978
A detailed analysis of standard setting and criteria for test scores and educational decisions is presented. The author contends that present procedures are in need of re-examination. (JKS)
Descriptors: Academic Standards, Behavioral Objectives, Criterion Referenced Tests, Decision Making