ERIC - Search Results

Showing 1 to 15 of 173 results

Another Look at Yen's Q3: Is 0.2 an Appropriate Cut-Off?

Peer reviewed

Kelsey Nason; Christine DeMars – Journal of Educational Measurement, 2025

This study examined the widely used threshold of 0.2 for Yen's Q3, an index for violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off…

Descriptors: Item Response Theory, Statistical Bias, Test Reliability, Test Items

Modeling the Intraindividual Relation of Ability and Speed within a Test

Peer reviewed

Augustin Mutak; Robert Krause; Esther Ulitzsch; Sören Much; Jochen Ranger; Steffi Pohl – Journal of Educational Measurement, 2024

Understanding the intraindividual relation between an individual's speed and ability in testing scenarios is essential to assure a fair assessment. Different approaches exist for estimating this relationship, that either rely on specific study designs or on specific assumptions. This paper aims to add to the toolbox of approaches for estimating…

Descriptors: Testing, Academic Ability, Time on Task, Correlation

Detecting Differential Item Functioning among Multiple Groups Using IRT Residual DIF Framework

Peer reviewed

Hwanggyu Lim; Danqi Zhu; Edison M. Choe; Kyung T. Han – Journal of Educational Measurement, 2024

This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of…

Descriptors: Item Response Theory, Test Bias, Test Reliability, Test Construction

Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems

Peer reviewed

Thompson, W. Jake; Nash, Brooke; Clark, Amy K.; Hoover, Jeffrey C. – Journal of Educational Measurement, 2023

As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment…

Descriptors: Diagnostic Tests, Simulation, Test Reliability, Accuracy

Modeling Directional Testlet Effects on Multiple Open-Ended Questions

Peer reviewed

Kuan-Yu Jin; Wai-Lok Siu – Journal of Educational Measurement, 2025

Educational tests often have a cluster of items linked by a common stimulus ("testlet"). In such a design, the dependencies caused between items are called "testlet effects." In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect…

Descriptors: Models, Test Items, Educational Assessment, Scores

Comparing and Combining IRTree Models and Anchoring Vignettes in Addressing Response Styles

Peer reviewed

Mingfeng Xue; Ping Chen – Journal of Educational Measurement, 2025

Response styles pose great threats to psychological measurements. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and total-score level (ratios of extreme and middle responses to vignettes). Four models…

Descriptors: Item Response Theory, Models, Comparative Analysis, Vignettes

Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

Peer reviewed

Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…

Descriptors: Automation, Grading, Computer Assisted Testing, Scoring

Studying Score Stability with a Harmonic Regression Family: A Comparison of Three Approaches to Adjustment of Examinee-Specific Demographic Data

Peer reviewed

Lee, Yi-Hsuan; Haberman, Shelby J. – Journal of Educational Measurement, 2021

For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some are expected while others may be…

Descriptors: Scores, Regression (Statistics), Demography, Data

Using Automated Procedures to Score Educational Essays Written in Three Languages

Peer reviewed

Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025

The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…

Descriptors: College Students, Slavic Languages, German, Italian

A Note on the Use of Categorical Subscores

Peer reviewed

Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025

Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…

Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment

Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment

Peer reviewed

Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…

Descriptors: Tests, Testing, Scores, Test Construction

Calculating Conditional Reliability for Dynamic Measurement Model Capacity Estimates

Peer reviewed

McNeish, Daniel; Dumas, Denis – Journal of Educational Measurement, 2018

Dynamic measurement modeling (DMM) is a recent framework for measuring developing constructs whose manifestation occurs after an assessment is administered (e.g., learning capacity). Empirical studies have suggested that DMM may improve consequential validity of test scores because DMM learning capacity estimates were shown to be much less related…

Descriptors: Measurement Techniques, Test Reliability, Accuracy, Computation

Measures of Agreement to Assess Attribute-Level Classification Accuracy and Consistency for Cognitive Diagnostic Assessments

Peer reviewed

Johnson, Matthew S.; Sinharay, Sandip – Journal of Educational Measurement, 2018

One of the proposed uses of cognitive diagnostic assessments is to classify the examinees as either masters or nonmasters on each of a number of skills being assessed. As with any test, it is important to report the quality of these binary classifications with measures of their reliability. Cui et al. and Wang et al. have suggested reliability…

Descriptors: Classification, Accuracy, Test Reliability, Diagnostic Tests

Nonparametric Evidence of Validity, Reliability, and Fairness for Rater-Mediated Assessments: An Illustration Using Mokken Scale Analysis

Peer reviewed

Wind, Stefanie A. – Journal of Educational Measurement, 2019

Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On…

Descriptors: Nonparametric Statistics, Test Validity, Test Reliability, Item Response Theory

Can We Learn from Student Mistakes in a Formative, Reading Comprehension Assessment?

Peer reviewed

Liu, Bowen; Kennedy, Patrick C.; Seipel, Ben; Carlson, Sarah E.; Biancarosa, Gina; Davison, Mark L. – Journal of Educational Measurement, 2019

This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low-scoring students based on patterns of mistakes,…

Descriptors: Formative Evaluation, Reading Comprehension, Story Reading, Test Construction
