Publication Date
In 2025: 3
Since 2024: 4
Since 2021 (last 5 years): 4
Since 2016 (last 10 years): 10
Since 2006 (last 20 years): 20
Descriptor
Item Response Theory: 29
Reliability: 13
Models: 11
Test Items: 11
Test Reliability: 11
Comparative Analysis: 6
Scores: 6
Error of Measurement: 5
Interrater Reliability: 5
Psychometrics: 5
Test Construction: 5
Source
Journal of Educational Measurement: 29
Author
Kolen, Michael J.: 2
Lee, Won-Chan: 2
Wang, Wen-Chung: 2
Biancarosa, Gina: 1
Camilli, Gregory: 1
Carlson, Sarah E.: 1
Casabianca, Jodi M.: 1
Choi, Jiwon: 1
Congdon, Peter J.: 1
Cui, Ying: 1
DeMars, Christine: 1
Publication Type
Journal Articles: 29
Reports - Research: 15
Reports - Evaluative: 10
Reports - Descriptive: 4
Book/Product Reviews: 1
Education Level
Secondary Education: 2
Elementary Secondary Education: 1
High Schools: 1
Assessments and Surveys
SAT (College Admission Test): 2
Work Keys (ACT): 1
Kelsey Nason; Christine DeMars – Journal of Educational Measurement, 2025
This study examined the widely used threshold of 0.2 for Yen's Q3, an index used to detect violations of local independence. Specifically, a simulation was conducted to investigate whether Q3 values were related to the magnitude of bias in estimates of reliability, item parameters, and examinee ability. Results showed that Q3 values below the typical cut-off…
Descriptors: Item Response Theory, Statistical Bias, Test Reliability, Test Items
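The study above evaluates a fixed cut-off for Yen's Q3, which is the correlation between pairs of item residuals (observed minus model-expected responses) after an IRT model has been fit. A minimal sketch in Python, assuming a 2PL model with illustrative parameters (not taken from the study) and using the generating abilities in place of estimated ones:

```python
# Minimal sketch of Yen's Q3 for one item pair under a 2PL model.
# Item parameters, sample size, and abilities are illustrative only;
# in practice the residuals use estimated thetas and item parameters.
import numpy as np

rng = np.random.default_rng(0)
n_persons = 2000
theta = rng.normal(size=n_persons)

a = np.array([1.2, 0.8])    # hypothetical discriminations
b = np.array([-0.5, 0.3])   # hypothetical difficulties

# Model-implied probabilities and simulated responses.
p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
x = rng.binomial(1, p)

# Q3 is the correlation between the two items' residuals.
residuals = x - p
q3 = np.corrcoef(residuals[:, 0], residuals[:, 1])[0, 1]
print(f"Q3 = {q3:.3f}")  # values beyond roughly |0.2| are often read as local dependence
```

Under local independence Q3 tends to be slightly negative rather than zero, which is part of why a single fixed threshold such as 0.2 is worth scrutinizing.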
Hwanggyu Lim; Danqi Zhu; Edison M. Choe; Kyung T. Han – Journal of Educational Measurement, 2024
This study presents a generalized version of the residual differential item functioning (RDIF) detection framework in item response theory, named GRDIF, to analyze differential item functioning (DIF) in multiple groups. The GRDIF framework retains the advantages of the original RDIF framework, such as computational efficiency and ease of…
Descriptors: Item Response Theory, Test Bias, Test Reliability, Test Construction
Kuan-Yu Jin; Wai-Lok Siu – Journal of Educational Measurement, 2025
Educational tests often have a cluster of items linked by a common stimulus ("testlet"). In such a design, the dependencies induced among items are called "testlet effects." In particular, the directional testlet effect (DTE) refers to a recursive influence whereby responses to earlier items can positively or negatively affect…
Descriptors: Models, Test Items, Educational Assessment, Scores
Mingfeng Xue; Ping Chen – Journal of Educational Measurement, 2025
Response styles pose serious threats to psychological measurement. This research compares IRTree models and anchoring vignettes in addressing response styles and estimating the target traits. It also explores the potential of combining them at the item level and total-score level (ratios of extreme and middle responses to vignettes). Four models…
Descriptors: Item Response Theory, Models, Comparative Analysis, Vignettes
Nieto, Ricardo; Casabianca, Jodi M. – Journal of Educational Measurement, 2019
Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple-structured tests such as these, rely on multiple multiple-choice and/or constructed-response sections of items to generate multiple…
Descriptors: Tests, Scoring, Responses, Test Items
Wind, Stefanie A. – Journal of Educational Measurement, 2019
Numerous researchers have proposed methods for evaluating the quality of rater-mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many-facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On…
Descriptors: Nonparametric Statistics, Test Validity, Test Reliability, Item Response Theory
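As a concrete contrast between the two families of methods named above, a nonparametric agreement index such as Cohen's kappa can be computed directly from a two-rater cross-classification, with no measurement model involved. A minimal sketch with simulated ratings (illustrative data only, not from the study):

```python
# Minimal sketch of Cohen's kappa for two hypothetical raters on a 0-3 scale.
import numpy as np

rng = np.random.default_rng(2)
true_score = rng.integers(0, 4, size=500)
# Each rater mostly reproduces the underlying score, with occasional +/-1 noise.
rater_a = np.clip(true_score + rng.choice([-1, 0, 0, 0, 1], size=500), 0, 3)
rater_b = np.clip(true_score + rng.choice([-1, 0, 0, 0, 1], size=500), 0, 3)

categories = np.arange(4)
# Joint proportion table: rows are rater A's categories, columns rater B's.
table = np.array([[np.sum((rater_a == i) & (rater_b == j)) for j in categories]
                  for i in categories]) / len(true_score)
p_observed = np.trace(table)                                   # observed agreement
p_chance = (table.sum(axis=1) * table.sum(axis=0)).sum()       # chance agreement
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Cohen's kappa = {kappa:.3f}")
```

A parametric alternative such as the many-facet Rasch model would instead estimate a severity parameter for each rater within a measurement model, which is the distinction the article examines.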
Lee, Won-Chan; Kim, Stella Y.; Choi, Jiwon; Kang, Yujin – Journal of Educational Measurement, 2020
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and…
Descriptors: Raw Scores, Item Response Theory, Test Format, Multiple Choice Tests
Joo, Seang-Hwane; Lee, Philseok; Stark, Stephen – Journal of Educational Measurement, 2018
This research derived information functions and proposed new scalar information indices to examine the quality of multidimensional forced choice (MFC) items based on the RANK model. We also explored how GGUM-RANK information, latent trait recovery, and reliability varied across three MFC formats: pairs (two response alternatives), triplets (three…
Descriptors: Item Response Theory, Models, Item Analysis, Reliability
Wesolowski, Brian C. – Journal of Educational Measurement, 2019
The purpose of this study was to build a Random Forest supervised machine learning model to predict musical rater-type classifications based on a Rasch analysis of raters' differential severity/leniency related to item use. Raw scores (N = 1,704) from 142 raters across nine high school solo and ensemble festivals (grades 9-12) were…
Descriptors: Item Response Theory, Prediction, Classification, Artificial Intelligence
Liu, Bowen; Kennedy, Patrick C.; Seipel, Ben; Carlson, Sarah E.; Biancarosa, Gina; Davison, Mark L. – Journal of Educational Measurement, 2019
This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low-scoring students based on patterns of mistakes,…
Descriptors: Formative Evaluation, Reading Comprehension, Story Reading, Test Construction
Wang, Wen-Chung; Su, Chi-Ming; Qiu, Xue-Lan – Journal of Educational Measurement, 2014
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning…
Descriptors: Item Response Theory, Interrater Reliability, Models, Correlation
Shu, Lianghua; Schwarz, Richard D. – Journal of Educational Measurement, 2014
As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt-Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test-part similarity are discussed. A detailed computational example is presented for the targeted…
Descriptors: Item Response Theory, Reliability, Models, Computation
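Two of the coefficients named in this abstract can be sketched directly. The example below computes Cronbach's α from a raw response matrix and an IRT marginal reliability as one minus the average error variance relative to the latent-trait variance, assuming a Rasch-type model with illustrative parameters (not the article's models or data):

```python
# Minimal sketch of Cronbach's alpha and IRT marginal reliability
# for simulated dichotomous responses under a Rasch-type model.
import numpy as np

rng = np.random.default_rng(1)
n_persons, n_items = 1000, 20
theta = rng.normal(size=n_persons)
b = rng.normal(scale=0.8, size=n_items)           # hypothetical difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b)))
x = rng.binomial(1, p)

# Cronbach's alpha from the raw response matrix.
k = n_items
sum_item_var = x.var(axis=0, ddof=1).sum()
total_var = x.sum(axis=1).var(ddof=1)
alpha = k / (k - 1) * (1 - sum_item_var / total_var)

# Marginal reliability: 1 - E[SE^2(theta)] / Var(theta),
# with SE^2 taken as the inverse of the Rasch test information.
info = (p * (1 - p)).sum(axis=1)
marginal_rel = 1 - np.mean(1 / info) / theta.var(ddof=1)
print(f"alpha = {alpha:.3f}, marginal reliability = {marginal_rel:.3f}")
```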
Schroeders, Ulrich; Robitzsch, Alexander; Schipolowski, Stefan – Journal of Educational Measurement, 2014
C-tests are a specific variant of cloze tests that are considered time-efficient, valid indicators of general language proficiency. They are commonly analyzed with models of item response theory assuming local item independence. In this article we estimated local interdependencies for 12 C-tests and compared the changes in item difficulties,…
Descriptors: Comparative Analysis, Psychometrics, Cloze Procedure, Language Tests
Li, Deping; Jiang, Yanlin; von Davier, Alina A. – Journal of Educational Measurement, 2012
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma…
Descriptors: True Scores, Equated Scores, Item Response Theory, Accuracy
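One of the scale transformation approaches mentioned above, the mean/sigma method, places a new form's item parameters on the base scale through a linear transformation whose constants come from the means and standard deviations of the anchor items' difficulty estimates. A minimal sketch with made-up anchor values (not the study's data):

```python
# Minimal sketch of the mean/sigma linking method.
# Anchor-item difficulties on the base scale and on the new form's scale
# are illustrative only.
import numpy as np

b_base = np.array([-1.2, -0.4, 0.1, 0.7, 1.5])   # anchors, base-form scale
b_new  = np.array([-0.9, -0.1, 0.4, 1.0, 1.8])   # same anchors, new-form scale

A = b_base.std() / b_new.std()
B = b_base.mean() - A * b_new.mean()

# Place new-form parameters on the base scale: b* = A*b + B (and a* = a/A).
b_transformed = A * b_new + B
print(f"A = {A:.3f}, B = {B:.3f}, transformed b: {np.round(b_transformed, 3)}")
```

The transformed difficulties then feed into the IRT true score equating that the study evaluates for accuracy and consistency over time.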
Jin, Kuan-Yu; Wang, Wen-Chung – Journal of Educational Measurement, 2014
Sometimes, test-takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to…
Descriptors: Student Evaluation, Item Response Theory, Models, Simulation