Showing 1 to 15 of 120 results
Peer reviewed
Jihong Zhang; Jonathan Templin; Xinya Liang – Journal of Educational Measurement, 2024
Jihong Zhang; Jonathan Templin; Xinya Liang – Journal of Educational Measurement, 2024
Recently, Bayesian diagnostic classification modeling has become popular in health psychology, education, and sociology. Typically, information criteria are used for model selection when researchers want to choose the best model among alternatives. In Bayesian estimation, posterior predictive checking is a flexible Bayesian model…
Descriptors: Bayesian Statistics, Cognitive Measurement, Models, Classification
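The posterior predictive checking named in the abstract above can be illustrated with a minimal conjugate sketch (this is a generic Beta-binomial example, not the authors' diagnostic classification setup; the counts and seed are hypothetical): draw posterior samples, simulate replicated data from each draw, and compare a discrepancy statistic against the observed value.

```python
import random

random.seed(0)

# Hypothetical observed data: 40 correct responses out of 60 items.
n, y = 60, 40

# Beta(1, 1) prior + binomial likelihood -> Beta(1 + y, 1 + n - y) posterior.
post = [random.betavariate(1 + y, 1 + n - y) for _ in range(2000)]

def rbinom(trials, p):
    # Simple binomial draw: count of successes in `trials` Bernoulli trials.
    return sum(random.random() < p for _ in range(trials))

# For each posterior draw, generate a replicated data set and record the
# discrepancy statistic (here simply the number of correct responses).
y_rep = [rbinom(n, theta) for theta in post]

# Posterior predictive p-value: fraction of replicates at least as extreme
# as the observed statistic. Values near 0 or 1 flag model misfit.
ppp = sum(yr >= y for yr in y_rep) / len(y_rep)
print(round(ppp, 2))
```

Because the replicated data here come from the same model that generated the posterior, the p-value lands near 0.5; in a real check, an extreme value would signal that the model fails to reproduce that feature of the data.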
Peer reviewed
Sinharay, Sandip – Journal of Educational Measurement, 2023
Technical difficulties and other unforeseen events occasionally lead to incomplete data on educational tests, which necessitates the reporting of imputed scores to some examinees. While several approaches exist for reporting imputed scores, there is a lack of guidance on reporting the uncertainty of imputed scores. In this paper,…
Descriptors: Evaluation Methods, Scores, Standardized Tests, Simulation
Peer reviewed
Thompson, W. Jake; Nash, Brooke; Clark, Amy K.; Hoover, Jeffrey C. – Journal of Educational Measurement, 2023
As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment…
Descriptors: Diagnostic Tests, Simulation, Test Reliability, Accuracy
Peer reviewed
Yamaguchi, Kazuhiro; Zhang, Jihong – Journal of Educational Measurement, 2023
This study proposed Gibbs sampling algorithms for variable selection in a latent regression model under a unidimensional two-parameter logistic item response theory model. Three types of shrinkage priors were employed to obtain shrinkage estimates: double-exponential (i.e., Laplace), horseshoe, and horseshoe+ priors. These shrinkage priors were…
Descriptors: Algorithms, Simulation, Mathematics Achievement, Bayesian Statistics
Peer reviewed
Park, Seohee; Kim, Kyung Yong; Lee, Won-Chan – Journal of Educational Measurement, 2023
Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the widespread use of multiple measures, there is little research on their classification consistency and accuracy. Accordingly, this study introduces an…
Descriptors: Testing, Computation, Classification, Accuracy
Peer reviewed
Binici, Salih; Cuhadar, Ismail – Journal of Educational Measurement, 2022
Validity of performance standards is a key element for the defensibility of standard setting results, and validating performance standards requires collecting multiple pieces of evidence at every step during the standard setting process. This study employs a statistical procedure, latent class analysis, to set performance standards and compares…
Descriptors: Validity, Performance, Standards, Multivariate Analysis
Peer reviewed
Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Peer reviewed
Combs, Adam – Journal of Educational Measurement, 2023
A common method of checking person-fit in Bayesian item response theory (IRT) is the posterior-predictive (PP) method. In recent years, more powerful approaches have been proposed that are based on resampling methods using the popular l*[subscript z] statistic. A new Bayesian model checking method has also been proposed based on pivotal…
Descriptors: Bayesian Statistics, Goodness of Fit, Evaluation Methods, Monte Carlo Methods
Peer reviewed
He, Yinhong – Journal of Educational Measurement, 2023
Back random responding (BRR) behavior is one of the commonly observed careless response behaviors. Accurately detecting BRR behavior can improve test validity. Yu and Cheng (2019) showed that the change point analysis (CPA) procedure based on weighted residuals (CPA-WR) performed well in detecting BRR. Compared with the CPA procedure, the…
Descriptors: Test Validity, Item Response Theory, Measurement, Monte Carlo Methods
Peer reviewed
Corinne Huggins-Manley; Anthony W. Raborn; Peggy K. Jones; Ted Myers – Journal of Educational Measurement, 2024
The purpose of this study is to develop a nonparametric DIF method that (a) compares focal groups directly to the composite group that will be used to develop the reported test score scale, and (b) allows practitioners to explore for DIF related to focal groups stemming from multicategorical variables that constitute a small proportion of the…
Descriptors: Nonparametric Statistics, Test Bias, Scores, Statistical Significance
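The core idea in the abstract above, comparing a focal group directly to the composite group while conditioning on a matching score, can be sketched with a small simulation (this is a generic conditional-proportion illustration with injected DIF, not the authors' actual procedure; all data and effect sizes are hypothetical):

```python
import random

random.seed(1)

# Hypothetical response records: (group, total_score, item_correct).
# The focal group is a small fraction of the composite, as in the abstract.
data = []
for _ in range(2000):
    group = "focal" if random.random() < 0.1 else "reference"
    score = random.randint(0, 10)
    # Inject DIF: at the same total score, focal examinees are 15 points
    # less likely to answer the studied item correctly.
    p = 0.3 + 0.05 * score - (0.15 if group == "focal" else 0.0)
    data.append((group, score, random.random() < p))

def prop(rows):
    # Proportion answering the studied item correctly.
    return sum(r[2] for r in rows) / len(rows)

# At each total-score level, compare the focal group's proportion correct
# to the composite (focal + reference) proportion.
gaps = []
for s in range(11):
    at_s = [r for r in data if r[1] == s]
    focal = [r for r in at_s if r[0] == "focal"]
    if focal:
        gaps.append(prop(at_s) - prop(focal))

# A positive average gap suggests the item disadvantages the focal group.
avg_gap = sum(gaps) / len(gaps)
print(round(avg_gap, 3))
```

Conditioning on the composite-based matching score, rather than on a reference-group score, mirrors the abstract's point that the composite group is what defines the reported score scale.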
Peer reviewed
Lee, Yi-Hsuan; Haberman, Shelby J. – Journal of Educational Measurement, 2021
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some of which are expected while others may be…
Descriptors: Scores, Regression (Statistics), Demography, Data
Peer reviewed
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Peer reviewed
Joo, Seang-Hwane; Lee, Philseok – Journal of Educational Measurement, 2022
This study proposes a new Bayesian differential item functioning (DIF) detection method using posterior predictive model checking (PPMC). Item fit measures including infit, outfit, observed score distribution (OSD), and Q1 were considered as discrepancy statistics for the PPMC DIF methods. The performance of the PPMC DIF method was…
Descriptors: Test Items, Bayesian Statistics, Monte Carlo Methods, Prediction
Peer reviewed
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Peer reviewed
Wolkowitz, Amanda A. – Journal of Educational Measurement, 2021
Decision consistency (DC) is the reliability of a classification decision based on a test score. In professional credentialing, the decision is often a high-stakes pass/fail decision. The current methods for estimating DC are computationally complex. The purpose of this research is to provide a computationally and conceptually simple method for…
Descriptors: Decision Making, Reliability, Classification, Scores
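The decision consistency (DC) defined in the abstract above, the reliability of a pass/fail classification, can be approximated with a brute-force simulation (this is a generic illustration, not the paper's computationally simple method; the cut score, SEM, and true-score distribution are all assumed):

```python
import random

random.seed(2)

cut = 70          # hypothetical pass/fail cut score
sem = 5.0         # assumed standard error of measurement

def observed(true_score):
    # Observed score = true score + normally distributed measurement error.
    return true_score + random.gauss(0.0, sem)

# DC: probability that two independent parallel administrations yield the
# same pass/fail decision for the same examinee.
trials = 20000
agree = 0
for _ in range(trials):
    true_score = random.gauss(72.0, 10.0)  # assumed true-score distribution
    a = observed(true_score) >= cut
    b = observed(true_score) >= cut        # independent parallel form
    agree += (a == b)

dc = agree / trials
print(round(dc, 3))
```

Disagreements concentrate among examinees whose true scores sit near the cut, which is why DC falls as the SEM grows or as more of the score distribution piles up at the cut score.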