Showing 1 to 15 of 120 results
Peer reviewed
Jihong Zhang; Jonathan Templin; Xinya Liang – Journal of Educational Measurement, 2024
Jihong Zhang; Jonathan Templin; Xinya Liang – Journal of Educational Measurement, 2024
Recently, Bayesian diagnostic classification modeling has become popular in health psychology, education, and sociology. Typically, information criteria are used for model selection when researchers want to choose the best model among alternatives. In Bayesian estimation, posterior predictive checking is a flexible Bayesian model…
Descriptors: Bayesian Statistics, Cognitive Measurement, Models, Classification
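The posterior predictive checking named in the abstract above can be illustrated with a minimal conjugate sketch (this is a generic Beta-binomial example, not the authors' diagnostic classification setup; the counts and seed are hypothetical): draw posterior samples, simulate replicated data from each draw, and compare a discrepancy statistic against the observed value.

```python
import random

random.seed(0)

# Hypothetical observed data: 40 correct responses out of 60 items.
n, y = 60, 40

# Beta(1, 1) prior + binomial likelihood -> Beta(1 + y, 1 + n - y) posterior.
post = [random.betavariate(1 + y, 1 + n - y) for _ in range(2000)]

def rbinom(trials, p):
    # Simple binomial draw: count of successes in `trials` Bernoulli trials.
    return sum(random.random() < p for _ in range(trials))

# For each posterior draw, generate a replicated data set and record the
# discrepancy statistic (here simply the number of correct responses).
y_rep = [rbinom(n, theta) for theta in post]

# Posterior predictive p-value: fraction of replicates at least as extreme
# as the observed statistic. Values near 0 or 1 flag model misfit.
ppp = sum(yr >= y for yr in y_rep) / len(y_rep)
print(round(ppp, 2))
```

Because the replicated data here come from the same model that generated the posterior, the p-value lands near 0.5; in a real check, an extreme value would signal that the model fails to reproduce that feature of the data.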
Peer reviewed
Sinharay, Sandip – Journal of Educational Measurement, 2023
Technical difficulties and other unforeseen events occasionally lead to incomplete data on educational tests, which necessitates the reporting of imputed scores to some examinees. While several approaches exist for reporting imputed scores, there is a lack of guidance on reporting the uncertainty of imputed scores. In this paper,…
Descriptors: Evaluation Methods, Scores, Standardized Tests, Simulation
Peer reviewed
Thompson, W. Jake; Nash, Brooke; Clark, Amy K.; Hoover, Jeffrey C. – Journal of Educational Measurement, 2023
As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment…
Descriptors: Diagnostic Tests, Simulation, Test Reliability, Accuracy
Peer reviewed
Yamaguchi, Kazuhiro; Zhang, Jihong – Journal of Educational Measurement, 2023
This study proposed Gibbs sampling algorithms for variable selection in a latent regression model under a unidimensional two-parameter logistic item response theory model. Three types of shrinkage priors were employed to obtain shrinkage estimates: double-exponential (i.e., Laplace), horseshoe, and horseshoe+ priors. These shrinkage priors were…
Descriptors: Algorithms, Simulation, Mathematics Achievement, Bayesian Statistics
Peer reviewed
Park, Seohee; Kim, Kyung Yong; Lee, Won-Chan – Journal of Educational Measurement, 2023
Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the widespread use of multiple measures, there is little research on their classification consistency and accuracy. Accordingly, this study introduces an…
Descriptors: Testing, Computation, Classification, Accuracy
Peer reviewed
Binici, Salih; Cuhadar, Ismail – Journal of Educational Measurement, 2022
Validity of performance standards is a key element for the defensibility of standard setting results, and validating performance standards requires collecting multiple pieces of evidence at every step during the standard setting process. This study employs a statistical procedure, latent class analysis, to set performance standards and compares…
Descriptors: Validity, Performance, Standards, Multivariate Analysis
Peer reviewed
Tong Wu; Stella Y. Kim; Carl Westine; Michelle Boyer – Journal of Educational Measurement, 2025
While significant attention has been given to test equating to ensure score comparability, limited research has explored equating methods for rater-mediated assessments, where human raters inherently introduce error. If not properly addressed, these errors can undermine score interchangeability and test validity. This study proposes an equating…
Descriptors: Item Response Theory, Evaluators, Error of Measurement, Test Validity
Peer reviewed
Combs, Adam – Journal of Educational Measurement, 2023
A common method of checking person-fit in Bayesian item response theory (IRT) is the posterior-predictive (PP) method. In recent years, more powerful approaches have been proposed that are based on resampling methods using the popular l*[subscript z] statistic. A new Bayesian model checking method has also been proposed based on pivotal…
Descriptors: Bayesian Statistics, Goodness of Fit, Evaluation Methods, Monte Carlo Methods
Peer reviewed
He, Yinhong – Journal of Educational Measurement, 2023
Back random responding (BRR) behavior is one of the commonly observed careless response behaviors. Accurately detecting BRR behavior can improve test validity. Yu and Cheng (2019) showed that the change point analysis (CPA) procedure based on weighted residuals (CPA-WR) performed well in detecting BRR. Compared with the CPA procedure, the…
Descriptors: Test Validity, Item Response Theory, Measurement, Monte Carlo Methods
Peer reviewed
Corinne Huggins-Manley; Anthony W. Raborn; Peggy K. Jones; Ted Myers – Journal of Educational Measurement, 2024
The purpose of this study is to develop a nonparametric DIF method that (a) compares focal groups directly to the composite group that will be used to develop the reported test score scale, and (b) allows practitioners to explore for DIF related to focal groups stemming from multicategorical variables that constitute a small proportion of the…
Descriptors: Nonparametric Statistics, Test Bias, Scores, Statistical Significance
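The core idea in the abstract above, comparing a focal group directly to the composite group while conditioning on a matching score, can be sketched with a small simulation (this is a generic conditional-proportion illustration with injected DIF, not the authors' actual procedure; all data and effect sizes are hypothetical):

```python
import random

random.seed(1)

# Hypothetical response records: (group, total_score, item_correct).
# The focal group is a small fraction of the composite, as in the abstract.
data = []
for _ in range(2000):
    group = "focal" if random.random() < 0.1 else "reference"
    score = random.randint(0, 10)
    # Inject DIF: at the same total score, focal examinees are 15 points
    # less likely to answer the studied item correctly.
    p = 0.3 + 0.05 * score - (0.15 if group == "focal" else 0.0)
    data.append((group, score, random.random() < p))

def prop(rows):
    # Proportion answering the studied item correctly.
    return sum(r[2] for r in rows) / len(rows)

# At each total-score level, compare the focal group's proportion correct
# to the composite (focal + reference) proportion.
gaps = []
for s in range(11):
    at_s = [r for r in data if r[1] == s]
    focal = [r for r in at_s if r[0] == "focal"]
    if focal:
        gaps.append(prop(at_s) - prop(focal))

# A positive average gap suggests the item disadvantages the focal group.
avg_gap = sum(gaps) / len(gaps)
print(round(avg_gap, 3))
```

Conditioning on the composite-based matching score, rather than on a reference-group score, mirrors the abstract's point that the composite group is what defines the reported score scale.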
Peer reviewed
Lee, Yi-Hsuan; Haberman, Shelby J. – Journal of Educational Measurement, 2021
For assessments that use different forms in different administrations, equating methods are applied to ensure comparability of scores over time. Ideally, a score scale is well maintained throughout the life of a testing program. In reality, instability of a score scale can result from a variety of causes, some of which are expected while others may be…
Descriptors: Scores, Regression (Statistics), Demography, Data
Peer reviewed
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Peer reviewed
Joo, Seang-Hwane; Lee, Philseok – Journal of Educational Measurement, 2022
This study proposes a new Bayesian differential item functioning (DIF) detection method using posterior predictive model checking (PPMC). Item fit measures including infit, outfit, observed score distribution (OSD), and Q1 were considered as discrepancy statistics for the PPMC DIF methods. The performance of the PPMC DIF method was…
Descriptors: Test Items, Bayesian Statistics, Monte Carlo Methods, Prediction
Peer reviewed
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Peer reviewed
Wolkowitz, Amanda A. – Journal of Educational Measurement, 2021
Decision consistency (DC) is the reliability of a classification decision based on a test score. In professional credentialing, the decision is often a high-stakes pass/fail decision. The current methods for estimating DC are computationally complex. The purpose of this research is to provide a computationally and conceptually simple method for…
Descriptors: Decision Making, Reliability, Classification, Scores
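The decision consistency (DC) defined in the abstract above, the reliability of a pass/fail classification, can be approximated with a brute-force simulation (this is a generic illustration, not the paper's computationally simple method; the cut score, SEM, and true-score distribution are all assumed):

```python
import random

random.seed(2)

cut = 70          # hypothetical pass/fail cut score
sem = 5.0         # assumed standard error of measurement

def observed(true_score):
    # Observed score = true score + normally distributed measurement error.
    return true_score + random.gauss(0.0, sem)

# DC: probability that two independent parallel administrations yield the
# same pass/fail decision for the same examinee.
trials = 20000
agree = 0
for _ in range(trials):
    true_score = random.gauss(72.0, 10.0)  # assumed true-score distribution
    a = observed(true_score) >= cut
    b = observed(true_score) >= cut        # independent parallel form
    agree += (a == b)

dc = agree / trials
print(round(dc, 3))
```

Disagreements concentrate among examinees whose true scores sit near the cut, which is why DC falls as the SEM grows or as more of the score distribution piles up at the cut score.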