ERIC - Search Results

Publication Date

In 2025	4
Since 2024	4
Since 2021 (last 5 years)	8
Since 2016 (last 10 years)	10
Since 2006 (last 20 years)	19

Descriptor

Evaluation Methods	32
Test Items	9
Computer Assisted Testing	8
Educational Testing	8
Item Response Theory	7
Measurement	7
Psychometrics	7
Simulation	7
Testing Problems	7
Evaluation Research	6
Student Evaluation	6
Educational Assessment	5
Scores	5
Scoring	5
Secondary Education	5
Accuracy	4
Adaptive Testing	4
Comparative Analysis	4
Evaluation Problems	4
Models	4
Test Bias	4
Test Construction	4
Test Reliability	4
Test Use	4
Testing	4
More ▼

Source

Journal of Educational…

Publication Type

Journal Articles	29
Reports - Research	16
Reports - Evaluative	8
Book/Product Reviews	2
Opinion Papers	2
Reports - Descriptive	1

Education Level

Higher Education	3
Postsecondary Education	3
Elementary Secondary Education	1
Secondary Education	1

Audience

Location

Ireland

Laws, Policies, & Programs

Assessments and Surveys

Advanced Placement…	1
California Achievement Tests	1
Graduate Record Examinations	1
SAT (College Admission Test)	1
Sequential Tests of…	1

What Works Clearinghouse Rating

Showing 1 to 15 of 32 results Save | Export

Using Multiple Maximum Exposure Rates in Computerized Adaptive Testing

Peer reviewed

Direct link

Kylie Gorney; Mark D. Reckase – Journal of Educational Measurement, 2025

In computerized adaptive testing, item exposure control methods are often used to provide a more balanced usage of the item pool. Many of the most popular methods, including the restricted method (Revuelta and Ponsoda), use a single maximum exposure rate to limit the proportion of times that each item is administered. However, Barrada et al.…

Descriptors: Computer Assisted Testing, Adaptive Testing, Test Items, Item Banks

Using Automated Procedures to Score Educational Essays Written in Three Languages

Peer reviewed

Direct link

Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025

The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…

Descriptors: College Students, Slavic Languages, German, Italian

Measuring the Uncertainty of Imputed Scores

Peer reviewed

Direct link

Sinharay, Sandip – Journal of Educational Measurement, 2023

Technical difficulties and other unforeseen events occasionally lead to incomplete data on educational tests, which necessitates the reporting of imputed scores to some examinees. While there exist several approaches for reporting imputed scores, there is a lack of any guidance on the reporting of the uncertainty of imputed scores. In this paper,…

Descriptors: Evaluation Methods, Scores, Standardized Tests, Simulation

Simultaneous Constrained Adaptive Item Selection for Group-Based Testing

Peer reviewed

Direct link

Bengs, Daniel; Kroehne, Ulf; Brefeld, Ulf – Journal of Educational Measurement, 2021

By tailoring test forms to the test-taker's proficiency, Computerized Adaptive Testing (CAT) enables substantial increases in testing efficiency over fixed forms testing. When used for formative assessment, the alignment of task difficulty with proficiency increases the chance that teachers can derive useful feedback from assessment data. The…

Descriptors: Computer Assisted Testing, Formative Evaluation, Group Testing, Program Effectiveness

Estimating Classification Accuracy and Consistency Indices for Multiple Measures with the Simple Structure MIRT Model

Peer reviewed

Direct link

Park, Seohee; Kim, Kyung Yong; Lee, Won-Chan – Journal of Educational Measurement, 2023

Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the popular usages of multiple measures, there is little research on classification consistency and accuracy of multiple measures. Accordingly, this study introduces an…

Descriptors: Testing, Computation, Classification, Accuracy

Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

Peer reviewed

Direct link

Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…

Descriptors: Automation, Grading, Computer Assisted Testing, Scoring

Using Multilabel Neural Network to Score High-Dimensional Assessments for Different Use Foci: An Example with College Major Preference Assessment

Peer reviewed

Direct link

Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025

Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…

Descriptors: Tests, Testing, Scores, Test Construction

Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment

Peer reviewed

Direct link

Dorsey, David W.; Michaels, Hillary R. – Journal of Educational Measurement, 2022

We have dramatically advanced our ability to create rich, complex, and effective assessments across a range of uses through technology advancement. Artificial Intelligence (AI) enabled assessments represent one such area of advancement--one that has captured our collective interest and imagination. Scientists and practitioners within the domains…

Descriptors: Validity, Ethics, Artificial Intelligence, Evaluation Methods

Comparing the Accuracy of Student Growth Measures

Peer reviewed

Direct link

Castellano, Katherine E.; McCaffrey, Daniel F. – Journal of Educational Measurement, 2020

Testing programs are often interested in using a student growth measure. This article presents analytic derivations of the accuracy of common student growth measures on both the raw scale of the test and the percentile rank scale in terms of the proportional reduction in mean squared error and the squared correlation between the estimator and…

Descriptors: Student Evaluation, Accuracy, Testing, Student Development

Classroom Assessment and Large-Scale Psychometrics: Shall the Twain Meet? (A Conversation with Margaret Heritage and Neal Kingston)

Peer reviewed

Direct link

Heritage, Margaret; Kingston, Neal M. – Journal of Educational Measurement, 2019

Classroom assessment and large-scale assessment have, for the most part, existed in mutual isolation. Some experts have felt this is for the best and others have been concerned that the schism limits the potential contribution of both forms of assessment. Margaret Heritage has long been a champion of best practices in classroom assessment. Neal…

Descriptors: Measurement, Psychometrics, Context Effect, Classroom Environment

Differential Item Functioning Assessment in Cognitive Diagnostic Modeling: Application of the Wald Test to Investigate DIF in the DINA Model

Peer reviewed

Direct link

Hou, Likun; de la Torre, Jimmy; Nandakumar, Ratna – Journal of Educational Measurement, 2014

Analyzing examinees' responses using cognitive diagnostic models (CDMs) has the advantage of providing diagnostic information. To ensure the validity of the results from these models, differential item functioning (DIF) in CDMs needs to be investigated. In this article, the Wald test is proposed to examine DIF in the context of CDMs. This study…

Descriptors: Test Bias, Models, Simulation, Error Patterns

Multilevel Modeling of Item Position Effects

Peer reviewed

Direct link

Albano, Anthony D. – Journal of Educational Measurement, 2013

In many testing programs it is assumed that the context or position in which an item is administered does not have a differential effect on examinee responses to the item. Violations of this assumption may bias item response theory estimates of item and person parameters. This study examines the potentially biasing effects of item position. A…

Descriptors: Test Items, Item Response Theory, Test Format, Questioning Techniques

Monitoring Rater Performance over Time: A Framework for Detecting Differential Accuracy and Differential Scale Category Use

Peer reviewed

Direct link

Myford, Carol M.; Wolfe, Edward W. – Journal of Educational Measurement, 2009

In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition…

Descriptors: English Literature, Advanced Placement, Measures (Individuals), Writing (Composition)

Judges' Use of Examinee Performance Data in an Angoff Standard-Setting Exercise for a Medical Licensing Examination: An Experimental Study

Peer reviewed

Direct link

Clauser, Brian E.; Mee, Janet; Baldwin, Su G.; Margolis, Melissa J.; Dillon, Gerard F. – Journal of Educational Measurement, 2009

Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple-choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide…

Descriptors: Standard Setting (Scoring), Program Effectiveness, Expertise, Health Personnel

The Hierarchy Consistency Index: Evaluating Person Fit for Cognitive Diagnostic Assessment

Peer reviewed

Direct link

Cui, Ying; Leighton, Jacqueline P. – Journal of Educational Measurement, 2009

In this article, we introduce a person-fit statistic called the hierarchy consistency index (HCI) to help detect misfitting item response vectors for tests developed and analyzed based on a cognitive model. The HCI ranges from -1.0 to 1.0, with values close to -1.0 indicating that students respond unexpectedly or differently from the responses…

Descriptors: Test Length, Simulation, Correlation, Research Methodology

Previous Page | Next Page »

Pages: 1 | 2 | 3

Albano, Anthony D.	1
Amery D. Wu	1
Armstrong, Ronald D.	1
Askegaard, Lewis D.	1
Baldwin, Su G.	1
Bengs, Daniel	1
Biggs, J. B.	1
Bolger, Niall	1
Braun, P. H.	1
Brefeld, Ulf	1
Breithaupt, Krista	1
Castellano, Katherine E.	1
Chen, Shu-Ying	1
Chuah, Siang Chee	1
Clauser, Brian E.	1
Cui, Ying	1
Davison, Mark L.	1
Dillon, Gerard F.	1
Ding, Cody S.	1
Dorans, Neil J.	1
Dorsey, David W.	1
Eignor, Daniel R.	1
Hamid Mohammadi	1
Heritage, Margaret	1
More ▼