ERIC - Search Results

Publication Date

In 2025	1
Since 2024	1
Since 2021 (last 5 years)	4
Since 2016 (last 10 years)	13
Since 2006 (last 20 years)	20

Descriptor

Test Items	29
Test Validity	15
Test Construction	13
Item Analysis	11
Validity	11
Scores	7
Computer Assisted Testing	5
Achievement Tests	4
Difficulty Level	4
Evaluation Methods	4
Standards	4
Testing Problems	4
Academic Standards	3
Cutting Scores	3
Educational Assessment	3
Elementary Secondary Education	3
Expertise	3
Foreign Countries	3
Models	3
Norm Referenced Tests	3
Psychometrics	3
Reading Comprehension	3
Scoring	3
Standard Setting (Scoring)	3
Standardized Tests	3
More ▼

Source

Educational Measurement:…

Publication Type

Journal Articles	29
Reports - Research	12
Reports - Descriptive	11
Opinion Papers	4
Reports - Evaluative	3
Guides - Non-Classroom	1
Information Analyses	1

Education Level

Higher Education	2
Postsecondary Education	2
Secondary Education	2

Audience

Location

Germany	2
Canada	1
Oregon	1

Laws, Policies, & Programs

No Child Left Behind Act 2001

Assessments and Surveys

Program for International…	1
SAT (College Admission Test)	1
Stanford Achievement Tests	1

What Works Clearinghouse Rating

Showing 1 to 15 of 29 results Save | Export

Setting and Validating Multiple Standards on a Multistage-Adaptive Test

Peer reviewed

Direct link

Lewis, Jennifer; Lim, Hwanggyu; Padellaro, Frank; Sireci, Stephen G.; Zenisky, April L. – Educational Measurement: Issues and Practice, 2022

Setting cut scores on (MSTs) is difficult, particularly when the test spans several grade levels, and the selection of items from MST panels must reflect the operational test specifications. In this study, we describe, illustrate, and evaluate three methods for mapping panelists' Angoff ratings into cut scores on the scale underlying an MST. The…

Descriptors: Cutting Scores, Adaptive Testing, Test Items, Item Analysis

Instruction-Tuned Large-Language Models for Quality Control in Automatic Item Generation: A Feasibility Study

Peer reviewed

Direct link

Guher Gorgun; Okan Bulut – Educational Measurement: Issues and Practice, 2025

Automatic item generation may supply many items instantly and efficiently to assessment and learning environments. Yet, the evaluation of item quality persists to be a bottleneck for deploying generated items in learning and assessment settings. In this study, we investigated the utility of using large-language models, specifically Llama 3-8B, for…

Descriptors: Artificial Intelligence, Quality Control, Technology Uses in Education, Automation

Supporting the Interpretive Validity of Student-Level Claims in Science Assessment with Tiered Claim Structures

Peer reviewed

Direct link

Student, Sanford R.; Gong, Brian – Educational Measurement: Issues and Practice, 2022

We address two persistent challenges in large-scale assessments of the Next Generation Science Standards: (a) the validity of score interpretations that target the standards broadly and (b) how to structure claims for assessments of this complex domain. The NGSS pose a particular challenge for specifying claims about students that evidence from…

Descriptors: Science Tests, Test Validity, Test Items, Test Construction

Disrupted Data: Using Longitudinal Assessment Systems to Monitor Test Score Quality

Peer reviewed

Direct link

An, Lily Shiao; Ho, Andrew Dean; Davis, Laurie Laughlin – Educational Measurement: Issues and Practice, 2022

Technical documentation for educational tests focuses primarily on properties of individual scores at single points in time. Reliability, standard errors of measurement, item parameter estimates, fit statistics, and linking constants are standard technical features that external stakeholders use to evaluate items and individual scale scores.…

Descriptors: Documentation, Scores, Evaluation Methods, Longitudinal Studies

Evaluating Content-Related Validity Evidence Using a Text-Based Machine Learning Procedure

Peer reviewed

Direct link

Anderson, Daniel; Rowley, Brock; Stegenga, Sondra; Irvin, P. Shawn; Rosenberg, Joshua M. – Educational Measurement: Issues and Practice, 2020

Validity evidence based on test content is critical to meaningful interpretation of test scores. Within high-stakes testing and accountability frameworks, content-related validity evidence is typically gathered via alignment studies, with panels of experts providing qualitative judgments on the degree to which test items align with the…

Descriptors: Content Validity, Artificial Intelligence, Test Items, Vocabulary

The Effect of Drag-and-Drop Item Features on Test-Taker Performance and Response Strategies

Peer reviewed

Direct link

Arslan, Burcu; Jiang, Yang; Keehner, Madeleine; Gong, Tao; Katz, Irvin R.; Yan, Fred – Educational Measurement: Issues and Practice, 2020

Computer-based educational assessments often include items that involve drag-and-drop responses. There are different ways that drag-and-drop items can be laid out and different choices that test developers can make when designing these items. Currently, these decisions are based on experts' professional judgments and design constraints, rather…

Descriptors: Test Items, Computer Assisted Testing, Test Format, Decision Making

Construct Equivalence of PISA Reading Comprehension Measured with Paper-Based and Computer-Based Assessments

Peer reviewed

Direct link

Kroehne, Ulf; Buerger, Sarah; Hahnel, Carolin; Goldhammer, Frank – Educational Measurement: Issues and Practice, 2019

For many years, reading comprehension in the Programme for International Student Assessment (PISA) was measured via paper-based assessment (PBA). In the 2015 cycle, computer-based assessment (CBA) was introduced, raising the question of whether central equivalence criteria required for a valid interpretation of the results are fulfilled. As an…

Descriptors: Reading Comprehension, Computer Assisted Testing, Achievement Tests, Foreign Countries

Understanding Examinees' Responses to Items: Implications for Measurement

Peer reviewed

Direct link

Embretson, Susan E. – Educational Measurement: Issues and Practice, 2016

Examinees' thinking processes have become an increasingly important concern in testing. The responses processes aspect is a major component of validity, and contemporary tests increasingly involve specifications about the cognitive complexity of examinees' response processes. Yet, empirical research findings on examinees' cognitive processes are…

Descriptors: Testing, Cognitive Processes, Test Construction, Test Items

Validating Automated Measures of Text Complexity

Peer reviewed

Direct link

Sheehan, Kathleen M. – Educational Measurement: Issues and Practice, 2017

Automated text complexity measurement tools (also called readability metrics) have been proposed as a way to help teachers, textbook publishers, and assessment developers select texts that are closely aligned with the new, more demanding text complexity expectations specified in the Common Core State Standards. This article examines a critical…

Descriptors: Reading Material Selection, Difficulty Level, Common Core State Standards, Validity

Easier Said than Done: Rejoinder on Sijtsma and on Green and Yang

Peer reviewed

Direct link

Davenport, Ernest C.; Davison, Mark L.; Liou, Pey-Yan; Love, Quintin U. – Educational Measurement: Issues and Practice, 2016

The main points of Sijtsma and Green and Yang in Educational Measurement: Issues and Practice (34, 4) are that reliability, internal consistency, and unidimensionality are distinct and that Cronbach's alpha may be problematic. Neither of these assertions are at odds with Davenport, Davison, Liou, and Love in the same issue. However, many authors…

Descriptors: Educational Assessment, Reliability, Validity, Test Construction

Rapid-Guessing Behavior: Its Identification, Interpretation, and Implications

Peer reviewed

Direct link

Wise, Steven L. – Educational Measurement: Issues and Practice, 2017

The rise of computer-based testing has brought with it the capability to measure more aspects of a test event than simply the answers selected or constructed by the test taker. One behavior that has drawn much research interest is the time test takers spend responding to individual multiple-choice items. In particular, very short response…

Descriptors: Guessing (Tests), Multiple Choice Tests, Test Items, Reaction Time

A Process for Reviewing and Evaluating Generated Test Items

Peer reviewed

Direct link

Gierl, Mark J.; Lai, Hollis – Educational Measurement: Issues and Practice, 2016

Testing organization needs large numbers of high-quality items due to the proliferation of alternative test administration methods and modern test designs. But the current demand for items far exceeds the supply. Test items, as they are currently written, evoke a process that is both time-consuming and expensive because each item is written,…

Descriptors: Test Items, Test Construction, Psychometrics, Models

Effect of Content Knowledge on Angoff-Style Standard Setting Judgments

Peer reviewed

Direct link

Margolis, Melissa J.; Mee, Janet; Clauser, Brian E.; Winward, Marcia; Clauser, Jerome C. – Educational Measurement: Issues and Practice, 2016

Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to…

Descriptors: Academic Standards, Standard Setting (Scoring), Cutting Scores, Credibility

Validating Student Score Inferences with Person-Fit Statistic and Verbal Reports: A Person-Fit Study for Cognitive Diagnostic Assessment

Peer reviewed

Direct link

Cui, Ying; Roberts, Mary Roduta – Educational Measurement: Issues and Practice, 2013

The goal of this study was to investigate the usefulness of person-fit analysis in validating student score inferences in a cognitive diagnostic assessment. In this study, a two-stage procedure was used to evaluate person fit for a diagnostic test in the domain of statistical hypothesis testing. In the first stage, the person-fit statistic, the…

Descriptors: Scores, Validity, Cognitive Tests, Diagnostic Tests

Setting Standards for English Foreign Language Assessment: Methodology, Validation, and a Degree of Arbitrariness

Peer reviewed

Direct link

Tiffin-Richards, Simon P.; Pant, Hans Anand; Koller, Olaf – Educational Measurement: Issues and Practice, 2013

Cut-scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard-setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item…

Descriptors: Foreign Countries, English (Second Language), Language Tests, Standard Setting (Scoring)

Previous Page | Next Page »

Pages: 1 | 2

Sireci, Stephen G.	3
An, Lily Shiao	1
Anderson, Daniel	1
Arslan, Burcu	1
Bond, Lloyd	1
Bottsford-Miller, Nicole A.	1
Buerger, Sarah	1
Carter, Kathy	1
Clauser, Brian E.	1
Clauser, Jerome C.	1
Cui, Ying	1
Davenport, Ernest C.	1
Davis, Laurie Laughlin	1
Davison, Mark L.	1
Drasgow, Fritz	1
Embretson, Susan E.	1
Frisbie, David A.	1
Gierl, Mark J.	1
Goldhammer, Frank	1
Gong, Brian	1
Gong, Tao	1
Gramenz, Gary W.	1
Guher Gorgun	1
Hahnel, Carolin	1
More ▼