ERIC - Search Results

Publication Date

In 2025	7
Since 2024	18
Since 2021 (last 5 years)	71
Since 2016 (last 10 years)	164

Descriptor

Test Items	60
Test Construction	36
Scores	35
Computer Assisted Testing	29
Test Validity	29
Item Response Theory	27
Evaluation Methods	25
Models	21
Achievement Tests	19
Educational Assessment	19
Foreign Countries	16
Psychometrics	16
Test Reliability	15
Decision Making	14
Measurement	14
Test Bias	14
International Assessment	13
Standards	13
Tests	13
Validity	13
Comparative Analysis	12
High Stakes Tests	12
Mathematics Tests	12
Testing Problems	12
Academic Achievement	11
More ▼

Source

Educational Measurement:…

164

Publication Type

Journal Articles	164
Reports - Research	95
Reports - Descriptive	42
Reports - Evaluative	25
Information Analyses	4
Opinion Papers	2
Speeches/Meeting Papers	1
Tests/Questionnaires	1

Education Level

Secondary Education	24
Higher Education	17
Postsecondary Education	15
High Schools	8
Elementary Education	7
Elementary Secondary Education	7
Middle Schools	7
Junior High Schools	6
Early Childhood Education	3
Grade 4	3
Intermediate Grades	3
Adult Education	2
Grade 8	2
Grade 10	1
Grade 3	1
Grade 5	1
Grade 6	1
Grade 7	1
Grade 9	1
High School Equivalency…	1
Preschool Education	1
Primary Education	1
More ▼

Audience

Location

Canada	2
Germany	2
Greece	2
United States	2
Australia	1
California	1
Finland	1
Massachusetts	1
Oregon	1
Singapore	1
United Kingdom (England)	1
More ▼

Laws, Policies, & Programs

Every Student Succeeds Act…	3
No Child Left Behind Act 2001	1

Assessments and Surveys

Program for International…	10
SAT (College Admission Test)	5
ACT Assessment	4
Graduate Record Examinations	1
Progress in International…	1
Test of English as a Foreign…	1
Trends in International…	1

What Works Clearinghouse Rating

Showing 1 to 15 of 164 results Save | Export

Guesses and Slips as Proficiency-Related Phenomena and Impacts on Parameter Invariance

Peer reviewed

Direct link

Xiangyi Liao; Daniel M Bolt – Educational Measurement: Issues and Practice, 2024

Traditional approaches to the modeling of multiple-choice item response data (e.g., 3PL, 4PL models) emphasize slips and guesses as random events. In this paper, an item response model is presented that characterizes both disjunctively interacting guessing and conjunctively interacting slipping processes as proficiency-related phenomena. We show…

Descriptors: Item Response Theory, Test Items, Error Correction, Guessing (Tests)

The Multidimensionality of Measurement Bias in High-Stakes Testing: Using Machine Learning to Evaluate Complex Sources of Differential Item Functioning

Peer reviewed

Direct link

Belzak, William C. M. – Educational Measurement: Issues and Practice, 2023

Test developers and psychometricians have historically examined measurement bias and differential item functioning (DIF) across a single categorical variable (e.g., gender), independently of other variables (e.g., race, age, etc.). This is problematic when more complex forms of measurement bias may adversely affect test responses and, ultimately,…

Descriptors: Test Bias, High Stakes Tests, Artificial Intelligence, Test Items

Examining Gender Differences in TIMSS 2019 Using a Multiple-Group Hierarchical Speed-Accuracy-Revisits Model

Peer reviewed

Direct link

Dihao Leng; Ummugul Bezirhan; Lale Khorramdel; Bethany Fishbein; Matthias von Davier – Educational Measurement: Issues and Practice, 2024

This study capitalizes on response and process data from the computer-based TIMSS 2019 Problem Solving and Inquiry tasks to investigate gender differences in test-taking behaviors and their association with mathematics achievement at the eighth grade. Specifically, a recently proposed hierarchical speed-accuracy-revisits (SAR) model was adapted to…

Descriptors: Gender Differences, Test Wiseness, Achievement Tests, Mathematics Tests

A Workflow for Minimizing Errors in Template-Based Automated Item-Generation Development

Peer reviewed

Direct link

Yanyan Fu – Educational Measurement: Issues and Practice, 2024

The template-based automated item-generation (TAIG) approach that involves template creation, item generation, item selection, field-testing, and evaluation has more steps than the traditional item development method. Consequentially, there is more margin for error in this process, and any template errors can be cascaded to the generated items.…

Descriptors: Error Correction, Automation, Test Items, Test Construction

Defining Test-Score Interpretation, Use, and Claims: Delphi Study for the Validity Argument

Peer reviewed

Direct link

Folger, Timothy D.; Bostic, Jonathan; Krupa, Erin E. – Educational Measurement: Issues and Practice, 2023

Validity is a fundamental consideration of test development and test evaluation. The purpose of this study is to define and reify three key aspects of validity and validation, namely test-score interpretation, test-score use, and the claims supporting interpretation and use. This study employed a Delphi methodology to explore how experts in…

Descriptors: Test Interpretation, Scores, Test Use, Test Validity

Instruction-Tuned Large-Language Models for Quality Control in Automatic Item Generation: A Feasibility Study

Peer reviewed

Direct link

Guher Gorgun; Okan Bulut – Educational Measurement: Issues and Practice, 2025

Automatic item generation may supply many items instantly and efficiently to assessment and learning environments. Yet, the evaluation of item quality persists to be a bottleneck for deploying generated items in learning and assessment settings. In this study, we investigated the utility of using large-language models, specifically Llama 3-8B, for…

Descriptors: Artificial Intelligence, Quality Control, Technology Uses in Education, Automation

Do Subject Matter Experts' Judgments of Multiple-Choice Format Suitability Predict Item Quality?

Peer reviewed

Direct link

Berenbon, Rebecca F.; McHugh, Bridget C. – Educational Measurement: Issues and Practice, 2023

To assemble a high-quality test, psychometricians rely on subject matter experts (SMEs) to write high-quality items. However, SMEs are not typically given the opportunity to provide input on which content standards are most suitable for multiple-choice questions (MCQs). In the present study, we explored the relationship between perceived MCQ…

Descriptors: Test Items, Multiple Choice Tests, Standards, Difficulty Level

Supporting the Interpretive Validity of Student-Level Claims in Science Assessment with Tiered Claim Structures

Peer reviewed

Direct link

Student, Sanford R.; Gong, Brian – Educational Measurement: Issues and Practice, 2022

We address two persistent challenges in large-scale assessments of the Next Generation Science Standards: (a) the validity of score interpretations that target the standards broadly and (b) how to structure claims for assessments of this complex domain. The NGSS pose a particular challenge for specifying claims about students that evidence from…

Descriptors: Science Tests, Test Validity, Test Items, Test Construction

Mode Effects in College Admissions Testing and Differential Speededness as a Possible Explanation

Peer reviewed

Direct link

Steedle, Jeffrey T.; Cho, Young Woo; Wang, Shichao; Arthur, Ann M.; Li, Dongmei – Educational Measurement: Issues and Practice, 2022

As testing programs transition from paper to online testing, they must study mode comparability to support the exchangeability of scores from different testing modes. To that end, a series of three mode comparability studies was conducted during the 2019-2020 academic year with examinees randomly assigned to take the ACT college admissions exam on…

Descriptors: College Entrance Examinations, Computer Assisted Testing, Scores, Test Format

Exploration of Latent Structure in Test Revision and Review Log Data

Peer reviewed

Direct link

Zhang, Susu; Li, Anqi; Wang, Shiyu – Educational Measurement: Issues and Practice, 2023

In computer-based tests allowing revision and reviews, examinees' sequence of visits and answer changes to questions can be recorded. The variable-length revision log data introduce new complexities to the collected data but, at the same time, provide additional information on examinees' test-taking behavior, which can inform test development and…

Descriptors: Computer Assisted Testing, Test Construction, Test Wiseness, Test Items

Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018

Peer reviewed

Direct link

Jung Yeon Park; Sean Joo; Zikun Li; Hyejin Yoon – Educational Measurement: Issues and Practice, 2025

This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science).…

Descriptors: Achievement Tests, Secondary School Students, International Assessment, Test Bias

Adjusting for Ability Differences of Equating Samples When Randomization Is Suboptimal

Peer reviewed

Direct link

Kim, Sooyeon; Walker, Michael E. – Educational Measurement: Issues and Practice, 2022

Test equating requires collecting data to link the scores from different forms of a test. Problems arise when equating samples are not equivalent and the test forms to be linked share no common items by which to measure or adjust for the group nonequivalence. Using data from five operational test forms, we created five pairs of research forms for…

Descriptors: Ability, Tests, Equated Scores, Testing Problems

Evolving Educational Testing to Meet Students' Needs: Design-in-Real-Time Assessment

Peer reviewed

Direct link

Stephen G. Sireci; Javier Suárez-Álvarez; April L. Zenisky; Maria Elena Oliveri – Educational Measurement: Issues and Practice, 2024

The goal in personalized assessment is to best fit the needs of each individual test taker, given the assessment purposes. Design-in-Real-Time (DIRTy) assessment reflects the progressive evolution in testing from a single test, to an adaptive test, to an adaptive assessment "system." In this article, we lay the foundation for DIRTy…

Descriptors: Educational Assessment, Student Needs, Test Format, Test Construction

Item Response Theory Models for Polytomous Multidimensional Forced-Choice Items to Measure Construct Differentiation

Peer reviewed

Direct link

Xuelan Qiu; Jimmy de la Torre; You-Gan Wang; Jinran Wu – Educational Measurement: Issues and Practice, 2024

Multidimensional forced-choice (MFC) items have been found to be useful to reduce response biases in personality assessments. However, conventional scoring methods for the MFC items result in ipsative data, hindering the wider applications of the MFC format. In the last decade, a number of item response theory (IRT) models have been developed,…

Descriptors: Item Response Theory, Personality Traits, Personality Measures, Personality Assessment

Disrupted Data: Using Longitudinal Assessment Systems to Monitor Test Score Quality

Peer reviewed

Direct link

An, Lily Shiao; Ho, Andrew Dean; Davis, Laurie Laughlin – Educational Measurement: Issues and Practice, 2022

Technical documentation for educational tests focuses primarily on properties of individual scores at single points in time. Reliability, standard errors of measurement, item parameter estimates, fit statistics, and linking constants are standard technical features that external stakeholders use to evaluate items and individual scale scores.…

Descriptors: Documentation, Scores, Evaluation Methods, Longitudinal Studies

Previous Page | Next Page »

Pages: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11

Rios, Joseph A.	5
Sinharay, Sandip	5
Sireci, Stephen G.	4
Wind, Stefanie A.	4
Bennett, Randy E.	3
Ihlenfeldt, Samuel D.	3
Khorramdel, Lale	3
Kuncel, Nathan R.	3
Sackett, Paul R.	3
Wyse, Adam E.	3
Yamamoto, Kentaro	3
Ames, Allison	2
Burkhardt, Amy	2
Domingue, Benjamin W.	2
Grammatikopoulos, Vasilis	2
Gregoriadis, Athanasios	2
Guo, Hongwen	2
Jones, Andrew T.	2
Katz, Irvin R.	2
Keehner, Madeleine	2
Kosh, Audra E.	2
Krupa, Erin E.	2
Leventhal, Brian C.	2
Lewis, Jennifer	2
Li, Dongmei	2
More ▼