ERIC - Search Results

Showing 1 to 15 of 17 results

Grading Exams Using Large Language Models: A Comparison between Human and AI Grading of Exams in Higher Education Using ChatGPT

Peer reviewed

Jonas Flodén – British Educational Research Journal, 2025

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…

Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring

Comparing Measurement Reliability Estimation Techniques: Correlation Coefficient vs. Bland-Altman Plot

Peer reviewed

Tülin Otbiçer Acar – Measurement: Interdisciplinary Research and Perspectives, 2024

The aim of this study is to compare the results of correlation coefficient estimation of reliability with those obtained through the Bland-Altman plot technique. The scale was first divided into two halves using three different approaches. A linear and high-level relationship was found between the scale scores obtained from the halved forms.…

Descriptors: High School Students, Measurement Techniques, Psychometrics, Comparative Testing

Poverty and Wealth without a Ladder? An Appraisal of the Stages of Progress Method among Agro-Pastoralists in Ethiopia's Lower Omo Valley

Peer reviewed

Edward G. J. Stevenson; Jil Molenaar; David-Paul Pertaub; Dessalegn Tekle – Field Methods, 2025

Is it possible to measure wealth and poverty across settings while being faithful to local understandings? The stages of progress method (SoP) attempts to do this by building ladders of wealth in locally relevant terms and using these in comparisons across groups. This approach is potentially useful among pastoralist populations where monetary…

Descriptors: Foreign Countries, Poverty, Social Mobility, Evaluation Methods

Evidence-Based Evaluation of Student and Marker Performances in Assessment and Examination

Peer reviewed

Ole J. Kemi – Advances in Physiology Education, 2025

Students are assessed by coursework and/or exams, all of which are marked by assessors (markers). Student and marker performances are then subject to end-of-session board of examiner handling and analysis. This occurs annually and is the basis for evaluating students but also the wider learning and teaching efficiency of an academic institution.…

Descriptors: Undergraduate Students, Evaluation Methods, Evaluation Criteria, Academic Standards

Assessing the Contribution of Measures of Attention and Executive Function to Diagnosis of ADHD or Autism

Peer reviewed

Kelsey Harkness; Signe Bray; Chelsea M. Durber; Deborah Dewey; Kara Murias – Journal of Autism and Developmental Disorders, 2025

Attention and executive function (EF) dysregulation are common in a number of disorders including autism and attention-deficit/hyperactivity disorder (ADHD). Better understanding of the relationship between indirect and direct measures of attention and EF and common neurodevelopmental diagnoses may contribute to more efficient and effective…

Descriptors: Adolescents, Autism Spectrum Disorders, Attention Deficit Hyperactivity Disorder, Executive Function

Signal-to-Noise Ratio in Estimating and Testing the Mediation Effect: Structural Equation Modeling versus Path Analysis with Weighted Composites

Peer reviewed

Ke-Hai Yuan; Zhiyong Zhang; Lijuan Wang – Grantee Submission, 2024

Mediation analysis plays an important role in understanding causal processes in social and behavioral sciences. While path analysis with composite scores has been criticized for yielding biased parameter estimates when variables contain measurement errors, recent literature has pointed out that the population values of parameters of latent-variable models…

Descriptors: Structural Equation Models, Path Analysis, Weighted Scores, Comparative Testing

Impact of Sample Size and Variability on the Power and Type I Error Rates of Equivalence Tests: A Simulation Study

Peer reviewed

Rusticus, Shayna A.; Lovato, Chris Y. – Practical Assessment, Research & Evaluation, 2014

The question of equivalence between two or more groups is frequently of interest to many applied researchers. Equivalence testing is a statistical method designed to provide evidence that groups are comparable by demonstrating that the mean differences found between groups are small enough that they are considered practically unimportant. Few…

Descriptors: Sample Size, Equivalency Tests, Simulation, Error of Measurement

Strengthening the Regression Discontinuity Design Using Additional Design Elements: A Within-Study Comparison

Peer reviewed

Wing, Coady; Cook, Thomas D. – Journal of Policy Analysis and Management, 2013

The sharp regression discontinuity design (RDD) has three key weaknesses compared to the randomized clinical trial (RCT). It has lower statistical power, it is more dependent on statistical modeling assumptions, and its treatment effect estimates are limited to the narrow subpopulation of cases immediately around the cutoff, which is rarely of…

Descriptors: Regression (Statistics), Research Design, Statistical Analysis, Research Problems

Wise and Proper Use of National Assessment of Educational Progress (NAEP) Data

Peer reviewed

Innes, Richard G. – Journal of School Choice, 2012

This article provides examples of how serious misconceptions can result when only "all student" scores from the National Assessment of Educational Progress (NAEP) are used for simplistic state-to-state comparisons. Suggestions for better treatment are presented. The article also compares Kentucky's eighth grade EXPLORE testing to NAEP…

Descriptors: National Competency Tests, Scoring, Misconceptions, Academic Achievement

Tests in Europe: Where We Are and Where We Should Go

Peer reviewed

Elosua, Paula; Iliescu, Dragos – International Journal of Testing, 2012

Psychometric practice does not always converge with the advances of psychometric theory. In order to investigate this gap, the authors focus on the 10 most used psychological tests in Europe, as identified by recent surveys. The article analyzes test manuals published in 6 different European countries for these 10 most used tests. A total of 32…

Descriptors: Psychological Testing, Personality Measures, Error of Measurement, Foreign Countries

Same-Form Retest Effects on Credentialing Examinations

Peer reviewed

Raymond, Mark R.; Neustel, Sandra; Anderson, Dan – Educational Measurement: Issues and Practice, 2009

Examinees who take high-stakes assessments are usually given an opportunity to repeat the test if they are unsuccessful on their initial attempt. To prevent examinees from obtaining unfair score increases by memorizing the content of specific test items, testing agencies usually assign a different test form to repeat examinees. The use of multiple…

Descriptors: Test Results, Test Items, Testing, Aptitude Tests

Performance Assessments with Microworlds and Their Difficulty

Peer reviewed

Kluge, Annette – Applied Psychological Measurement, 2008

The use of microworlds (MWs), or complex dynamic systems, in educational testing and personnel selection is hampered by systematic measurement errors because these new and innovative item formats are not adequately controlled for their difficulty. This empirical study introduces a way to operationalize an MW's difficulty and demonstrates the…

Descriptors: Personnel Selection, Self Efficacy, Educational Testing, Computer Uses in Education

A Comparison of Score Level Estimates of the Standard Error of Measurement

Peer reviewed

Qualls-Payne, Audrey L. – Journal of Educational Measurement, 1992

Six methods for estimating the standard error of measurement (SEM) at specific score levels are compared by checking score-level SEM estimates from a single test administration against estimates from two test administrations, using Iowa Tests of Basic Skills data for 2,138 examinees. L. S. Feldt's method is preferred. (SLD)

Descriptors: Comparative Testing, Elementary Education, Elementary School Students, Error of Measurement

The Comparability of the Standardized Mean Difference Effect Size across Different Measures of the Same Construct: Measurement Considerations

Peer reviewed

Nugent, William R. – Educational and Psychological Measurement, 2006

One of the most important effect sizes used in meta-analysis is the standardized mean difference (SMD). In this article, the conditions under which SMD effect sizes based on different measures of the same construct are directly comparable are investigated. The results show that SMD effect sizes from different measures of the same construct are…

Descriptors: Effect Size, Meta Analysis, True Scores, Error of Measurement

Equating Scores from Adaptive to Linear Tests

Peer reviewed

van der Linden, Wim J. – Applied Psychological Measurement, 2006

Two local methods for observed-score equating are applied to the problem of equating an adaptive test to a linear test. In an empirical study, the methods were evaluated against a method based on the test characteristic function (TCF) of the linear test and traditional equipercentile equating applied to the ability estimates on the adaptive test…

Descriptors: Adaptive Testing, Computer Assisted Testing, Test Format, Equated Scores
