Publication Date
In 2025 | 2 |
Since 2024 | 6 |
Since 2021 (last 5 years) | 18 |
Since 2016 (last 10 years) | 36 |
Since 2006 (last 20 years) | 77 |
Descriptor
Error of Measurement | 141 |
Scoring | 141 |
Test Reliability | 34 |
Item Response Theory | 30 |
Scores | 30 |
Test Items | 26 |
Test Validity | 25 |
Interrater Reliability | 23 |
Psychometrics | 21 |
Comparative Analysis | 19 |
Testing | 18 |
Author
Brennan, Robert L. | 3 |
Kim, Sooyeon | 3 |
Lee, Won-Chan | 3 |
Livingston, Samuel A. | 3 |
Puhan, Gautam | 3 |
Schoen, Robert C. | 3 |
Yang, Xiaotong | 3 |
Fall, Anna-Maria | 2 |
Magimairaj, Beula M. | 2 |
Dimitrov, Dimiter M. | 2 |
Roberts, Greg | 2 |
Audience
Researchers | 8 |
Practitioners | 3 |
Parents | 1 |
Location
New York | 5 |
Florida | 2 |
New Mexico | 2 |
United States | 2 |
China | 1 |
China (Shanghai) | 1 |
Germany | 1 |
Netherlands | 1 |
Pennsylvania | 1 |
United Kingdom | 1 |
Mark White; Matt Ronfeldt – Educational Assessment, 2024
Standardized observation systems seek to reliably measure a specific conceptualization of teaching quality, managing rater error through mechanisms such as certification, calibration, validation, and double-scoring. These mechanisms both support high-quality scoring and generate the empirical evidence used to support the scoring inference (i.e.,…
Descriptors: Interrater Reliability, Quality Control, Teacher Effectiveness, Error Patterns
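The entry above names double-scoring as one mechanism for managing rater error. A minimal sketch of the kind of agreement check double-scoring enables, with entirely hypothetical ratings on a 1-4 rubric (not data from the study):

```python
import numpy as np

# Hypothetical double-scored ratings: two raters score the same observation
# segments on a 1-4 rubric (values are illustrative, not from the study).
rater_a = np.array([3, 2, 4, 3, 1, 2, 3, 4, 2, 3])
rater_b = np.array([3, 3, 4, 2, 1, 2, 4, 4, 2, 3])

diff = np.abs(rater_a - rater_b)
exact_agreement = np.mean(diff == 0)      # identical scores
adjacent_agreement = np.mean(diff <= 1)   # within one scale point

print(f"exact agreement:    {exact_agreement:.2f}")
print(f"adjacent agreement: {adjacent_agreement:.2f}")
```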
Xin Qiao; Akihito Kamata; Cornelis Potgieter – Grantee Submission, 2023
Oral reading fluency (ORF) assessments are commonly used to screen at-risk readers and to evaluate the effectiveness of interventions as curriculum-based measurements. As with other assessments, equating ORF scores becomes necessary when we want to compare ORF scores from different test forms. Recently, Kara et al. (2023) proposed a model-based…
Descriptors: Error of Measurement, Oral Reading, Reading Fluency, Equated Scores
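As a rough illustration of the equating problem this entry raises, a minimal linear-equating sketch under assumed data: a score from one ORF form is placed on another form's scale by matching means and standard deviations. This is a generic textbook method, not the model-based approach of Kara et al.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical words-correct-per-minute scores on two ORF passages (forms).
form_x = rng.normal(110, 25, size=500)   # reference form
form_y = rng.normal(100, 22, size=500)   # new form, slightly harder

def linear_equate(y, x_ref, y_ref):
    """Map a Form Y score onto the Form X scale by matching means and SDs."""
    slope = np.std(x_ref) / np.std(y_ref)
    return np.mean(x_ref) + slope * (y - np.mean(y_ref))

# A Form Y score of 95 expressed on the Form X scale.
print(linear_equate(95.0, form_x, form_y))
```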
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs when grading university exams relative to human teachers. Aspects investigated include consistency, large discrepancies, and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
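One way to quantify the grading consistency examined in this study is chance-corrected agreement between teacher-assigned and model-assigned grades. A minimal sketch with hypothetical grades on a 0-5 scale, using quadratic weighting so that large discrepancies are penalised more heavily than near-misses:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical exam grades on a 0-5 scale (illustrative, not the study's data).
teacher_grades = [5, 4, 3, 4, 2, 5, 1, 3, 4, 2]
llm_grades     = [4, 4, 3, 3, 2, 5, 2, 3, 5, 2]

# Quadratic weighting penalises large disagreements more than adjacent ones.
kappa = cohen_kappa_score(teacher_grades, llm_grades, weights="quadratic")
print(f"quadratic-weighted kappa: {kappa:.2f}")
```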
Jiayi Deng – ProQuest LLC, 2024
Test score comparability in international large-scale assessments (LSA) is of utmost importance in measuring the effectiveness of education systems and understanding the impact of education on economic growth. To effectively compare test scores on an international scale, score linking is widely used to convert raw scores from different linguistic…
Descriptors: Item Response Theory, Scoring Rubrics, Scoring, Error of Measurement
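Score linking of the kind described here typically places item parameters estimated in one language version onto the reference IRT scale. A minimal mean/sigma linking sketch with hypothetical anchor-item difficulties (a generic method, not necessarily the dissertation's procedure):

```python
import numpy as np

# Hypothetical difficulties for anchor items, estimated separately in two
# language versions (values are illustrative).
b_source = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # new-language calibration
b_target = np.array([-1.0, -0.3, 0.2, 1.0, 1.6])   # reference-scale calibration

# Mean/sigma transformation: b_target ≈ A * b_source + B
A = np.std(b_target) / np.std(b_source)
B = np.mean(b_target) - A * np.mean(b_source)

def to_reference_scale(theta_source):
    """Place an ability estimate from the source calibration on the reference scale."""
    return A * theta_source + B

print(A, B, to_reference_scale(0.5))
```

Characteristic-curve methods such as Haebara or Stocking-Lord are common alternatives when item discriminations are also linked.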
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring constructed-response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
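The product-multinomial structure noted above arises because the Time A score distribution fixes the row totals of the two-way rescore table. A minimal simulation sketch with hypothetical margins and rescore probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed Time A margins: how many responses originally received each score 0-3
# (values are hypothetical).
n_per_time_a_score = [40, 120, 100, 40]

# Hypothetical probability that a Time A score of k is rescored as j at Time B.
rescore_probs = np.array([
    [0.80, 0.15, 0.04, 0.01],
    [0.10, 0.75, 0.13, 0.02],
    [0.02, 0.12, 0.76, 0.10],
    [0.01, 0.04, 0.15, 0.80],
])

# Product multinomial: each row of the two-way table is an independent
# multinomial draw with its own fixed total.
table = np.vstack([rng.multinomial(n, p)
                   for n, p in zip(n_per_time_a_score, rescore_probs)])

exact_agreement = np.trace(table) / table.sum()
print(table)
print(f"exact agreement: {exact_agreement:.2f}")
```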
Louise Badham – Oxford Review of Education, 2025
Different sources of assessment evidence are reviewed during International Baccalaureate (IB) grade awarding to convert marks into grades and ensure fair results for students. Qualitative and quantitative evidence are analysed to determine grade boundaries, with statistical evidence weighed against examiner judgement and teachers' feedback on…
Descriptors: Advanced Placement Programs, Grading, Interrater Reliability, Evaluative Thinking
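Once grade boundaries are set, converting marks into grades is a lookup against the boundary list. A minimal sketch with hypothetical boundary values (not actual IB boundaries):

```python
import bisect

# Hypothetical minimum marks required for grades 1-7 (illustrative only).
boundaries = [0, 14, 26, 38, 50, 62, 75]

def mark_to_grade(mark: int) -> int:
    """Return the grade whose boundary is the highest one not exceeding the mark."""
    return bisect.bisect_right(boundaries, mark)   # grades run 1..7

print(mark_to_grade(49), mark_to_grade(50), mark_to_grade(80))   # 4 5 7
```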
Comparison of the Results of the Generalizability Theory with the Inter-Rater Agreement Coefficients
Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022
The agreement between raters is examined within the scope of the concept of "inter-rater reliability". Although there are clear definitions of the concepts of agreement between raters and reliability between raters, there is no clear information about the conditions under which agreement and reliability level methods are appropriate to…
Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory
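For context on what the two frameworks estimate, a minimal sketch of a fully crossed persons × raters design analysed the generalizability-theory way: variance components from two-way mean squares and a relative G coefficient for a single rater (ratings are hypothetical):

```python
import numpy as np

# Hypothetical ratings: rows = persons, columns = raters (fully crossed design).
ratings = np.array([
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [3, 3, 2],
    [4, 4, 5],
    [1, 2, 2],
], dtype=float)

n_p, n_r = ratings.shape
grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

ss_p = n_r * np.sum((person_means - grand) ** 2)
ss_r = n_p * np.sum((rater_means - grand) ** 2)
ss_e = np.sum((ratings - grand) ** 2) - ss_p - ss_r   # person x rater residual

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_e = ss_e / ((n_p - 1) * (n_r - 1))

# Variance components from the expected mean squares.
var_p = (ms_p - ms_e) / n_r
var_r = (ms_r - ms_e) / n_p
var_e = ms_e

# Relative (generalizability) coefficient for a single rater.
g_single = var_p / (var_p + var_e)
print(f"variance components: person={var_p:.3f}, rater={var_r:.3f}, residual={var_e:.3f}")
print(f"G coefficient (one rater): {g_single:.2f}")
```

Kappa-type agreement coefficients, by contrast, summarise consistency without separating the rater and residual variance components.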
Atehortua, Laura – ProQuest LLC, 2022
Intelligence tests are used in a variety of settings such as schools, clinics, and courts to assess the intellectual capacity of individuals of all ages. Intelligence tests are used to make high-stakes decisions such as special education placement, employment, eligibility for social security services, and determination of the death penalty.…
Descriptors: Adults, Intelligence Tests, Children, Error of Measurement
Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024
We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners
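Of the three resolution approaches compared above, the rater-agreement rule is the simplest to sketch: responses where two raters differ by more than a tolerance are routed to a third rating, and the rest are averaged. A minimal illustration with simulated scores (the tolerance and rubric are assumptions, not the study's settings):

```python
import numpy as np

rng = np.random.default_rng(2)

rater_1 = rng.integers(0, 5, size=12)                        # hypothetical 0-4 rubric scores
rater_2 = np.clip(rater_1 + rng.integers(-2, 3, size=12), 0, 4)

tolerance = 1
needs_resolution = np.abs(rater_1 - rater_2) > tolerance     # flag large disagreements

final = np.where(needs_resolution,
                 -1,                                          # placeholder: route to a third rater
                 np.round((rater_1 + rater_2) / 2))           # otherwise average the two scores

print("flagged for resolution:", np.flatnonzero(needs_resolution))
print("final scores (-1 = awaiting third rating):", final)
```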
Little, Todd D.; Bontempo, Daniel; Rioux, Charlie; Tracy, Allison – International Journal of Research & Method in Education, 2022
Multilevel modelling (MLM) is the most frequently used approach for evaluating interventions with clustered data. MLM, however, has some limitations that are associated with numerous obstacles to model estimation and valid inferences. Longitudinal multiple-group (LMG) modelling is a longstanding approach for testing intervention effects using…
Descriptors: Longitudinal Studies, Hierarchical Linear Modeling, Alternative Assessment, Intervention
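For reference, a minimal sketch of the multilevel-modelling baseline the authors compare against: a random-intercept model for students clustered in classrooms, fit with statsmodels on simulated data (this is the MLM baseline, not the LMG approach the article proposes):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Simulate 30 classrooms of 20 students; treatment assigned at the classroom level.
n_class, n_per = 30, 20
classroom = np.repeat(np.arange(n_class), n_per)
treated = np.repeat(rng.integers(0, 2, size=n_class), n_per)
class_effect = np.repeat(rng.normal(0, 0.5, size=n_class), n_per)
y = 0.3 * treated + class_effect + rng.normal(0, 1.0, size=n_class * n_per)

df = pd.DataFrame({"y": y, "treated": treated, "classroom": classroom})

# Random-intercept multilevel model: students nested in classrooms.
model = smf.mixedlm("y ~ treated", data=df, groups=df["classroom"]).fit()
print(model.summary())
```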
Lockwood, Adam B.; Klatka, Kelsey; Parker, Brandon; Benson, Nicholas – Journal of Psychoeducational Assessment, 2023
Eighty Woodcock-Johnson IV Tests of Achievement protocols from 40 test administrators were examined to determine the types and frequencies of administration and scoring errors made. Non-critical errors (e.g., failure to record verbatim) were found on every protocol (M = 37.2). Critical (e.g., standard score, start point) errors were found on 98.8%…
Descriptors: Achievement Tests, Testing, Scoring, Error of Measurement
Kim, Stella Yun; Lee, Won-Chan – Applied Measurement in Education, 2023
This study evaluates various scoring methods including number-correct scoring, IRT theta scoring, and hybrid scoring in terms of scale-score stability over time. A simulation study was conducted to examine the relative performance of five scoring methods in terms of preserving the first two moments of scale scores for a population in a chain of…
Descriptors: Scoring, Comparative Analysis, Item Response Theory, Simulation
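Two of the scoring methods compared in this study are easy to sketch side by side: number-correct scoring is a simple sum, while IRT theta scoring finds the ability value that maximises the likelihood of the response pattern. A minimal Rasch-model illustration with hypothetical item difficulties:

```python
import numpy as np
from scipy.optimize import minimize_scalar

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 1.8])    # hypothetical Rasch item difficulties
responses = np.array([1, 1, 1, 0, 1, 0])           # one examinee's 0/1 responses

number_correct = responses.sum()                    # number-correct score

def neg_log_likelihood(theta):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))          # Rasch probability of a correct response
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

theta_hat = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded").x

print(f"number-correct score: {number_correct}")
print(f"IRT theta estimate:   {theta_hat:.2f}")
```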
Ferrando, Pere J.; Navarro-González, David – Educational and Psychological Measurement, 2021
Item response theory "dual" models (DMs) in which both items and individuals are viewed as sources of differential measurement error so far have been proposed only for unidimensional measures. This article proposes two multidimensional extensions of existing DMs: the M-DTCRM (dual Thurstonian continuous response model), intended for…
Descriptors: Item Response Theory, Error of Measurement, Models, Factor Analysis
Dimitrov, Dimiter M.; Atanasov, Dimitar V. – Educational and Psychological Measurement, 2021
This study presents a latent (item response theory--like) framework of a recently developed classical approach to test scoring, equating, and item analysis, referred to as "D"-scoring method. Specifically, (a) person and item parameters are estimated under an item response function model on the "D"-scale (from 0 to 1) using…
Descriptors: Scoring, Equated Scores, Item Analysis, Item Response Theory
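The "D"-scale runs from 0 to 1, which invites a simple numeric illustration. The sketch below is only a generic difficulty-weighted proportion score rescaled to that interval; it is an assumption for illustration and does not reproduce Dimitrov and Atanasov's estimation procedure:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 200 examinees x 10 dichotomous items.
responses = (rng.random((200, 10)) < np.linspace(0.85, 0.35, 10)).astype(int)

# Classical item difficulty: proportion answering each item incorrectly.
item_difficulty = 1.0 - responses.mean(axis=0)

# Illustrative difficulty-weighted score on a 0-1 scale (an assumption for
# illustration only, not the D-scoring estimation described in the article).
weights = item_difficulty / item_difficulty.sum()
d_like_score = responses @ weights

print("score range:", d_like_score.min().round(2), "to", d_like_score.max().round(2))
```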
Uysal, Ibrahim; Dogan, Nuri – International Journal of Assessment Tools in Education, 2021
Scoring constructed-response items can be highly difficult, time-consuming, and costly in practice. Improvements in computer technology have enabled automated scoring of constructed-response items. However, the application of automated scoring without an investigation of test equating can lead to serious problems. The goal of this study was to…
Descriptors: Computer Assisted Testing, Scoring, Item Response Theory, Test Format
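The equating concern raised here can be illustrated with a minimal equipercentile-style sketch: a score on the automatically scored form is mapped to the score with the same percentile rank on the human-scored form (simulated totals, not the study's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

human_scored = rng.normal(20, 5, size=1000)     # hypothetical total scores, Form A
auto_scored = rng.normal(22, 6, size=1000)      # hypothetical total scores, Form B

def equipercentile(score, new_form, ref_form):
    """Map a new-form score to the reference-form score with the same percentile rank."""
    pct = stats.percentileofscore(new_form, score) / 100.0
    return np.quantile(ref_form, pct)

print(equipercentile(25.0, auto_scored, human_scored))
```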