ERIC - Search Results

Publication Date

In 2025	1
Since 2024	2
Since 2021 (last 5 years)	3
Since 2016 (last 10 years)	7
Since 2006 (last 20 years)	9

Descriptor

Error of Measurement	12
Interrater Reliability	12
Reliability	12
Correlation	5
Scores	4
Accuracy	3
Evaluation Methods	3
Generalizability Theory	3
Performance Based Assessment	3
Scoring	3
Statistical Analysis	3
Data Analysis	2
Data Collection	2
Graduate Students	2
Language Tests	2
Measures (Individuals)	2
Psychometrics	2
Rating Scales	2
Research Design	2
Validity	2
Allied Health Personnel	1
Anatomy	1
Artificial Intelligence	1
Athletics	1
Behavior Rating Scales	1

Source

International Journal of…	2
Applied Measurement in…	1
Athletic Training Education…	1
British Educational Research…	1
Developmental Medicine &…	1
Educational Assessment	1
Language Assessment Quarterly	1
Language Testing	1
Research Synthesis Methods	1

Publication Type

Journal Articles	10
Reports - Research	10
Reports - Evaluative	2
Speeches/Meeting Papers	2
Tests/Questionnaires	1

Education Level

Higher Education	3
Postsecondary Education	3

Audience

Researchers

Location

China (Beijing)

Showing all 12 results

Grading Exams Using Large Language Models: A Comparison between Human and AI Grading of Exams in Higher Education Using ChatGPT

Peer reviewed

Jonas Flodén – British Educational Research Journal, 2025

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…

Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring

New Tests of Rater Drift in Trend Scoring

Peer reviewed

John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024

Trend scoring constructed response items (i.e. rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…

Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics

The Reliability of Simultaneous versus Individual Data Collection during Stuttering Assessment

Peer reviewed

Davidow, Jason H.; Ye, Jun; Edge, Robin L. – International Journal of Language & Communication Disorders, 2023

Background: Speech-language pathologists often multitask in order to be efficient with their commonly large caseloads. In stuttering assessment, multitasking often involves collecting multiple measures simultaneously. Aims: The present study sought to determine reliability when collecting multiple measures simultaneously versus individually.…

Descriptors: Graduate Students, Measurement, Reliability, Group Activities

Estimating Hazard Ratios from Published Kaplan-Meier Survival Curves: A Methods Validation Study

Peer reviewed

Saluja, Ronak; Cheng, Sierra; delos Santos, Keemo Althea; Chan, Kelvin K. W. – Research Synthesis Methods, 2019

Objective: Various statistical methods have been developed to estimate hazard ratios (HRs) from published Kaplan-Meier (KM) curves for the purpose of performing meta-analyses. The objective of this study was to determine the reliability, accuracy, and precision of four commonly used methods by Guyot, Williamson, Parmar, and Hoyle and Henley.…

Descriptors: Meta Analysis, Reliability, Accuracy, Randomized Controlled Trials

Working with Sparse Data in Rated Language Tests: Generalizability Theory Applications

Peer reviewed

Lin, Chih-Kai – Language Testing, 2017

Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…

Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy

Reliability of Entry-Level Athletic Trainers' Palpation Skills of Bony Anatomical Landmarks in the Lumbopelvic Region

Peer reviewed

Schultz, Sarah M.; Jacobs, Michelle M.; Gorgos, Kara S.; Wasylyk, Nicole T.; Hanrahan, Sean; Van Lunen, Bonnie L. – Athletic Training Education Journal, 2015

Context: Accuracy of locating various lumbopelvic landmarks for novice athletic trainers has not been examined. Objective: To examine reliability of novice athletic trainers for identification of the L4 spinous process and right and left posterior superior iliac spine (PSIS). Design: Cross-sectional reliability. Setting: Laboratory. Patients or…

Descriptors: Athletics, Allied Health Personnel, Entry Workers, Reliability

Reliability of the Test of Integrated Language and Literacy Skills (TILLS)

Peer reviewed

Mailend, Marja-Liisa; Plante, Elena; Anderson, Michele A.; Applegate, E. Brooks; Nelson, Nickola W. – International Journal of Language & Communication Disorders, 2016

Background: As new standardized tests become commercially available, it is critical that clinicians have access to the information about a test's psychometric properties, including aspects of reliability. Aims: The purpose of the three studies reported in this article was to investigate the reliability of a new test, the Test of Integrated…

Descriptors: Standardized Tests, Psychometrics, Reliability, Language Skills

Investigating Score Dependability in English/Chinese Interpreter Certification Performance Testing: A Generalizability Theory Approach

Peer reviewed

Han, Chao – Language Assessment Quarterly, 2016

As a property of test scores, reliability/dependability constitutes an important psychometric consideration, and it underpins the validity of measurement results. A review of interpreter certification performance tests (ICPTs) reveals that (a) although reliability/dependability checking has been recognized as an important concern, its theoretical…

Descriptors: Foreign Countries, Scores, English, Chinese

Rating Scales for Dystonia in Cerebral Palsy: Reliability and Validity

Peer reviewed

Monbaliu, E.; Ortibus, E.; Roelens, F.; Desloovere, K.; Deklerck, J.; Prinzie, P.; De Cock, P.; Feys, H. – Developmental Medicine & Child Neurology, 2010

Aim: This study investigated the reliability and validity of the Barry-Albright Dystonia Scale (BADS), the Burke-Fahn-Marsden Movement Scale (BFMMS), and the Unified Dystonia Rating Scale (UDRS) in patients with bilateral dystonic cerebral palsy (CP). Method: Three raters independently scored videotapes of 10 patients (five males, five females;…

Descriptors: Content Validity, Cerebral Palsy, Validity, Interrater Reliability

Estimating the Reliability of Dynamic Variables Requiring Rater Judgment: A Generalizability Paradigm.

Webber, Larry; And Others – 1986

Generalizability theory, which subsumes classical measurement theory as a special case, provides a general model for estimating the reliability of observational rating data by estimating the variance components of the measurement design. Research data from the "Heart Smart" health intervention program were analyzed as a heuristic tool.…

Descriptors: Behavior Rating Scales, Cardiovascular System, Error of Measurement, Generalizability Theory

Generalizability Theory in Program Evaluation.

Rothman, M. L.; And Others – 1982

A practical application of generalizability theory, demonstrating how the variance components contribute to understanding and interpreting the data collected to evaluate a program, is described. The evaluation concerned 120 learning modules developed for the Dental Auxiliary Education Project. The goals of the project were to design, implement,…

Descriptors: Correlation, Data Collection, Dental Schools, Educational Research

Reliability and Decision Consistency: An Analysis of Writing Mode at Two Times on a Statewide Test.

Peer reviewed

Hollenbeck, Keith; Tindal, Gerald; Almond, Patricia – Educational Assessment, 1999

Studied the amount of measurement error in a state's performance-based writing task as it relates to high-stakes decision reproducibility. Using 175 eighth-grade writing samples, the study finds moderate correlations between the two raters' scores, with significant differences for the rates for the handwritten, but not the typed, essays. (SLD)

Descriptors: Decision Making, Error of Measurement, Essay Tests, Grade 8

Author

Almond, Patricia	1
Anderson, Michele A.	1
Applegate, E. Brooks	1
Carol Eckerly	1
Chan, Kelvin K. W.	1
Cheng, Sierra	1
Davidow, Jason H.	1
De Cock, P.	1
Deklerck, J.	1
Desloovere, K.	1
Edge, Robin L.	1
Feys, H.	1
Gorgos, Kara S.	1
Han, Chao	1
Hanrahan, Sean	1
Hollenbeck, Keith	1
Jacobs, Michelle M.	1
John R. Donoghue	1
Jonas Flodén	1
Lin, Chih-Kai	1
Mailend, Marja-Liisa	1
Monbaliu, E.	1
Nelson, Nickola W.	1
Ortibus, E.	1