Susan K. Johnsen – Gifted Child Today, 2025
The author provides information about reliability and areas that educators should examine in determining if an assessment is consistent and trustworthy for use, and how it should be interpreted in making decisions about students. Reliability areas that are discussed in the column include internal consistency, test-retest or stability, inter-scorer…
Descriptors: Test Reliability, Academically Gifted, Student Evaluation, Error of Measurement
Rosanna Cole – Sociological Methods & Research, 2024
The use of inter-rater reliability (IRR) methods may provide an opportunity to improve the transparency and consistency of qualitative case study data analysis in terms of the rigor of how codes and constructs have been developed from the raw data. Few articles on qualitative research methods in the literature conduct IRR assessments or neglect to…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Research Methodology
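The IRR assessments discussed in this entry are commonly operationalized with a chance-corrected agreement statistic such as Cohen's kappa. A minimal stdlib-Python sketch, assuming two hypothetical coders and made-up theme codes (not data from the article):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed percent agreement corrected for the
    agreement expected by chance from each rater's code frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - chance) / (1 - chance)

# Two coders assigning hypothetical theme codes to ten interview excerpts.
coder_1 = ["t1", "t1", "t2", "t2", "t1", "t3", "t3", "t1", "t2", "t1"]
coder_2 = ["t1", "t1", "t2", "t1", "t1", "t3", "t2", "t1", "t2", "t1"]
print(round(cohens_kappa(coder_1, coder_2), 3))  # → 0.661
```

Here the coders agree on 8 of 10 excerpts (0.80 raw agreement), but kappa discounts the 0.41 agreement expected by chance, yielding a more conservative 0.661.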
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
Comparison of the Results of the Generalizability Theory with the Inter-Rater Agreement Coefficients
Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022
The agreement between raters is examined within the scope of the concept of "inter-rater reliability". Although there are clear definitions of the concepts of agreement between raters and reliability between raters, there is no clear information about the conditions under which agreement and reliability level methods are appropriate to…
Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory
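For a fully crossed persons × raters design, the generalizability-theory side of this comparison reduces to estimating variance components from a two-way ANOVA and forming a relative G coefficient. A sketch under that assumption, with made-up ratings (not the study's data or exact procedure):

```python
def variance_components(scores):
    """Person, rater, and residual variance components for a fully crossed
    persons x raters design (one observation per cell), via the standard
    two-way ANOVA expected mean squares."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return var_p, var_r, ms_res

def g_coefficient(scores):
    """Relative G coefficient: person variance over person variance plus
    relative error (residual averaged over the n_r raters)."""
    n_r = len(scores[0])
    var_p, _, var_res = variance_components(scores)
    return var_p / (var_p + var_res / n_r)

# Hypothetical ratings: 4 examinees each scored by the same 3 raters.
ratings = [
    [7, 8, 7],
    [5, 5, 6],
    [9, 8, 9],
    [4, 5, 4],
]
print(round(g_coefficient(ratings), 3))  # → 0.964
```

Unlike a pairwise agreement coefficient, the G coefficient uses all raters at once and scales the error term by the number of raters averaged over, which is one reason the two families of indices can disagree.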
Breanne J. Byiers; Alyssa M. Merbler; Chantel C. Burkitt; Frank J. Symons – American Journal on Intellectual and Developmental Disabilities, 2025
Sleep problems are common in Rett syndrome and other neurogenetic syndromes. Actigraphy is a cost-effective, objective method for measuring sleep. Current guidelines require caregiver-reported bed and wake times to facilitate actigraphy data scoring. The current study examined missingness and consistency of caregiver-reported bed and wake times…
Descriptors: Sleep, Neurodevelopmental Disorders, Psychomotor Skills, Genetic Disorders
Ole J. Kemi – Advances in Physiology Education, 2025
Students are assessed by coursework and/or exams, all of which are marked by assessors (markers). Student and marker performances are then subject to end-of-session board of examiner handling and analysis. This occurs annually and is the basis for evaluating students but also the wider learning and teaching efficiency of an academic institution.…
Descriptors: Undergraduate Students, Evaluation Methods, Evaluation Criteria, Academic Standards
Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024
We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners
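Of the three approaches named in this entry, a rater-agreement resolution rule is the simplest to illustrate: double scores within some tolerance are accepted (here, averaged), while larger discrepancies are routed to adjudication. The rule, tolerance, and data below are illustrative assumptions, not the study's procedure:

```python
def resolve_double_scores(scores_a, scores_b, tolerance=1):
    """Illustrative rater-agreement resolution: average the two ratings when
    they fall within `tolerance` points; otherwise flag the response index
    for adjudication by a third rater."""
    resolved, needs_adjudication = [], []
    for i, (a, b) in enumerate(zip(scores_a, scores_b)):
        if abs(a - b) <= tolerance:
            resolved.append((i, (a + b) / 2))
        else:
            needs_adjudication.append(i)
    return resolved, needs_adjudication

# Hypothetical double scores on five constructed responses (0-6 scale).
rater_1 = [4, 3, 2, 5, 1]
rater_2 = [4, 4, 0, 5, 3]
resolved, flagged = resolve_double_scores(rater_1, rater_2)
print(flagged)  # → [2, 4]
```

Person-fit and targeted double scoring instead select which responses to re-score based on model-based statistics rather than a raw score difference.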
Wenjing Guo – ProQuest LLC, 2021
Constructed response (CR) items are widely used in large-scale testing programs, including the National Assessment of Educational Progress (NAEP) and many district and state-level assessments in the United States. One unique feature of CR items is that they depend on human raters to assess the quality of examinees' work. The judgment of human…
Descriptors: National Competency Tests, Responses, Interrater Reliability, Error of Measurement
Cousineau, Denis; Laurencelle, Louis – Educational and Psychological Measurement, 2017
Assessing global interrater agreement is difficult as most published indices are affected by the presence of mixtures of agreements and disagreements. A previously proposed method was shown to be specifically sensitive to global agreement, excluding mixtures, but also negatively biased. Here, we propose two alternatives in an attempt to find what…
Descriptors: Interrater Reliability, Evaluation Methods, Statistical Bias, Accuracy
Gingerich, Andrea; Ramlo, Susan E.; van der Vleuten, Cees P. M.; Eva, Kevin W.; Regehr, Glenn – Advances in Health Sciences Education, 2017
Whenever multiple observers provide ratings, even of the same performance, inter-rater variation is prevalent. The resulting "idiosyncratic rater variance" is considered to be unusable error of measurement in psychometric models and is a threat to the defensibility of our assessments. Prior studies of inter-rater variation in clinical…
Descriptors: Interrater Reliability, Error of Measurement, Psychometrics, Q Methodology
Bardhoshi, Gerta; Erford, Bradley T. – Measurement and Evaluation in Counseling and Development, 2017
Precision is a key facet of test development, with score reliability determined primarily according to the types of error one wants to approximate and demonstrate. This article identifies and discusses several primary forms of reliability estimation: internal consistency (i.e., split-half, KR-20, a), test-retest, alternate forms, interscorer, and…
Descriptors: Scores, Test Reliability, Accuracy, Pretests Posttests
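Two of the reliability estimates named in this entry, split-half (with the Spearman-Brown correction) and KR-20, can be computed directly from an item-response matrix. A sketch using hypothetical dichotomous responses (not data from the article):

```python
def kr20(responses):
    """Kuder-Richardson 20 internal consistency for dichotomous (0/1)
    items; `responses` is one list of item scores per examinee."""
    n, k = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # population variance
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # item difficulty
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_t)

def split_half(responses):
    """Odd-even split-half reliability, stepped up to full test length
    with the Spearman-Brown correction 2r / (1 + r)."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    n = len(odd)
    mo, me = sum(odd) / n, sum(even) / n
    cov = sum((a - mo) * (b - me) for a, b in zip(odd, even))
    sd = (sum((a - mo) ** 2 for a in odd)
          * sum((b - me) ** 2 for b in even)) ** 0.5
    r = cov / sd  # Pearson correlation between the two halves
    return 2 * r / (1 + r)

# Hypothetical responses: 5 examinees x 4 dichotomous items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 2), round(split_half(responses), 2))  # → 0.8 0.88
```

The two estimates approximate different error sources (item sampling within one occasion), whereas test-retest and interscorer coefficients, also discussed in the article, target stability over time and over raters respectively.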
Lin, Chih-Kai – Language Testing, 2017
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…
Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy
Milanowski, Anthony T. – Online Submission, 2011
After decades of disinterest, evaluation of the performance of elementary and secondary teachers in the United States has become an important educational policy issue. As U.S. states and districts have tried to upgrade their evaluation processes, one of the models that has been increasingly used is the Framework for Teaching. This paper summarizes…
Descriptors: Evidence, Teacher Effectiveness, Teacher Evaluation, Observation
Hathcoat, John D.; Penn, Jeremy D. – Research & Practice in Assessment, 2012
Critics of standardized testing have recommended replacing standardized tests with more authentic assessment measures, such as classroom assignments, projects, or portfolios rated by a panel of raters using common rubrics. Little research has examined the consistency of scores across multiple authentic assignments or the implications of this…
Descriptors: Generalizability Theory, Performance Based Assessment, Writing Across the Curriculum, Standardized Tests
Monbaliu, E.; Ortibus, E.; Roelens, F.; Desloovere, K.; Deklerck, J.; Prinzie, P.; De Cock, P.; Feys, H. – Developmental Medicine & Child Neurology, 2010
Aim: This study investigated the reliability and validity of the Barry-Albright Dystonia Scale (BADS), the Burke-Fahn-Marsden Movement Scale (BFMMS), and the Unified Dystonia Rating Scale (UDRS) in patients with bilateral dystonic cerebral palsy (CP). Method: Three raters independently scored videotapes of 10 patients (five males, five females;…
Descriptors: Content Validity, Cerebral Palsy, Validity, Interrater Reliability