ERIC - Search Results

Publication Date

In 2026	0
Since 2025	8
Since 2022 (last 5 years)	17
Since 2017 (last 10 years)	30
Since 2007 (last 20 years)	74

Descriptor

Evaluation Methods	140
Scoring	140
Interrater Reliability	68
Test Reliability	50
Reliability	43
Student Evaluation	34
Test Validity	29
Writing Evaluation	29
Scores	24
Validity	20
Evaluators	19
Performance Based Assessment	19
Foreign Countries	18
Computer Assisted Testing	16
Elementary Secondary Education	15
Comparative Analysis	14
Educational Assessment	14
Higher Education	14
Measurement Techniques	14
Correlation	13
Testing	13
English (Second Language)	12
Essays	12
Test Construction	12
Writing (Composition)	12
More ▼

Education Level

Higher Education	19
Elementary Secondary Education	14
Postsecondary Education	14
Elementary Education	11
Secondary Education	9
Early Childhood Education	6
Middle Schools	6
Grade 4	4
High Schools	4
Grade 6	3
Grade 8	3
Junior High Schools	3
Kindergarten	3
Primary Education	3
Grade 2	2
Grade 3	2
Grade 5	2
Grade 7	2
Intermediate Grades	2
Preschool Education	2
Adult Education	1
Grade 1	1
More ▼

Audience

Practitioners	8
Researchers	7
Teachers	4
Policymakers	3
Students	1

Location

Vermont	5
Australia	3
United Kingdom (England)	3
Canada	2
Connecticut	2
New Hampshire	2
New York	2
North Carolina	2
Rhode Island	2
Alabama	1
Austria	1
California	1
China	1
Colorado	1
Europe	1
Finland	1
Hong Kong	1
Israel	1
Nebraska	1
Russia	1
Singapore	1
Spain	1
Sweden	1
Texas	1
Turkey	1
More ▼

Laws, Policies, & Programs

Every Student Succeeds Act…	3
Elementary and Secondary…	1

Assessments and Surveys

National Assessment of…	5
Advanced Placement…	2
Graduate Record Examinations	2
New York State Regents…	2
Test of English as a Foreign…	2
Child Behavior Checklist	1
Childrens Depression Inventory	1
Minnesota Multiphasic…	1
National Teacher Examinations	1

What Works Clearinghouse Rating

Showing 1 to 15 of 140 results Save | Export

Grading Exams Using Large Language Models: A Comparison between Human and AI Grading of Exams in Higher Education Using ChatGPT

Peer reviewed

Direct link

Jonas Flodén – British Educational Research Journal, 2025

This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…

Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring

Comparison of the Results of the Generalizability Theory with the Inter-Rater Agreement Coefficients

Peer reviewed
PDF on ERIC

Download full text

Eser, Mehmet Taha; Aksu, Gökhan – International Journal of Curriculum and Instruction, 2022

The agreement between raters is examined within the scope of the concept of "inter-rater reliability". Although there are clear definitions of the concepts of agreement between raters and reliability between raters, there is no clear information about the conditions under which agreement and reliability level methods are appropriate to…

Descriptors: Generalizability Theory, Interrater Reliability, Evaluation Methods, Test Theory

Examining the Psychometric Impact of Targeted and Random Double-Scoring in Mixed-Format Assessments

Peer reviewed

Direct link

Yangmeng Xu; Stefanie A. Wind – Educational Measurement: Issues and Practice, 2025

Double-scoring constructed-response items is a common but costly practice in mixed-format assessments. This study explored the impacts of Targeted Double-Scoring (TDS) and random double-scoring procedures on the quality of psychometric outcomes, including student achievement estimates, person fit, and student classifications under various…

Descriptors: Academic Achievement, Psychometrics, Scoring, Evaluation Methods

Engaging Undergraduate Students in Preprint Peer Review

Peer reviewed

Direct link

Dawn Holford; Janet McLean; Alex O. Holcombe; Iratxe Puebla; Vera Kempe – Active Learning in Higher Education, 2025

Authentic assessment allows students to demonstrate knowledge and skills in real-world tasks. In research, peer review is one such task that researchers learn by doing, as they evaluate other researchers' work. This means peer review could serve as an authentic assessment that engages students' critical thinking skills in a process of active…

Descriptors: Undergraduate Students, Evaluation Methods, Peer Evaluation, Interrater Reliability

Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

Peer reviewed

Direct link

Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…

Descriptors: Automation, Grading, Computer Assisted Testing, Scoring

Operationalizing a Weighted Performance Scoring Model for Sustainable e-Learning in Medical Education: Insights from Expert Judgement

Peer reviewed
PDF on ERIC

Download full text

Deborah Oluwadele; Yashik Singh; Timothy Adeliyi – Electronic Journal of e-Learning, 2024

Validation is needed for any newly developed model or framework because it requires several real-life applications. The investment made into e-learning in medical education is daunting, as is the expectation for a positive return on investment. The medical education domain requires data-wise implementation of e-learning as the debate continues…

Descriptors: Electronic Learning, Evaluation Methods, Medical Education, Sustainability

Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

Peer reviewed

Direct link

Maestrales, Sarah; Zhai, Xiaoming; Touitou, Israel; Baker, Quinton; Schneider, Barbara; Krajcik, Joseph – Journal of Science Education and Technology, 2021

In response to the call for promoting three-dimensional science learning (NRC, 2012), researchers argue for developing assessment items that go beyond rote memorization tasks to ones that require deeper understanding and the use of reasoning that can improve science literacy. Such assessment items are usually performance-based constructed…

Descriptors: Artificial Intelligence, Scoring, Evaluation Methods, Chemistry

Analyzing Inter-Rater Variation: Exploring Consistency in Mathematics Teachers' Scoring of Exam Papers

Peer reviewed
PDF on ERIC

Download full text

Hosseinali Gholami – Mathematics Teaching Research Journal, 2025

Scoring mathematics exam papers accurately is vital for fostering students' engagement and interest in the subject. Incorrect scoring practices can erode motivation and lead to the development of false self-confidence. Therefore, the implementation of appropriate scoring methods is essential for the success of mathematics education. This study…

Descriptors: Interrater Reliability, Mathematics Teachers, Scoring, Mathematics Tests

A Systematic Review of Early Writing Assessment Tools

Peer reviewed

Direct link

Katherine L. Buchanan; Milena Keller-Margulis; Amanda Hut; Weihua Fan; Sarah S. Mire; G. Thomas Schanding Jr. – Early Childhood Education Journal, 2025

There is considerable research regarding measures of early reading but much less in early writing. Nevertheless, writing is a critical skill for success in school and early difficulties in writing are likely to persist without intervention. A necessary step toward identifying those students who need additional support is the use of screening…

Descriptors: Writing Evaluation, Evaluation Methods, Emergent Literacy, Beginning Writing

Utilizing Deep Learning AI to Analyze Scientific Models: Overcoming Challenges

Peer reviewed

Direct link

Tingting Li; Kevin Haudek; Joseph Krajcik – Journal of Science Education and Technology, 2025

Scientific modeling is a vital educational practice that helps students apply scientific knowledge to real-world phenomena. Despite advances in AI, challenges in accurately assessing such models persist, primarily due to the complexity of cognitive constructs and data imbalances in educational settings. This study addresses these challenges by…

Descriptors: Artificial Intelligence, Scientific Concepts, Models, Automation

Resolving and Re-Scoring Constructed Response Items in Mixed-Format Assessments: An Exploration of Three Approaches

Peer reviewed

Direct link

Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024

We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…

Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners

Exploring an Alternative to Record Motor Competence Assessment: Interrater and Intrarater Audio-Video Reliability

Peer reviewed

Direct link

Cristina Menescardi; Aida Carballo-Fazanes; Núria Ortega-Benavent; Isaac Estevan – Journal of Motor Learning and Development, 2024

The Canadian Agility and Movement Skill Assessment (CAMSA) is a valid and reliable circuit-based test of motor competence which can be used to assess children's skills in a live or recorded performance and then coded. We aimed to analyze the intrarater reliability of the CAMSA scores (total, time, and skill score) and time measured, by comparing…

Descriptors: Interrater Reliability, Evaluators, Scoring, Psychomotor Skills

Comparative Judgement for Evaluating Young Learners' EFL Writing Performances: Reliability and Teacher Perceptions of Holistic and Dimension-Based Judgements

Peer reviewed

Direct link

Rebecca Sickinger; Tineke Brunfaut; John Pill – Language Testing, 2025

Comparative Judgement (CJ) is an evaluation method, typically conducted online, whereby a rank order is constructed, and scores calculated, from judges' pairwise comparisons of performances. CJ has been researched in various educational contexts, though only rarely in English as a Foreign Language (EFL) writing settings, and is generally agreed to…

Descriptors: Writing Evaluation, English (Second Language), Second Language Learning, Second Language Instruction

Examining Human and Automated Ratings of Elementary Students' Writing Quality: A Multivariate Generalizability Theory Application

Peer reviewed

Direct link

Chen, Dandan; Hebert, Michael; Wilson, Joshua – American Educational Research Journal, 2022

We used multivariate generalizability theory to examine the reliability of hand-scoring and automated essay scoring (AES) and to identify how these scoring methods could be used in conjunction to optimize writing assessment. Students (n = 113) included subsamples of struggling writers and non-struggling writers in Grades 3-5 drawn from a larger…

Descriptors: Reliability, Scoring, Essays, Automation

Reconsidering the Assessment Policy: Practical Use of Liberal Multiple-Choice Tests (SAC Method)

Peer reviewed
PDF on ERIC

Download full text

Cesur, Kursat – Educational Policy Analysis and Strategic Research, 2019

Examinees' performances are assessed using a wide variety of different techniques. Multiple-choice (MC) tests are among the most frequently used ones. Nearly, all standardized achievement tests make use of MC test items and there is a variety of ways to score these tests. The study compares number right and liberal scoring (SAC) methods. Mixed…

Descriptors: Multiple Choice Tests, Scoring, Evaluation Methods, Guessing (Tests)

Previous Page | Next Page »

Pages: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10

ETS Research Report Series	3
Journal of Psychoeducational…	3
Online Submission	3
ProQuest LLC	3
Assessing Writing	2
Bill & Melinda Gates…	2
College Teaching	2
Educational Measurement:…	2
Educational and Psychological…	2
English Language Teaching	2
Grantee Submission	2
Journal of Experimental…	2
Journal of Science Education…	2
Journal of Technology,…	2
Language Testing	2
Language Testing in Asia	2
National Center for Research…	2
Psychology in the Schools	2
Active Learning in Higher…	1
Advances in Health Sciences…	1
American Educational Research…	1
Applied Measurement in…	1
Applied Psychological…	1
Assessment	1
Assessment and Evaluation in…	1
More ▼

Jaeger, Richard M.	4
Gearhart, Maryl	3
Koretz, Daniel	3
Algina, James	2
Bejar, Isaac I.	2
Busch, John Christian	2
Darling-Hammond, Linda	2
Friedman, Greg	2
Hambleton, Ronald K.	2
Hebert, Michael	2
Kane, Thomas J.	2
McGinty, Dixie	2
McLaughlin, Tara W.	2
Michaels, Hillary	2
Neel, John H.	2
Ochieng, Charles	2
Snyder, Patricia A.	2
Staiger, Douglas O.	2
Stefanie A. Wind	2
Yangmeng Xu	2
Yen, Shu Jing	2
Aida Carballo-Fazanes	1
Akihito Kamata	1
Aksu, Gökhan	1
More ▼

Journal Articles	80
Reports - Research	75
Reports - Evaluative	35
Speeches/Meeting Papers	23
Reports - Descriptive	10
Tests/Questionnaires	7
Information Analyses	6
Opinion Papers	6
Numerical/Quantitative Data	4
Books	3
Dissertations/Theses -…	3
ERIC Publications	3
Guides - Non-Classroom	3
Collected Works - General	2
ERIC Digests in Full Text	2
Guides - Classroom - Teacher	2
Reports - General	2
Guides - General	1
Historical Materials	1
Reference Materials -…	1
More ▼