Publication Date
In 2025: 10
Since 2024: 25
Since 2021 (last 5 years): 60
Since 2016 (last 10 years): 98
Since 2006 (last 20 years): 275
Descriptor
Evaluation Methods: 501
Interrater Reliability: 501
Student Evaluation: 90
Test Reliability: 90
Evaluators: 78
Foreign Countries: 78
Test Validity: 70
Scoring: 67
Correlation: 59
Higher Education: 56
Comparative Analysis: 53
Author
Jaeger, Richard M.: 5
Cason, Carolyn L.: 3
Matson, Johnny L.: 3
Myford, Carol M.: 3
Plake, Barbara S.: 3
Wind, Stefanie A.: 3
Baer, John: 2
Bejar, Isaac I.: 2
Bottoms, Bryndle L.: 2
Brown, William H.: 2
Bursac, Zoran: 2
Audience
Researchers: 39
Practitioners: 13
Teachers: 8
Administrators: 3
Location
Australia: 9
China: 7
Netherlands: 7
North Carolina: 7
United Kingdom (England): 7
Canada: 6
Florida: 6
Israel: 6
United Kingdom: 6
California: 5
Turkey: 5
Laws, Policies, & Programs
No Child Left Behind Act 2001: 2
Race to the Top: 2
Americans with Disabilities…: 1
Rehabilitation Act 1973…: 1
Susan K. Johnsen – Gifted Child Today, 2025
The author provides information about reliability, the areas educators should examine in determining whether an assessment is consistent and trustworthy for use, and how it should be interpreted when making decisions about students. Reliability areas discussed in the column include internal consistency, test-retest or stability, inter-scorer…
Descriptors: Test Reliability, Academically Gifted, Student Evaluation, Error of Measurement
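For readers unfamiliar with how the reliability areas named above are typically quantified, the following minimal sketch (hypothetical data, not drawn from the column) computes internal consistency (Cronbach's alpha), test-retest stability, and inter-scorer agreement in Python.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Internal consistency: items is an (examinees x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical scores: 6 students x 4 items
scores = np.array([[3, 4, 3, 4],
                   [2, 2, 3, 2],
                   [4, 4, 4, 5],
                   [1, 2, 1, 2],
                   [3, 3, 4, 3],
                   [5, 4, 5, 5]])
print("Internal consistency (alpha):", round(cronbach_alpha(scores), 2))

# Test-retest (stability): correlate total scores from two administrations
time1 = scores.sum(axis=1)
time2 = time1 + np.array([0, 1, -1, 0, 1, 0])  # hypothetical second administration
print("Test-retest r:", round(np.corrcoef(time1, time2)[0, 1], 2))

# Inter-scorer reliability: proportion of exact agreement between two scorers
rater_a = np.array([3, 2, 4, 1, 3, 5])
rater_b = np.array([3, 2, 4, 2, 3, 5])
print("Inter-scorer exact agreement:", round(np.mean(rater_a == rater_b), 2))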
Tavares, Walter; Kinnear, Benjamin; Schumacher, Daniel J.; Forte, Milena – Advances in Health Sciences Education, 2023
In this perspective, the authors critically examine "rater training" as it has been conceptualized and used in medical education. By "rater training," they mean the educational events intended to "improve" rater performance and contributions during assessment events. Historically, rater training programs have focused…
Descriptors: Medical Education, Interrater Reliability, Evaluation Methods, Training
Yangmeng Xu; Stefanie A. Wind – Educational Measurement: Issues and Practice, 2025
Double-scoring constructed-response items is a common but costly practice in mixed-format assessments. This study explored the impacts of Targeted Double-Scoring (TDS) and random double-scoring procedures on the quality of psychometric outcomes, including student achievement estimates, person fit, and student classifications under various…
Descriptors: Academic Achievement, Psychometrics, Scoring, Evaluation Methods
Elayne P. Colón; Lori M. Dassa; Thomas M. Dana; Nathan P. Hanson – Action in Teacher Education, 2024
To meet accreditation expectations, teacher preparation programs must demonstrate their candidates are evaluated using summative assessment tools that yield sound, reliable, and valid data. These tools are primarily used by the clinical experience team -- university supervisors and mentor teachers. Institutional beliefs regarding best practices…
Descriptors: Student Teachers, Teacher Interns, Evaluation Methods, Interrater Reliability
Rosanna Cole – Sociological Methods & Research, 2024
The use of inter-rater reliability (IRR) methods may provide an opportunity to improve the transparency and consistency of qualitative case study data analysis in terms of the rigor of how codes and constructs have been developed from the raw data. Few articles on qualitative research methods in the literature conduct IRR assessments, or they neglect to…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Research Methodology
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
Holcomb, T. Scott; Lambert, Richard; Bottoms, Bryndle L. – Journal of Educational Supervision, 2022
In this study, various statistical indexes of agreement were calculated using empirical data from a group of evaluators (n = 45) of early childhood teachers. The group of evaluators rated ten fictitious teacher profiles using the North Carolina Teacher Evaluation Process (NCTEP) rubric. The exact and adjacent agreement percentages were calculated…
Descriptors: Interrater Reliability, Teacher Evaluation, Statistical Analysis, Early Childhood Teachers
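As a minimal illustration of the exact and adjacent agreement percentages reported in the study above, the following sketch uses hypothetical ratings on a 1-5 rubric scale (not the NCTEP data).

import numpy as np

# Ratings of the same profiles by one evaluator and a reference rating
evaluator = np.array([3, 4, 2, 5, 3, 4, 1, 3, 4, 2])
reference = np.array([3, 3, 2, 4, 3, 4, 2, 3, 5, 2])

diff = np.abs(evaluator - reference)
exact = np.mean(diff == 0) * 100      # identical ratings
adjacent = np.mean(diff <= 1) * 100   # identical or one scale point apart
print(f"Exact agreement: {exact:.0f}%  Adjacent agreement: {adjacent:.0f}%")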
Maestrales, Sarah; Zhai, Xiaoming; Touitou, Israel; Baker, Quinton; Schneider, Barbara; Krajcik, Joseph – Journal of Science Education and Technology, 2021
In response to the call for promoting three-dimensional science learning (NRC, 2012), researchers argue for developing assessment items that go beyond rote memorization tasks to ones that require deeper understanding and the use of reasoning that can improve science literacy. Such assessment items are usually performance-based constructed…
Descriptors: Artificial Intelligence, Scoring, Evaluation Methods, Chemistry
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
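As a rough sketch of how a language-agnostic sentence embedding model such as LaBSE can feed a simple scoring model (this is not the authors' AES system; the essay texts and scores below are placeholders), one might do the following.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

model = SentenceTransformer("sentence-transformers/LaBSE")

# Placeholder essays in German, Italian, and Czech with placeholder human scores
train_essays = ["Ein Beispielaufsatz ...", "Un saggio di esempio ...", "Ukazkova esej ..."]
train_scores = [4.0, 3.0, 5.0]

X_train = model.encode(train_essays)              # multilingual sentence embeddings
scorer = Ridge(alpha=1.0).fit(X_train, train_scores)

new_essay = model.encode(["Another essay in any of the three languages ..."])
print("Predicted score:", scorer.predict(new_essay)[0])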
Hosseinali Gholami – Mathematics Teaching Research Journal, 2025
Scoring mathematics exam papers accurately is vital for fostering students' engagement and interest in the subject. Incorrect scoring practices can erode motivation and lead to the development of false self-confidence. Therefore, the implementation of appropriate scoring methods is essential for the success of mathematics education. This study…
Descriptors: Interrater Reliability, Mathematics Teachers, Scoring, Mathematics Tests
Bijani, Houman; Hashempour, Bahareh; Ibrahim, Khaled Ahmed Abdel-Al; Orabah, Salim Said Bani; Heydarnejad, Tahereh – Language Testing in Asia, 2022
Due to subjectivity in oral assessment, much attention has been devoted to obtaining a satisfactory measure of consistency among raters. However, the process for obtaining more consistency might not result in valid decisions. One matter that is at the core of both reliability and validity in oral assessment is rater training. Recently,…
Descriptors: Oral Language, Language Tests, Feedback (Response), Bias
Ilona Rinne – Assessment & Evaluation in Higher Education, 2024
It is widely acknowledged in research that common criteria and aligned standards do not result in consistent assessment of such a complex performance as the final undergraduate thesis. Assessment is determined by examiners' understanding of rubrics and their views on thesis quality. There is still a gap in the research literature about how…
Descriptors: Foreign Countries, Undergraduate Students, Teacher Education Programs, Evaluation Criteria
Yang Yang – Shanlax International Journal of Education, 2024
This paper explores the reliability of using ChatGPT in evaluating EFL writing by assessing its intra- and inter-rater reliability. Eighty-two compositions were randomly sampled from the Written English Corpus of Chinese Learners. These compositions were rated by three experienced raters with regard to 'language', 'content', and 'organization'.…
Descriptors: English (Second Language), Second Language Instruction, Writing (Composition), Evaluation Methods
Lambert, Richard G.; Holcomb, T. Scott; Bottoms, Bryndle L. – Center for Educational Measurement and Evaluation, 2021
The validity of the Kappa coefficient of chance-corrected agreement has been questioned when the prevalence of specific rating scale categories is low and agreement between raters is high. The researchers proposed the Lambda Coefficient of Rater-Mediated Agreement as an alternative to Kappa to address these concerns. Lambda corrects for chance…
Descriptors: Interrater Reliability, Teacher Evaluation, Test Validity, Evaluation Methods
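The prevalence concern described above can be illustrated with a minimal sketch (hypothetical ratings, not the report's data): observed agreement is high, yet Cohen's kappa is near zero because one rating category dominates. The proposed Lambda coefficient itself is not reproduced here.

import numpy as np

def cohens_kappa(r1, r2):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    p_obs = np.mean(r1 == r2)
    # Expected chance agreement from each rater's marginal proportions
    p_exp = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)

# 20 teachers, almost all rated "proficient" (1) by both raters
rater1 = [1] * 18 + [0, 1]
rater2 = [1] * 18 + [1, 0]
print("Observed agreement:", np.mean(np.array(rater1) == np.array(rater2)))  # 0.9
print("Kappa:", round(cohens_kappa(rater1, rater2), 2))  # near zero despite 90% agreement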
Ping-Lin Chuang – Language Testing, 2025
This experimental study explores how source use features impact raters' judgment of argumentation in a second language (L2) integrated writing test. One hundred four experienced and novice raters were recruited to complete a rating task that simulated the scoring assignment of a local English Placement Test (EPT). Sixty written responses were…
Descriptors: Interrater Reliability, Evaluators, Information Sources, Primary Sources