Publication Date
In 2025 | 1 |
Since 2024 | 2 |
Since 2021 (last 5 years) | 9 |
Since 2016 (last 10 years) | 27 |
Since 2006 (last 20 years) | 61 |
Descriptor
Comparative Analysis | 65 |
Correlation | 65 |
Interrater Reliability | 65 |
Foreign Countries | 26 |
Statistical Analysis | 19 |
Scores | 17 |
Evaluators | 12 |
Scoring | 12 |
Rating Scales | 11 |
Second Language Learning | 11 |
Student Evaluation | 10 |
Author
Coniam, David | 3 |
Abrami, Philip C. | 1 |
Alsma, Jelmer | 1 |
Amanda Huee-Ping Wong | 1 |
Ames, Catherine | 1 |
Arnold, Mariah | 1 |
Ash, Ivan K. | 1 |
Attali, Yigal | 1 |
Baker-Henningham, Helen | 1 |
Barclay, Alexandra | 1 |
Beare, Paul | 1 |
Publication Type
Journal Articles | 62 |
Reports - Research | 54 |
Reports - Evaluative | 8 |
Tests/Questionnaires | 7 |
Information Analyses | 2 |
Collected Works - Proceedings | 1 |
Dissertations/Theses -… | 1 |
Numerical/Quantitative Data | 1 |
Location
Netherlands | 5 |
Asia | 2 |
Canada | 2 |
China | 2 |
Estonia | 2 |
Florida | 2 |
Greece | 2 |
Hong Kong | 2 |
Japan | 2 |
Pennsylvania | 2 |
Philippines | 2 |
Yubin Xu; Lin Liu; Jianwen Xiong; Guangtian Zhu – Journal of Baltic Science Education, 2025
As the development and application of large language models (LLMs) in physics education progress, the well-known AI-based chatbot ChatGPT4 has presented numerous opportunities for educational assessment. Investigating the potential of AI tools in practical educational assessment carries profound significance. This study explored the comparative…
Descriptors: Physics, Artificial Intelligence, Computer Software, Accuracy
Wind, Stefanie A. – Measurement: Interdisciplinary Research and Perspectives, 2022
In many performance assessments, one or two raters from the complete rater pool scores each performance, resulting in a sparse rating design, where there are limited observations of each rater relative to the complete sample of students. Although sparse rating designs can be constructed to facilitate estimation of student achievement, the…
Descriptors: Evaluators, Bias, Identification, Performance Based Assessment
Jiyeo Yun – English Teaching, 2023
Studies on automatic scoring systems in writing assessments have also evaluated the relationship between human and machine scores for the reliability of automated essay scoring systems. This study investigated the magnitudes of indices for inter-rater agreement and discrepancy, especially regarding human and machine scoring, in writing assessment.…
Descriptors: Meta Analysis, Interrater Reliability, Essays, Scoring
Saito, Kazuya; Macmillan, Konstantinos; Kachlicka, Magdalena; Kunihara, Takuya; Minematsu, Nobuaki – Studies in Second Language Acquisition, 2023
Whereas many scholars have emphasized the relative importance of "comprehensibility" as an ecologically valid goal for L2 speech training, testing, and development, eliciting listeners' judgments is time-consuming. Following calls for research on more efficient L2 speech rating methods in applied linguistics, and growing attention toward…
Descriptors: Second Language Learning, Second Language Instruction, Interrater Reliability, Speech Communication
Swapna Haresh Teckwani; Amanda Huee-Ping Wong; Nathasha Vihangi Luke; Ivan Cherh Chiet Low – Advances in Physiology Education, 2024
The advent of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Gemini, has significantly impacted the educational landscape, offering unique opportunities for learning and assessment. In the realm of written assessment grading, traditionally viewed as a laborious and subjective process, this study sought to…
Descriptors: Accuracy, Reliability, Computational Linguistics, Standards
Manzano, Dexter L. – International Journal of Language Testing, 2022
The increasing popularity of self-assessment prompted several scholars to investigate its effectiveness and accuracy in relation to teacher assessment. However, most of these studies focused only on the consistency estimate perspective. Thus, the current study investigated the interrater reliability between self- and teacher assessment of…
Descriptors: Oral Language, Self Evaluation (Individuals), College Students, Interrater Reliability
Bronkhorst, Hugo; Roorda, Gerrit; Suhre, Cor; Goedhart, Martin – Research in Mathematics Education, 2022
Logical reasoning as part of critical thinking is becoming more and more important to prepare students for their future life in society, work, and study. This article presents the results of a quasi-experimental study with a pre-test-post-test control group design focusing on the effective use of formalisations to support logical reasoning. The…
Descriptors: Mathematics Instruction, Teaching Methods, Logical Thinking, Critical Thinking
Sacristan, Dolly; Martinez, Colleen D. – Journal of Teaching in Social Work, 2023
Social work educators are compelled to use reliable and valid methods to assess student learning outcomes. This study adapted a clinical simulation by integrating traditional role-play of case scenarios and elements of the Objective Structured Clinical Examination, which is often used to assess students' practice skills. Master of Social Work…
Descriptors: Graduate Students, Counselor Training, Masters Programs, Clinical Experience
Wilhelm, Anne Garrison; Gillespie Rouse, Amy; Jones, Francesca – Practical Assessment, Research & Evaluation, 2018
Although inter-rater reliability is an important aspect of using observational instruments, it has received little theoretical attention. In this article, we offer some guidance for practitioners and consumers of classroom observations so that they can make decisions about inter-rater reliability, both for study design and in the reporting of data…
Descriptors: Interrater Reliability, Measurement, Observation, Educational Research
Leaman, Marion C.; Edmonds, Lisa A. – Journal of Speech, Language, and Hearing Research, 2021
Purpose: This study evaluated interrater reliability (IRR) and test-retest stability (TRTS) of seven linguistic measures (percent correct information units, relevance, subject-verb-[object], complete utterance, grammaticality, referential cohesion, global coherence), and communicative success in unstructured conversation and in a story narrative…
Descriptors: Aphasia, Psychometrics, Correlation, Speech Language Pathology
Saluja, Ronak; Cheng, Sierra; delos Santos, Keemo Althea; Chan, Kelvin K. W. – Research Synthesis Methods, 2019
Objective: Various statistical methods have been developed to estimate hazard ratios (HRs) from published Kaplan-Meier (KM) curves for the purpose of performing meta-analyses. The objective of this study was to determine the reliability, accuracy, and precision of four commonly used methods by Guyot, Williamson, Parmar, and Hoyle and Henley.…
Descriptors: Meta Analysis, Reliability, Accuracy, Randomized Controlled Trials
Guo, Xiuyan; Lei, Pui-Wa – International Journal of Testing, 2020
Little research has been done on the effects of peer raters' quality characteristics on peer rating qualities. This study aims to address this gap and investigate the effects of key variables related to peer raters' qualities, including content knowledge, previous rating experience, training on rating tasks, and rating motivation. In an experiment…
Descriptors: Peer Evaluation, Error Patterns, Correlation, Knowledge Level
Yun, Jiyeo – ProQuest LLC, 2017
Since researchers investigated automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring, and then have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used…
Descriptors: Interrater Reliability, Essays, Scoring, Evaluators
Morris, Darrell; Pennell, Ashley M.; Perney, Jan; Trathen, Woodrow – Reading Psychology, 2018
This study compared reading rate to reading fluency (as measured by a rating scale). After listening to first graders read short passages, we assigned an overall fluency rating (low, average, or high) to each reading. We then used predictive discriminant analyses to determine which of five measures--accuracy, rate (objective); accuracy, phrasing,…
Descriptors: Reading Fluency, Prediction, Grade 1, Elementary School Students
van der Scheer, Emmelien A.; Bijlsma, Hannah J. E.; Glas, Cees A. W. – School Effectiveness and School Improvement, 2019
A Bayesian IRT-model approach was used to investigate the validity and reliability of student perceptions of teaching quality. Furthermore, the student perceptions were compared with ratings of teaching quality by external observers. Grade 4 students (n = 675) filled out a questionnaire that was used to measure their opinions about the lessons of…
Descriptors: Student Attitudes, Validity, Interrater Reliability, Correlation