Publication Date
In 2025 | 2 |
Since 2024 | 2 |
Since 2021 (last 5 years) | 7 |
Since 2016 (last 10 years) | 9 |
Since 2006 (last 20 years) | 25 |
Descriptor
Evaluation Methods | 33 |
Interrater Reliability | 33 |
Reliability | 33 |
Validity | 10 |
Scores | 7 |
Scoring | 6 |
Scoring Rubrics | 6 |
Student Evaluation | 6 |
College Faculty | 5 |
Evaluators | 5 |
Data Analysis | 4 |
Author
Zwaigenbaum, Lonnie | 2 |
Abbott, Maree J. | 1 |
Abou-Khalil, Rima | 1 |
Albert M. Jimenez | 1 |
Andrade, Heidi | 1 |
Baker, Eva L. | 1 |
Barnett, Miya | 1 |
Benyon, Howard E., III. | 1 |
Bergee, Martin J. | 1 |
Brian, Jessica | 1 |
Bryson, Susan E. | 1 |
Publication Type
Journal Articles | 24 |
Reports - Research | 17 |
Reports - Evaluative | 8 |
Speeches/Meeting Papers | 5 |
Dissertations/Theses -… | 2 |
Information Analyses | 2 |
Opinion Papers | 2 |
Reports - Descriptive | 2 |
Books | 1 |
Non-Print Media | 1 |
Education Level
Higher Education | 8 |
Postsecondary Education | 4 |
Elementary Education | 3 |
Elementary Secondary Education | 2 |
Junior High Schools | 2 |
Middle Schools | 2 |
Secondary Education | 2 |
Grade 7 | 1 |
High Schools | 1 |
Audience
Researchers | 4 |
Practitioners | 1 |
Location
Belgium | 1 |
Canada | 1 |
China | 1 |
Connecticut | 1 |
Netherlands | 1 |
North Carolina | 1 |
Pennsylvania | 1 |
United Kingdom (England) | 1 |
West Germany | 1 |
Assessments and Surveys
Childrens Depression Inventory | 1 |
Jonas Flodén – British Educational Research Journal, 2025
This study examines how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
Schmidt, Ellyn M.; Rothenberg, W. Andrew; Davidson, Bridget C.; Barnett, Miya; Jent, Jason; Cadenas, Heleny; Fernandez, Corina; Davis, Eileen – Journal of Behavioral Education, 2023
Measuring classroom behavior among young children is important to guide assessment and intervention decisions, yet there is limited literature on appropriate direct observation tools for this purpose. This article describes the psychometric properties of the Behavior Assessment System for Children, Student Observation System (BASC-3 SOS) with 135…
Descriptors: Young Children, Special Education, Child Behavior, Psychometrics
Shasha Chen; Shaohui Chi; Zuhao Wang – Journal of Baltic Science Education, 2025
Interdisciplinary thinking is critical for equipping students to apply scientific knowledge and tackle societal challenges across various disciplines, which has been recognized as a key objective of twenty-first century science education. However, research on effective interdisciplinary assessment in secondary school science education is still…
Descriptors: Thinking Skills, Interdisciplinary Approach, Science Instruction, Grade 7
Di Rezze, Briano; Gentles, Stephen James; Hidecker, Mary Jo Cooley; Zwaigenbaum, Lonnie; Rosenbaum, Peter; Duku, Eric; Georgiades, Stelios; Roncadin, Caroline; Fang, Hanna; Tajik-Parvinchi, Diana; Viveiros, Helena – Journal of Autism and Developmental Disorders, 2022
The Autism Classification System of Functioning: Social Communication (ACSF) describes social communication functioning levels. First developed for preschoolers with ASD, this study tests an expanded age range (2-to-18 years). The ACSF rates the child's typical and best (i.e., capacity) performance. Qualitative methods tested parent and clinician…
Descriptors: Content Validity, Reliability, Autism Spectrum Disorders, Classification
The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues
Tack, Anaïs; Piech, Chris – International Educational Data Mining Society, 2022
How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports…
Descriptors: Artificial Intelligence, Dialogs (Language), Bayesian Statistics, Decision Making
Rossin, Emily G.; Bergee, Martin J. – Journal of Research in Music Education, 2021
This is the sixth and culminating study in a series whose purpose has been to acquire a conceptual understanding of school band performance and to develop an assessment based on this understanding. With the present study, we cross-validated and applied a rating scale for school band performance. In the cross-validation phase, college students…
Descriptors: Music Education, Music Activities, Music, Performance
Lanah Stafford; Erin Cousins; Linda Bol; Megan Mize – Research & Practice in Assessment, 2023
Integrative learning is an important outcome for graduates of higher education. Therefore, it should be well-defined and assessed reliably. The American Association of Colleges & Universities has developed a rubric to define and assess integrative learning, but it has low reliability. This pilot study examines whether this rubric's reliability…
Descriptors: Scoring Rubrics, Reliability, Evaluation Methods, Faculty Development
Albert M. Jimenez; Sally J. Zepeda – Sage Research Methods Cases, 2017
The work presented in this case study results from a study conducted in 2012-2014 examining a newly created teacher evaluation system to determine the inter-rater reliability of the classroom observation instrument. The teacher evaluation system was the result of a partnership between the school district and the university in the same city…
Descriptors: Case Studies, Interrater Reliability, Teacher Evaluation, Observation
Lin, Chih-Kai – Language Testing, 2017
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…
Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy
Mearman, Kimberly A. – ProQuest LLC, 2013
Because of the critical function of the IEP in the planning and implementation of effective instruction for students with disabilities, educators need a reference to determine the standards of a quality IEP and a process by which to compare an IEP to those standards. A rubric can support educators in examining the quality of IEPs. This study used…
Descriptors: Construct Validity, Reliability, Scoring Rubrics, Individualized Education Programs
Chaplin, Duncan; Gill, Brian; Thompkins, Allison; Miller, Hannah – Regional Educational Laboratory Mid-Atlantic, 2014
Responding to federal and state prompting, school districts across the country are implementing new teacher evaluation systems that aim to increase the rigor of evaluation ratings, better differentiate effective teaching, and support personnel and staff development initiatives that promote teacher effectiveness and ultimately improve student…
Descriptors: Teacher Effectiveness, Public Schools, Teacher Evaluation, Student Surveys
Benyon, Howard E., III. – ProQuest LLC, 2014
This policy analysis project focused on state-level education policy which lacks evaluator training as well as on requirements for research-based best practices. Due to federal mandates and funding as well as accountability to all stakeholders, states are adopting more rigorous evaluation systems. These high-stakes evaluation systems are putting…
Descriptors: Educational Policy, Policy Analysis, Evaluators, Professional Training
Greenberg, Kathleen Puglisi – Teaching of Psychology, 2012
The scoring instrument described in this article is based on a deconstruction of the seven sections of an American Psychological Association (APA)-style empirical research report into a set of learning outcomes divided into content-, expression-, and format-related categories. A double-weighting scheme used to score the report yields a final grade…
Descriptors: Scoring, Research Reports, Grading, Outcome Measures
Haley, Katarina L.; Jacks, Adam; de Riesthal, Michael; Abou-Khalil, Rima; Roth, Heidi L. – Journal of Speech, Language, and Hearing Research, 2012
Purpose: We explored the reliability and validity of 2 quantitative approaches to document presence and severity of speech properties associated with apraxia of speech (AOS). Method: A motor speech evaluation was administered to 39 individuals with aphasia. Audio-recordings of the evaluation were presented to 3 experienced clinicians to determine…
Descriptors: Neurological Impairments, Speech Impairments, Speech Evaluation, Evaluation Methods
Sheehan, Dwayne P.; Lafave, Mark R.; Katz, Larry – Measurement in Physical Education and Exercise Science, 2011
This study was designed to test the intra- and inter-rater reliability of the University of North Carolina's Balance Error Scoring System in 9- and 10-year-old children. Additionally, a modified version of the Balance Error Scoring System was tested to determine if it was more sensitive in this population ("raw scores"). Forty-six…
Descriptors: Elementary School Students, Interrater Reliability, Scoring, Raw Scores