Publication Date
| In 2026 | 0 |
| Since 2025 | 51 |
| Since 2022 (last 5 years) | 277 |
| Since 2017 (last 10 years) | 773 |
| Since 2007 (last 20 years) | 2035 |
Descriptor
| Interrater Reliability | 3117 |
| Foreign Countries | 653 |
| Evaluation Methods | 502 |
| Test Reliability | 502 |
| Test Validity | 410 |
| Correlation | 401 |
| Scoring | 344 |
| Comparative Analysis | 327 |
| Scores | 324 |
| Validity | 310 |
| Student Evaluation | 306 |
Audience
| Researchers | 130 |
| Practitioners | 42 |
| Teachers | 22 |
| Administrators | 11 |
| Counselors | 3 |
| Policymakers | 2 |
Location
| Australia | 56 |
| Turkey | 53 |
| United Kingdom | 46 |
| Canada | 45 |
| Netherlands | 40 |
| China | 38 |
| California | 37 |
| United States | 30 |
| United Kingdom (England) | 24 |
| Taiwan | 23 |
| Germany | 22 |
What Works Clearinghouse Rating
| Meets WWC Standards without Reservations | 3 |
| Meets WWC Standards with or without Reservations | 3 |
| Does not meet standards | 3 |
Susan K. Johnsen – Gifted Child Today, 2025
The author provides information about reliability and areas that educators should examine in determining if an assessment is consistent and trustworthy for use, and how it should be interpreted in making decisions about students. Reliability areas that are discussed in the column include internal consistency, test-retest or stability, inter-scorer…
Descriptors: Test Reliability, Academically Gifted, Student Evaluation, Error of Measurement
Samuel D'Emanuele; Francesca Nardello; Fabrizio Garau; Diego Campaci; Federico Schena; Cantor Tarperi – Measurement in Physical Education and Exercise Science, 2025
The agreement between a wearable inertial sensor (GYKO, G) and the force platform (P) was assessed by evaluating "test-retest" and "inter-rater reliability." Thirty-eight subjects were enrolled; the selected indices of balance were investigated over foot positions and (un)stable conditions. Intraclass correlation coefficient…
Descriptors: Human Posture, Measurement Equipment, Interrater Reliability, Measurement Techniques
Mazin T. Alqhazo; Tha’er Al-Kadi; Firas S. Alfwaress – Language, Speech, and Hearing Services in Schools, 2025
Purpose: The Stuttering Severity Instrument--Fourth Edition (SSI-4) is unavailable in the Arabic language. The purpose of the current research is to translate the SSI-4 (Riley, 2009) into Arabic and to discuss its validity, as well as its intrajudge and interjudge reliability. Method: Archived videos of 28 school-aged children who stutter ranged in…
Descriptors: Arabic, Translation, Test Validity, Test Reliability
Angus Kittelman; Sara Izzard; Kent McIntosh; Kelsey R. Morris; Timothy J. Lewis – Assessment for Effective Intervention, 2024
The purpose of this study was to evaluate the psychometric properties of the Self-Assessment Survey (SAS) 4.0, an updated measure assessing implementation fidelity of positive behavioral interventions and supports (PBIS). A total of 627 school personnel from 33 schools in six U.S. states completed the SAS 4.0 during the 2021-2022 school year. We…
Descriptors: Positive Behavior Supports, Teachers, Self Evaluation (Individuals), Test Reliability
Marcus Messer; Neil C. C. Brown; Michael Kölling; Miaojing Shi – ACM Transactions on Computing Education, 2025
Providing consistent summative assessment to students is important, as the grades they are awarded affect their progression through university and future career prospects. While small cohorts are typically assessed by a single assessor, such as the module/class leader, larger cohorts are often assessed by multiple assessors, typically teaching…
Descriptors: Foreign Countries, Grading, Interrater Reliability, Teaching Assistants
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT performs in grading university exams compared to human teachers. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
Matthew K. Burns; Heba Z. Abdelnaby; Jonie B. Welland; Katherine A. Graves; Kari Kurto – Assessment for Effective Intervention, 2024
The current study examined the reliability of The Reading League Curriculum-Evaluation Guidelines (CEGs), which were developed to help school-based teams rate the presence of red flags when considering adopting specific literacy curricula. Coders (n = 30) independently used the CEGs to evaluate a free online English language arts curriculum. The…
Descriptors: English Curriculum, English Instruction, Language Arts, Curriculum Evaluation
Enninga, Annemieke; Waninge, Aly; Post, Wendy J.; van der Putten, Annette A. J. – Journal of Applied Research in Intellectual Disabilities, 2023
Background: Persons with profound intellectual and multiple disabilities (PIMD) are vulnerable when it comes to experiencing pain. Reliable assessment of pain-related behaviour in these persons is difficult. Aim: To determine how pain items can be reliably scored in adults with PIMD. Methods: We developed an instruction protocol for the…
Descriptors: Test Reliability, Pain, Behavior, Adults
Ichikowitz, Kerri; Bruce, Carolyn; Meitanis, Vanessa; Cheung, Kelly; Kim, Yekyung; Talbourdet, Esther; Newton, Caroline – International Journal of Language & Communication Disorders, 2023
Background: People with aphasia (PWA) can experience functional numeracy difficulties, that is, problems understanding or using numbers in everyday life, which can have numerous negative impacts on their daily lives. There is growing interest in designing functional numeracy interventions for PWA; however, there are limited suitable assessments…
Descriptors: Test Construction, Test Validity, Numeracy, Adults
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Chase Young; Benjamin Mitchell-Yellin; George Kevin Randall – Active Learning in Higher Education, 2025
The purpose of this study was to develop a valid, reliable, and brief measure of active learning in college classrooms that is cheap and easy to complete and yields results that faculty can easily use to inform their development as instructors. Initial construct and face validity was achieved by modifying existing instruments and creating a draft…
Descriptors: College Faculty, College Students, Active Learning, Classroom Observation Techniques
Aislinn Ganci; Miran Qazizada; Brianna Fehr; Ana Vucenovic; Edmond Lou; Eric Parent – Measurement in Physical Education and Exercise Science, 2024
Spinal alignment can be assessed without radiation using three-dimensional ultrasound imaging (3DUS). Reliable measurements could inform the ideal arm position for scoliosis radiographs. This study determined the inter-evaluator reliability of axial vertebral rotation (AVR) measurements and sagittal curve angles in healthy females from 3DUS spinal…
Descriptors: Foreign Countries, Young Adults, Adults, Adolescents
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring constructed response items (i.e. rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
Rice, C. E.; Carpenter, L. A.; Morrier, M. J.; Lord, C.; DiRienzo, M.; Boan, A.; Skowyra, C.; Fusco, A.; Baio, J.; Esler, A.; Zahorodny, W.; Hobson, N.; Mars, A.; Thurm, A.; Bishop, S.; Wiggins, L. D. – Journal of Autism and Developmental Disorders, 2022
This paper describes a process to define a comprehensive list of exemplars for seven core Diagnostic and Statistical Manual (DSM) diagnostic criteria for autism spectrum disorder (ASD), and report on interrater reliability in applying these exemplars to determine ASD case classification. Clinicians completed an iterative process to map specific…
Descriptors: Autism Spectrum Disorders, Clinical Diagnosis, Test Reliability, Interrater Reliability
Hulteen, Ryan M.; True, Larissa; Kroc, Edward – Measurement in Physical Education and Exercise Science, 2023
The typical process for assessing inter-rater reliability is facilitated by training raters within a research team. Lacking is an understanding if inter-rater reliability scores "between" research teams demonstrate adequate reliability. This study examined inter-rater reliability between 16 researchers who assessed fundamental motor…
Descriptors: Psychomotor Skills, Scores, Reliability, Interrater Reliability

