Publication Date
In 2025 | 0 |
Since 2024 | 1 |
Since 2021 (last 5 years) | 3 |
Since 2016 (last 10 years) | 13 |
Since 2006 (last 20 years) | 29 |
Descriptor
Evaluation Methods | 38 |
Interrater Reliability | 38 |
Statistical Analysis | 38 |
Correlation | 11 |
Foreign Countries | 7 |
Measures (Individuals) | 7 |
Evaluators | 5 |
Test Reliability | 5 |
Comparative Analysis | 4 |
Interviews | 4 |
Measurement Techniques | 4 |
More ▼ |
Source
Author
Publication Type
Education Level
Audience
Practitioners | 1 |
Location
Florida | 2 |
Netherlands | 2 |
Asia | 1 |
Colombia | 1 |
Malaysia | 1 |
Nigeria | 1 |
North Carolina | 1 |
United Kingdom (Great Britain) | 1 |
Laws, Policies, & Programs
Assessments and Surveys
Autism Diagnostic Observation… | 1 |
Child Behavior Checklist | 1 |
MacArthur Communicative… | 1 |
Mullen Scales of Early… | 1 |
Reading Miscue Inventory | 1 |
What Works Clearinghouse Rating
Elayne P. Colón; Lori M. Dassa; Thomas M. Dana; Nathan P. Hanson – Action in Teacher Education, 2024
To meet accreditation expectations, teacher preparation programs must demonstrate their candidates are evaluated using summative assessment tools that yield sound, reliable, and valid data. These tools are primarily used by the clinical experience team -- university supervisors and mentor teachers. Institutional beliefs regarding best practices…
Descriptors: Student Teachers, Teacher Interns, Evaluation Methods, Interrater Reliability
Holcomb, T. Scott; Lambert, Richard; Bottoms, Bryndle L. – Journal of Educational Supervision, 2022
In this study, various statistical indexes of agreement were calculated using empirical data from a group of evaluators (n = 45) of early childhood teachers. The group of evaluators rated ten fictitious teacher profiles using the North Carolina Teacher Evaluation Process (NCTEP) rubric. The exact and adjacent agreement percentages were calculated…
Descriptors: Interrater Reliability, Teacher Evaluation, Statistical Analysis, Early Childhood Teachers
Practices in Instrument Use and Development in "Chemistry Education Research and Practice" 2010-2021
Lazenby, Katherine; Tenney, Kristin; Marcroft, Tina A.; Komperda, Regis – Chemistry Education Research and Practice, 2023
Assessment instruments that generate quantitative data on attributes (cognitive, affective, behavioral, "etc.") of participants are commonly used in the chemistry education community to draw conclusions in research studies or inform practice. Recently, articles and editorials have stressed the importance of providing evidence for the…
Descriptors: Chemistry, Periodicals, Journal Articles, Science Education
Looney, Marilyn A. – Measurement in Physical Education and Exercise Science, 2018
The purpose of this article was two-fold (1) provide an overview of the commonly reported and under-reported absolute agreement indices in the kinesiology literature for continuous data; and (2) present examples of these indices for hypothetical data along with recommendations for future use. It is recommended that three types of information be…
Descriptors: Interrater Reliability, Evaluation Methods, Kinetics, Indexes
Wind, Stefanie A.; Peterson, Meghan E. – Language Testing, 2018
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessments worldwide. Using a systematic literature review, the purpose of this study is to identify and explore the dominant methods for evaluating rating quality within the context of research on…
Descriptors: Language Tests, Evaluators, Evaluation Methods, Interrater Reliability
Cousineau, Denis; Laurencelle, Louis – Educational and Psychological Measurement, 2017
Assessing global interrater agreement is difficult as most published indices are affected by the presence of mixtures of agreements and disagreements. A previously proposed method was shown to be specifically sensitive to global agreement, excluding mixtures, but also negatively biased. Here, we propose two alternatives in an attempt to find what…
Descriptors: Interrater Reliability, Evaluation Methods, Statistical Bias, Accuracy
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Lambie, Glenn W.; Mullen, Patrick R.; Swank, Jacqueline M.; Blount, Ashley – Measurement and Evaluation in Counseling and Development, 2018
Supervisors evaluated counselors-in-training at multiple points during their practicum experience using the Counseling Competencies Scale (CCS; N = 1,070). The CCS evaluations were randomly split to conduct exploratory factor analysis and confirmatory factor analysis, resulting in a 2-factor model (61.5% of the variance explained).
Descriptors: Counselor Training, Counseling, Measures (Individuals), Competence
Raykov, Tenko; Dimitrov, Dimiter M.; von Eye, Alexander; Marcoulides, George A. – Educational and Psychological Measurement, 2013
A latent variable modeling method for evaluation of interrater agreement is outlined. The procedure is useful for point and interval estimation of the degree of agreement among a given set of judges evaluating a group of targets. In addition, the approach allows one to test for identity in underlying thresholds across raters as well as to identify…
Descriptors: Interrater Reliability, Models, Statistical Analysis, Computation
Kuiken, Folkert; Vedder, Ineke – Language Testing, 2017
The importance of functional adequacy as an essential component of L2 proficiency has been observed by several authors (Pallotti, 2009; De Jong, Steinel, Florijn, Schoonen, & Hulstijn, 2012a, b). The rationale underlying the present study is that the assessment of writing proficiency in L2 is not fully possible without taking into account the…
Descriptors: Second Language Learning, Rating Scales, Computational Linguistics, Persuasive Discourse
Stefanic, Nicholas; Randles, Clint – Music Education Research, 2015
The purpose of this study was to explore the reliability of measures of both individual and group creative work using the consensual assessment technique (CAT). CAT was used to measure individual and group creativity among a population of pre-service music teachers enrolled in a secondary general music class (n = 23) and was evaluated from…
Descriptors: Music Education, Creativity, Preservice Teachers, Music Teachers
Pijl, Mirjam K. J.; Rommelse, Nanda N. J.; Hendriks, Monica; De Korte, Manon W. P.; Buitelaar, Jan K.; Oosterling, Iris J. – Autism: The International Journal of Research and Practice, 2018
The field of early autism research is in dire need of outcome measures that adequately reflect subtle changes in core autistic behaviors. This article compares the ability of a newly developed measure, the Brief Observation of Social Communication Change (BOSCC), and the Autism Diagnostic Observation Schedule (ADOS) to detect changes in core…
Descriptors: Intervention, Autism, Interpersonal Communication, Interrater Reliability
Robertson, Clare; Ramsay, Craig; Gurung, Tara; Mowatt, Graham; Pickard, Robert; Sharma, Pawana – Research Synthesis Methods, 2014
We describe our experience of using a modified version of the Cochrane risk of bias (RoB) tool for randomised and non-randomised comparative studies. Objectives: (1) To assess time to complete RoB assessment; (2) To assess inter-rater agreement; and (3) To explore the association between RoB and treatment effect size. Methods: Cochrane risk of…
Descriptors: Risk, Randomized Controlled Trials, Research Design, Comparative Analysis
A. C., John; Manabete, S. S. – Journal of Education and Practice, 2015
This study sought to determine the procedural influence on internal and external assessment scores of undergraduate research projects in vocational and technical education programmes in the university under study. A survey research design was used for the conduct of this study. The population consisted of 130 lecturers and 1,847 students in the…
Descriptors: Foreign Countries, Undergraduate Students, Student Research, Research Projects
Henderson, Sheila J.; Horton, Ruth A.; Saito, Paul K.; Shorter-Gooden, Kumea – Multicultural Learning and Teaching, 2016
The purpose of this research was to develop a new tool for assessing multicultural and international competency in faculty teaching through vignette scenarios of university classroom critical incidents--across disciplines of clinical and forensics psychology, business, and education. Construct and content validity of the initial draft vignettes…
Descriptors: Scoring Rubrics, Critical Incidents Method, Construct Validity, Content Validity