Publication Date
| In 2026 | 0 |
| Since 2025 | 0 |
| Since 2022 (last 5 years) | 0 |
| Since 2017 (last 10 years) | 11 |
| Since 2007 (last 20 years) | 28 |
Descriptor
| Evaluators | 33 |
| Statistical Analysis | 33 |
| Interrater Reliability | 26 |
| Correlation | 15 |
| Second Language Learning | 12 |
| English (Second Language) | 11 |
| Foreign Countries | 11 |
| Comparative Analysis | 8 |
| Language Tests | 8 |
| Reliability | 8 |
| Scores | 8 |
| More ▼ | |
Source
Author
Publication Type
| Journal Articles | 30 |
| Reports - Research | 23 |
| Reports - Evaluative | 7 |
| Tests/Questionnaires | 6 |
| Dissertations/Theses -… | 2 |
| Information Analyses | 1 |
| Reports - Descriptive | 1 |
Education Level
| Higher Education | 9 |
| Postsecondary Education | 7 |
| Secondary Education | 3 |
| Elementary Secondary Education | 1 |
| Grade 1 | 1 |
| Grade 11 | 1 |
| Grade 6 | 1 |
| Grade 7 | 1 |
Audience
Location
| Hong Kong | 2 |
| United Kingdom | 2 |
| Australia | 1 |
| Finland | 1 |
| Iran | 1 |
| Iran (Tehran) | 1 |
| Israel | 1 |
| Japan | 1 |
| Minnesota | 1 |
| Nigeria | 1 |
| Ohio | 1 |
| More ▼ | |
Laws, Policies, & Programs
Assessments and Surveys
| Test of English as a Foreign… | 4 |
| Flesch Kincaid Grade Level… | 2 |
What Works Clearinghouse Rating
Wind, Stefanie A.; Peterson, Meghan E. – Language Testing, 2018
The use of assessments that require rater judgment (i.e., rater-mediated assessments) has become increasingly popular in high-stakes language assessments worldwide. Using a systematic literature review, the purpose of this study is to identify and explore the dominant methods for evaluating rating quality within the context of research on…
Descriptors: Language Tests, Evaluators, Evaluation Methods, Interrater Reliability
Conger, Anthony J. – Educational and Psychological Measurement, 2017
Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…
Descriptors: Interrater Reliability, Evaluators, Accuracy, Statistical Analysis
Lehan, Tara; Hussey, Heather; Mika, Eva – Journal of University Teaching and Learning Practice, 2016
Throughout the dissertation process, the chair and committee members provide feedback regarding quality to help the doctoral candidate to produce the highest-quality document and become an independent scholar. Nevertheless, results of previous research suggest that overall dissertation quality generally is poor. Because much of the feedback about…
Descriptors: Graduate Students, Doctoral Dissertations, Student Evaluation, Feedback (Response)
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Virtanen, T. E.; Pakarinen, E.; Lerkkanen, M.-K.; Poikkeus, A.-M.; Siekkinen, M.; Nurmi, J.-E. – Journal of Early Adolescence, 2018
This study examined the reliability and validity of the Classroom Assessment Scoring System-Secondary (CLASS-S) in Finnish classrooms. Trained observers coded classroom interactions based on video recordings of 46 Grade 6 classrooms (450 cycles). Concurrent associations were investigated with respect to teacher self-ratings (e.g., efficacy beliefs…
Descriptors: Factor Analysis, Classroom Observation Techniques, Foreign Countries, Factor Structure
Morris, Darrell; Pennell, Ashley M.; Perney, Jan; Trathen, Woodrow – Reading Psychology, 2018
This study compared reading rate to reading fluency (as measured by a rating scale). After listening to first graders read short passages, we assigned an overall fluency rating (low, average, or high) to each reading. We then used predictive discriminant analyses to determine which of five measures--accuracy, rate (objective); accuracy, phrasing,…
Descriptors: Reading Fluency, Prediction, Grade 1, Elementary School Students
Ebuoh, Casmir N. – World Journal of Education, 2018
Literature revealed that the patterns/methods of scoring essay tests had been criticized for not being reliable and this unreliability is more likely to be more in internal examinations than in the external examinations. The purpose of this study is to find out the effects of analytical and holistic scoring patterns on scorer reliability in…
Descriptors: Holistic Approach, Scoring, Essay Tests, Biology
Yun, Jiyeo – ProQuest LLC, 2017
Since researchers investigated automatic scoring systems in writing assessments, they have dealt with relationships between human and machine scoring, and then have suggested evaluation criteria for inter-rater agreement. The main purpose of my study is to investigate the magnitudes of and relationships among indices for inter-rater agreement used…
Descriptors: Interrater Reliability, Essays, Scoring, Evaluators
Al-Harthi, Aisha Salim Ali; Campbell, Chris; Karimi, Arafeh – Computers in the Schools, 2018
This study aimed to develop, validate, and trial a rubric for evaluating the cloud-based learning designs (CBLD) that were developed by teachers using virtual learning environments. The rubric was developed using the technological pedagogical content knowledge (TPACK) framework, with rubric development including content and expert validation of…
Descriptors: Computer Assisted Instruction, Scoring Rubrics, Interrater Reliability, Content Validity
Steedle, Jeffrey T.; Ferrara, Steve – Applied Measurement in Education, 2016
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation.…
Descriptors: Essays, Scoring, Comparative Analysis, Evaluators
Davis, Larry – Language Testing, 2016
Two factors were investigated that are thought to contribute to consistency in rater scoring judgments: rater training and experience in scoring. Also considered were the relative effects of scoring rubrics and exemplars on rater performance. Experienced teachers of English (N = 20) scored recorded responses from the TOEFL iBT speaking test prior…
Descriptors: Evaluators, Oral Language, Scores, Language Tests
Kuiken, Folkert; Vedder, Ineke – Language Testing, 2017
The importance of functional adequacy as an essential component of L2 proficiency has been observed by several authors (Pallotti, 2009; De Jong, Steinel, Florijn, Schoonen, & Hulstijn, 2012a, b). The rationale underlying the present study is that the assessment of writing proficiency in L2 is not fully possible without taking into account the…
Descriptors: Second Language Learning, Rating Scales, Computational Linguistics, Persuasive Discourse
Moshinsky, Avital; Ziegler, David; Gafni, Naomi – International Journal of Testing, 2017
Many medical schools have adopted multiple mini-interviews (MMI) as an advanced selection tool. MMIs are expensive and used to test only a few dozen candidates per day, making it infeasible to develop a different test version for each test administration. Therefore, some items are reused both within and across years. This study investigated the…
Descriptors: Interviews, Medical Schools, Test Validity, Test Reliability
Li, Hui – English Language Teaching, 2016
The aim of the study was to investigate how raters come to their decisions when judging spoken vocabulary. Segmental rating was introduced to quantify raters' decision-making process. It is hoped that this simulated study brings fresh insight to future methodological considerations with spoken data. Twenty trainee raters assessed five Chinese…
Descriptors: Foreign Countries, Evaluators, Interrater Reliability, Decision Making
Skalicky, Stephen; Berger, Cynthia M.; Crossley, Scott A.; McNamara, Danielle S. – Advances in Language and Literary Studies, 2016
A corpus of 313 freshman college essays was analyzed in order to better understand the forms and functions of humor in academic writing. Human ratings of humor and wordplay were statistically aggregated using Factor Analysis to provide an overall "Humor" component score for each essay in the corpus. In addition, the essays were also…
Descriptors: Discourse Analysis, Academic Discourse, Humor, Writing (Composition)

Peer reviewed
Direct link
