Publication Date
In 2025: 4
Since 2024: 18
Since 2021 (last 5 years): 76
Since 2016 (last 10 years): 152
Since 2006 (last 20 years): 224
Author
Trofimovich, Pavel: 6
Attali, Yigal: 3
Coniam, David: 3
King, Jean A.: 3
Linacre, John M.: 3
McDonough, Kim: 3
Saito, Kazuya: 3
Wind, Stefanie A.: 3
Brown, Robert D.: 2
Chafouleas, Sandra M.: 2
Chambers, Lucy: 2
Audience
Researchers: 6
Practitioners: 1
Location
China: 11
United States: 8
Canada: 5
Turkey: 5
United Kingdom: 5
Europe: 4
India: 4
Iran: 4
Japan: 4
Australia: 3
Hong Kong: 3
Laws, Policies, & Programs
Higher Education Act 1965: 1
No Child Left Behind Act 2001: 1
Timothy J. Wood; Vijay J. Daniels; Debra Pugh; Claire Touchie; Samantha Halman; Susan Humphrey-Murto – Advances in Health Sciences Education, 2024
First impressions can influence rater-based judgments, but their contribution to rater bias is unclear. Research suggests raters can overcome first impressions in experimental exam contexts with explicit first impressions, but these findings may not generalize to a workplace context with implicit first impressions. The study had two aims. First, to…
Descriptors: Evaluators, Work Environment, Decision Making, Video Technology
Ethan Weed; Riccardo Fusaroli; Elizabeth Simmons; Inge-Marie Eigsti – Language Learning and Development, 2024
The current study investigated whether the difficulty in finding group differences in prosody between speakers with autism spectrum disorder (ASD) and neurotypical (NT) speakers might be explained by identifying different acoustic profiles of speakers which, while still perceived as atypical, might be characterized by different acoustic qualities.…
Descriptors: Network Analysis, Autism Spectrum Disorders, Intonation, Suprasegmentals
Yuan Tian; Xi Yang; Suhail A. Doi; Luis Furuya-Kanamori; Lifeng Lin; Joey S. W. Kwong; Chang Xu – Research Synthesis Methods, 2024
RobotReviewer is a tool for automatically assessing the risk of bias in randomized controlled trials, but there is limited evidence of its reliability. We evaluated the agreement between RobotReviewer and humans regarding the risk of bias assessment based on 1955 randomized controlled trials. The risk of bias in these trials was assessed via two…
Descriptors: Risk, Randomized Controlled Trials, Classification, Robotics
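As a rough illustration of the kind of tool-versus-human agreement analysis described in the entry above, the sketch below computes Cohen's kappa on invented risk-of-bias labels; the study itself may report different agreement statistics.

```python
# Sketch: chance-corrected agreement between an automated tool and human
# reviewers on a three-level risk-of-bias judgment (low / unclear / high).
# The labels below are invented for illustration, not data from the study.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters judging the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

robot  = ["low", "high", "unclear", "low", "high", "low", "unclear", "low"]
humans = ["low", "high", "high",    "low", "high", "low", "unclear", "unclear"]
print(f"kappa = {cohens_kappa(robot, humans):.2f}")
```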
Qusai Khraisha; Sophie Put; Johanna Kappenberg; Azza Warraitch; Kristin Hadfield – Research Synthesis Methods, 2024
Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained…
Descriptors: Peer Evaluation, Research Reports, Artificial Intelligence, Computer Software
Christopher D. Daniel – ProQuest LLC, 2024
Districts spend thousands of dollars on computerized teacher screeners without knowing whether they identify the most effective teachers. Hiring quality staff is one of a principal's most important job functions, yet a screener score may eliminate an effective teacher. The current study examined the value of teacher…
Descriptors: Teacher Evaluation, Scores, Screening Tests, Teacher Effectiveness
Taylor, Tessa; Lanovaz, Marc J. – Journal of Applied Behavior Analysis, 2022
Behavior analysts typically rely on visual inspection of single-case experimental designs to make treatment decisions. However, visual inspection is subjective, which has led to the development of supplemental objective methods such as the conservative dual-criteria method. To replicate and extend a study conducted by Wolfe et al. (2018) on the…
Descriptors: Visual Perception, Artificial Intelligence, Decision Making, Evaluators
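The conservative dual-criteria method named above is commonly described as projecting two baseline-derived criterion lines into the treatment phase and counting the points that exceed both. The sketch below follows that description with invented data and an exact binomial criterion; it is an illustration, not the authors' implementation.

```python
# Sketch of the conservative dual-criteria (CDC) idea as it is commonly
# described: shift the baseline mean line and the baseline OLS trend line by
# 0.25 baseline SDs in the expected direction, then count treatment-phase
# points beyond both projected lines. Data and alpha are invented.
import numpy as np
from scipy.stats import binom

def cdc_effect(baseline, treatment, increase=True, alpha=0.05):
    baseline = np.asarray(baseline, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    x_base = np.arange(len(baseline))
    x_treat = np.arange(len(baseline), len(baseline) + len(treatment))
    sign = 1.0 if increase else -1.0
    shift = sign * 0.25 * np.std(baseline, ddof=1)

    mean_line = baseline.mean() + shift                      # shifted mean line
    slope, intercept = np.polyfit(x_base, baseline, 1)
    trend_line = slope * x_treat + intercept + shift         # shifted trend line

    if increase:
        beyond = int(np.sum((treatment > mean_line) & (treatment > trend_line)))
    else:
        beyond = int(np.sum((treatment < mean_line) & (treatment < trend_line)))

    # Smallest count an exact binomial test (p = 0.5) would flag at alpha.
    required = int(binom.ppf(1 - alpha, len(treatment), 0.5)) + 1
    return beyond, required, beyond >= required

print(cdc_effect(baseline=[3, 4, 3, 5, 4, 4], treatment=[6, 7, 6, 8, 7, 9, 8]))
```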
Song, Yoon Ah; Lee, Won-Chan – Applied Measurement in Education, 2022
This article examines the performance of item response theory (IRT) models when double ratings, rather than single ratings, are used as item scores in the presence of rater effects. Study 1 examined the influence of the number of ratings on the accuracy of proficiency estimation in the generalized partial credit model (GPCM). Study 2 compared the accuracy of…
Descriptors: Item Response Theory, Item Analysis, Scores, Accuracy
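For readers unfamiliar with the generalized partial credit model referenced above, the sketch below evaluates its category probabilities for a single item; the parameter values are invented for illustration.

```python
# Sketch: category response probabilities under the generalized partial
# credit model (GPCM). All parameter values are invented for illustration.
import numpy as np

def gpcm_probs(theta, a, b_steps):
    """P(X = k | theta) for k = 0..m, with discrimination a and step
    difficulties b_steps; the k = 0 term of the cumulative sum is fixed at 0."""
    cum = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b_steps)))))
    expcum = np.exp(cum)
    return expcum / expcum.sum()

theta = 0.5                   # examinee proficiency
a = 1.2                       # item discrimination
b_steps = [-1.0, 0.0, 1.3]    # step difficulties for a 0-3 polytomous item
print(np.round(gpcm_probs(theta, a, b_steps), 3))
```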
Kelly, Kate Tremain; Richardson, Mary; Isaacs, Talia – Assessment in Education: Principles, Policy & Practice, 2022
Comparative judgment is gaining popularity as an assessment tool, including for high-stakes testing purposes, despite relatively little research on the use of the technique. Advocates claim two main rationales for its use: that comparative judgment is valid because humans are better at comparative than absolute judgment, and because it distils the…
Descriptors: Comparative Analysis, Evaluation Methods, Evaluative Thinking, High Stakes Tests
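Comparative judgment typically converts judges' pairwise preferences into a scale, often by fitting a Bradley-Terry type model. The sketch below uses invented comparisons and the standard minorization-maximization update; it is illustrative only and is not drawn from the article.

```python
# Sketch: turning pairwise "which response is better?" judgments into a
# scale with a Bradley-Terry model fitted by the standard MM iteration.
# The comparisons are invented; ties and judge effects are ignored.
from collections import defaultdict

def bradley_terry(comparisons, n_iter=200):
    """comparisons: list of (winner, loser) pairs; returns strength per item."""
    items = {x for pair in comparisons for x in pair}
    strength = {i: 1.0 for i in items}
    wins = defaultdict(int)
    for winner, _ in comparisons:
        wins[winner] += 1
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = 0.0
            for winner, loser in comparisons:
                if i == winner:
                    denom += 1.0 / (strength[i] + strength[loser])
                elif i == loser:
                    denom += 1.0 / (strength[i] + strength[winner])
            new[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(new.values())
        strength = {i: v * len(items) / total for i, v in new.items()}
    return strength

judgments = [("essay_A", "essay_B"), ("essay_B", "essay_C"),
             ("essay_A", "essay_C"), ("essay_C", "essay_A"),
             ("essay_A", "essay_B")]
print({k: round(v, 2) for k, v in sorted(bradley_terry(judgments).items())})
```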
Tucker, Susan; Stevahn, Laurie; King, Jean A. – American Journal of Evaluation, 2023
This article compares the purposes and content of the four foundational documents of the American Evaluation Association (AEA): the Program Evaluation Standards, the AEA Public Statement on Cultural Competence in Evaluation, the AEA Evaluator Competencies, and the AEA Guiding Principles. This reflection on alignment is an early effort in the third…
Descriptors: Professionalism, Comparative Analysis, Professional Associations, Program Evaluation
Jordan M. Wheeler; Allan S. Cohen; Shiyu Wang – Journal of Educational and Behavioral Statistics, 2024
Topic models are mathematical and statistical models used to analyze textual data. Their objective is to recover the latent semantic space of a set of related documents: the relationships between documents and words and how those words are used. Topic models are becoming…
Descriptors: Semantics, Educational Assessment, Evaluators, Reliability
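As a minimal, generic illustration of the technique (not the authors' analysis), the sketch below fits a two-topic LDA model to a toy corpus with scikit-learn and prints the top words per topic and each document's topic mixture.

```python
# Sketch: fitting a small topic model (LDA) to a toy corpus with scikit-learn
# and inspecting the top words per topic and each document's topic mixture.
# The documents and the choice of two topics are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "raters scored the essays with a holistic rubric",
    "the rubric guided raters toward consistent essay scores",
    "students solved physics problems about force and motion",
    "force diagrams helped students reason about motion",
]

vectorizer = CountVectorizer(stop_words="english").fit(docs)
dtm = vectorizer.transform(docs)                      # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top_terms}")
print(lda.transform(dtm).round(2))                    # document-topic mixtures
```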
Reza Shahi; Hamdollah Ravand; Golam Reza Rohani – International Journal of Language Testing, 2025
The current paper uses the Many Facet Rasch Model to investigate and compare the impact of situations (items) and raters on test takers' performance on the Written Discourse Completion Test (WDCT) and Discourse Self-Assessment Tests (DSAT). In this study, the participants were 110 English as a Foreign Language (EFL) students at…
Descriptors: Comparative Analysis, English (Second Language), Second Language Learning, Second Language Instruction
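The Many Facet Rasch Model adds rater (and other) facets to the usual person and item parameters. As a sketch of how a rating-scale formulation assigns category probabilities, the code below uses invented person, situation, rater, and threshold values; it is not the paper's calibration.

```python
# Sketch: category probabilities under a rating-scale many-facet Rasch model
# with person, situation (item), and rater facets. Parameter values are
# invented; this is not the paper's calibration.
import numpy as np

def mfrm_probs(theta, item_difficulty, rater_severity, thresholds):
    """P(score = k), k = 0..m, where for adjacent categories
    log[P_k / P_(k-1)] = theta - item_difficulty - rater_severity - tau_k."""
    eta = theta - item_difficulty - rater_severity
    cum = np.concatenate(([0.0], np.cumsum(eta - np.asarray(thresholds))))
    expcum = np.exp(cum)
    return expcum / expcum.sum()

theta = 1.0                      # test-taker ability (logits)
item_difficulty = 0.3            # difficulty of one WDCT situation
rater_severity = 0.5             # severity of one rater
thresholds = [-1.2, 0.0, 1.2]    # category thresholds for a 0-3 scale
print(np.round(mfrm_probs(theta, item_difficulty, rater_severity, thresholds), 3))
```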
Yubin Xu; Lin Liu; Jianwen Xiong; Guangtian Zhu – Journal of Baltic Science Education, 2025
As the development and application of large language models (LLMs) in physics education progress, the well-known AI-based chatbot ChatGPT4 has presented numerous opportunities for educational assessment. Investigating the potential of AI tools in practical educational assessment carries profound significance. This study explored the comparative…
Descriptors: Physics, Artificial Intelligence, Computer Software, Accuracy
Kevin C. Haudek; Xiaoming Zhai – International Journal of Artificial Intelligence in Education, 2024
Argumentation, a key scientific practice presented in the "Framework for K-12 Science Education," requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open response assessments, leveraging…
Descriptors: Accuracy, Persuasive Discourse, Artificial Intelligence, Learning Management Systems
Wind, Stefanie A. – Measurement: Interdisciplinary Research and Perspectives, 2022
In many performance assessments, one or two raters from the complete rater pool score each performance, resulting in a sparse rating design in which there are limited observations of each rater relative to the complete sample of students. Although sparse rating designs can be constructed to facilitate estimation of student achievement, the…
Descriptors: Evaluators, Bias, Identification, Performance Based Assessment
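A sparse rating design can be pictured as a rater-by-student assignment in which most cells are empty. The sketch below uses an invented assignment and checks whether the raters stay linked through shared students, a connectedness condition commonly invoked when such designs are constructed; it is illustrative, not the article's method.

```python
# Sketch: a sparse rating design in which each student is scored by only one
# or two raters from the pool, plus a check that the raters remain linked
# through shared students (a condition often required to place raters on a
# common scale). The assignment below is invented.
from collections import defaultdict, deque

assignments = {            # student -> raters who scored that student
    "s1": ["r1", "r2"], "s2": ["r2"], "s3": ["r2", "r3"],
    "s4": ["r3"], "s5": ["r3", "r4"], "s6": ["r4"],
}

# Link two raters whenever they scored the same student.
links = defaultdict(set)
for raters in assignments.values():
    for a in raters:
        for b in raters:
            if a != b:
                links[a].add(b)

all_raters = {r for rs in assignments.values() for r in rs}
seen, queue = set(), deque([next(iter(all_raters))])
while queue:                                   # breadth-first search over raters
    rater = queue.popleft()
    if rater not in seen:
        seen.add(rater)
        queue.extend(links[rater] - seen)
print("connected design" if seen == all_raters else "disconnected design")
```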
Fatih Yavuz; Özgür Çelik; Gamze Yavas Çelik – British Journal of Educational Technology, 2025
This study investigates the validity and reliability of generative large language models (LLMs), specifically ChatGPT and Google's Bard, in grading student essays in higher education based on an analytical grading rubric. A total of 15 experienced English as a foreign language (EFL) instructors and two LLMs were asked to evaluate three student…
Descriptors: English (Second Language), Second Language Learning, Second Language Instruction, Computational Linguistics
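One common way to summarize how closely LLM-assigned rubric scores track human scores is quadratic weighted kappa, sketched below on invented scores; the study itself may rely on different reliability indices.

```python
# Sketch: comparing LLM-assigned and instructor-assigned rubric scores with
# quadratic weighted kappa, a statistic often reported for automated essay
# scoring. The scores below are invented, not data from the study.
import numpy as np

def quadratic_weighted_kappa(scores_a, scores_b, n_categories):
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        observed[i, j] += 1
    expected = np.outer(np.bincount(a, minlength=n_categories),
                        np.bincount(b, minlength=n_categories)) / len(a)
    weights = np.array([[(i - j) ** 2 for j in range(n_categories)]
                        for i in range(n_categories)]) / (n_categories - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

human = [3, 2, 4, 3, 1, 2, 4, 3]   # rubric band per essay, 0-4 scale
llm   = [3, 3, 4, 2, 1, 2, 3, 3]
print(round(quadratic_weighted_kappa(human, llm, n_categories=5), 3))
```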