Susan K. Johnsen – Gifted Child Today, 2025
The author provides information about reliability and areas that educators should examine in determining if an assessment is consistent and trustworthy for use, and how it should be interpreted in making decisions about students. Reliability areas that are discussed in the column include internal consistency, test-retest or stability, inter-scorer…
Descriptors: Test Reliability, Academically Gifted, Student Evaluation, Error of Measurement
Jonas Flodén – British Educational Research Journal, 2025
This study compares the performance of the generative AI (GenAI) large language model (LLM) ChatGPT with that of human teachers in grading university exams. Aspects investigated include consistency, large discrepancies and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
Tahereh Firoozi; Hamid Mohammadi; Mark J. Gierl – Journal of Educational Measurement, 2025
The purpose of this study is to describe and evaluate a multilingual automated essay scoring (AES) system for grading essays in three languages. Two different sentence embedding models were evaluated within the AES system, multilingual BERT (mBERT) and language-agnostic BERT sentence embedding (LaBSE). German, Italian, and Czech essays were…
Descriptors: College Students, Slavic Languages, German, Italian
Ole J. Kemi – Advances in Physiology Education, 2025
Students are assessed by coursework and/or exams, all of which are marked by assessors (markers). Student and marker performances are then subject to end-of-session board of examiner handling and analysis. This occurs annually and is the basis for evaluating students but also the wider learning and teaching efficiency of an academic institution.…
Descriptors: Undergraduate Students, Evaluation Methods, Evaluation Criteria, Academic Standards
Riana Nurhayati; Suranto Aw; Siti Irene Astuti Dwiningrum; Mami Hajaroh; Herwin Herwin – International Journal of Educational Methodology, 2024
Evaluation of child-friendly school (CFS) policies is essential to determine the achievements of school efforts in reducing cases of violence. This research aims to prove the reliability and validity of CFS policy evaluation instruments in elementary schools with different locations. This investigation uses the Context Input Process Product (CIPP)…
Descriptors: Validity, Reliability, School Policy, Program Evaluation
Swapneel Thite; Jayashri Ravishankar; Inmaculada Tomeo-Reyes; Araceli Martinez Ortiz – European Journal of Engineering Education, 2024
Effectively working in an engineering workplace requires strong teamwork skills, yet the existing literature within various disciplines reveals discrepancies in evaluating these skills. This complicates the design of a generic teamwork peer evaluation tool for engineering students. This study aims to address this gap by introducing the DRIVE…
Descriptors: Scoring Rubrics, Evaluation Methods, Peer Evaluation, Teamwork
Janice Kinghorn; Katherine McGuire; Bethany L. Miller; Aaron Zimmerman – Assessment Update, 2024
In this article, the authors share their reflections on how different experiences and paradigms have broadened their understanding of the work of assessment in higher education. As they collaborated to create a panel for the 2024 International Conference on Assessing Quality in Higher Education, they recognized that they, as assessment…
Descriptors: Higher Education, Assessment Literacy, Evaluation Criteria, Evaluation Methods
Shasha Chen; Shaohui Chi; Zuhao Wang – Journal of Baltic Science Education, 2025
Interdisciplinary thinking is critical for equipping students to apply scientific knowledge and tackle societal challenges across various disciplines, which has been recognized as a key objective of twenty-first century science education. However, research on effective interdisciplinary assessment in secondary school science education is still…
Descriptors: Thinking Skills, Interdisciplinary Approach, Science Instruction, Grade 7
Paul Alexander Siegel – ProQuest LLC, 2024
While multimodality and multiliteracies have been concepts for 25 years (Kalantzis & Cope, 2023; The New London Group, 1996), research on and application of these concepts within text complexity measures has been limited. Attempts to assess multiliteracies and multimodality (Jacobs, 2013; Schmerbeck & Lucht, 2017; Wyatt-Smith & Kimber,…
Descriptors: Multiple Literacies, Learning Modalities, Test Validity, Test Reliability
Guido Schwarzer; Gerta Rücker; Cristina Semaca – Research Synthesis Methods, 2024
The "LFK" index has been promoted as an improved method to detect bias in meta-analysis. Putatively, its performance does not depend on the number of studies in the meta-analysis. We conducted a simulation study, comparing the "LFK" index test to three standard tests for funnel plot asymmetry in settings with smaller or larger…
Descriptors: Bias, Meta Analysis, Simulation, Evaluation Methods
Yangmeng Xu; Stefanie A. Wind – Educational Measurement: Issues and Practice, 2025
Double-scoring constructed-response items is a common but costly practice in mixed-format assessments. This study explored the impacts of Targeted Double-Scoring (TDS) and random double-scoring procedures on the quality of psychometric outcomes, including student achievement estimates, person fit, and student classifications under various…
Descriptors: Academic Achievement, Psychometrics, Scoring, Evaluation Methods
Russell P. Houpt; Kevin J. Grimm; Aaron T. McLaughlin; Daryl R. Van Tongeren – Structural Equation Modeling: A Multidisciplinary Journal, 2024
Numerous methods exist to determine the optimal number of classes when using latent profile analysis (LPA), but none are consistently correct. Recently, the likelihood incremental percentage per parameter (LI3P) was proposed as a model effect-size measure. To evaluate the LI3P more thoroughly, we simulated 50,000 datasets, manipulating factors…
Descriptors: Structural Equation Models, Profiles, Sample Size, Evaluation Methods
Tenko Raykov; Bingsheng Zhang – Structural Equation Modeling: A Multidisciplinary Journal, 2024
Multidimensional measuring instruments are often used in behavioral, social, educational, marketing, and biomedical research. For these scales, the paper discusses how to find the optimal score based on their components that is associated with the highest possible reliability. Within the framework of structural equation modeling, an approach to…
Descriptors: Multidimensional Scaling, Measurement Equipment, Measurement Techniques, Test Reliability
Melissa Raspa; Angela Gwaltney; Carla Bann; Jana von Hehn; Timothy A. Benke; Eric D. Marsh; Sarika U. Peters; Amitha Ananth; Alan K. Percy; Jeffrey L. Neul – Journal of Autism and Developmental Disorders, 2025
Rett syndrome is a severe neurodevelopmental disorder that affects about 1 in 10,000 females. Clinical trials of disease modifying therapies are on the rise, but there are few psychometrically sound caregiver-reported outcome measures available to assess treatment benefit. We report on a new caregiver-reported outcome measure, the Rett Caregiver…
Descriptors: Neurodevelopmental Disorders, Genetic Disorders, Females, Test Validity
Qiong Wu; Liping Gu – Sociological Methods & Research, 2024
Family income questions in general purpose surveys are usually collected with either a single-question summary design or a multiple-question disaggregation design. It is unclear how estimates from the two approaches agree with each other. The current paper takes advantage of a large-scale survey that has collected family income with both methods.…
Descriptors: Foreign Countries, Family Income, Questionnaires, Research Design