Publication Date
In 2025 | 3 |
Since 2024 | 4 |
Since 2021 (last 5 years) | 10 |
Since 2016 (last 10 years) | 33 |
Since 2006 (last 20 years) | 97 |
Descriptor
Evaluation Methods | 141 |
Scores | 141 |
Test Validity | 141 |
Test Reliability | 65 |
Student Evaluation | 40 |
Foreign Countries | 26 |
Test Construction | 24 |
Standardized Tests | 20 |
Correlation | 19 |
Measurement Techniques | 19 |
Psychometrics | 19 |
Author
Erford, Bradley T. | 2 |
Frazier, Thomas W. | 2 |
Kane, Michael T. | 2 |
McIntyre, Nancy | 2 |
Mundy, Peter | 2 |
Novotny, Stephanie | 2 |
Oswald, Tasha | 2 |
Ryser, Gail R. | 2 |
Swain-Lerro, Lindsey | 2 |
Youngstrom, Eric A. | 2 |
Zajic, Matt | 2 |
Location
Illinois | 3 |
Massachusetts | 3 |
United Kingdom | 3 |
United States | 3 |
Florida | 2 |
Germany | 2 |
Kenya | 2 |
Michigan | 2 |
Minnesota | 2 |
North Carolina | 2 |
Texas | 2 |
Laws, Policies, & Programs
Elementary and Secondary… | 2 |
Comprehensive Education… | 1 |
Elementary and Secondary… | 1 |
Every Student Succeeds Act… | 1 |
Individuals with Disabilities… | 1 |
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Shun-Fu Hu; Amery D. Wu; Jake Stone – Journal of Educational Measurement, 2025
Scoring high-dimensional assessments (e.g., > 15 traits) can be a challenging task. This paper introduces the multilabel neural network (MNN) as a scoring method for high-dimensional assessments. Additionally, it demonstrates how MNN can score the same test responses to maximize different performance metrics, such as accuracy, recall, or…
Descriptors: Tests, Testing, Scores, Test Construction
Karen Blackburn Hoeve – ProQuest LLC, 2021
High stakes test-based accountability systems primarily rely on aggregates and derivatives of scores from tests that were originally developed to measure individual student mastery of content specifications. Current validity models do not explicitly address this use of aggregate scores to measure the performance of teachers, administrators, and…
Descriptors: Accountability, Test Validity, High Stakes Tests, Hierarchical Linear Modeling
Kelsey Nason; Christine E. DeMars – Research & Practice in Assessment, 2023
Universities administer assessments for accountability and program improvement. Student effort is low during assessments due to minimal perceived consequences. The effects of low effort are compounded by assessment context. This project investigates validity concerns caused by minimal effort and exacerbated by contextual factors. Systematic…
Descriptors: Test Validity, COVID-19, Pandemics, Environmental Influences
Stephen M. Leach; Jason C. Immekus; Jeffrey C. Valentine; Prathiba Batley; Dena Dossett; Tamara Lewis; Thomas Reece – Assessment for Effective Intervention, 2025
Educators commonly use school climate survey scores to inform and evaluate interventions for equitably improving learning and reducing educational disparities. Unfortunately, validity evidence to support these (and other) score uses often falls short. In response, Whitehouse et al. proposed a collaborative, two-part validity testing framework for…
Descriptors: School Surveys, Measurement, Hierarchical Linear Modeling, Educational Environment
An, Lily Shiao; Ho, Andrew Dean; Davis, Laurie Laughlin – Educational Measurement: Issues and Practice, 2022
Technical documentation for educational tests focuses primarily on properties of individual scores at single points in time. Reliability, standard errors of measurement, item parameter estimates, fit statistics, and linking constants are standard technical features that external stakeholders use to evaluate items and individual scale scores.…
Descriptors: Documentation, Scores, Evaluation Methods, Longitudinal Studies
Emre Zengin; Yasemin Karal – International Journal of Assessment Tools in Education, 2024
This study was carried out to develop a test to assess algorithmic thinking skills. To this end, the twelve steps suggested by Downing (2006) were adopted. Throughout the test development, 24 middle school sixth-grade students and eight experts in different areas took part as needed in the tasks on the project. The test was given to 252 students…
Descriptors: Grade 6, Algorithms, Thinking Skills, Evaluation Methods
Mattern, Krista; Radunzel, Justine – ACT, Inc., 2019
When applicants take the ACT® more than once, how do colleges and universities reconcile and make sense of the multiple scores? In terms of validity, fairness, and impact on subgroup differences, are certain score-use policies better than others? The focus of this issue brief is to summarize evidence on the validity and fairness of various…
Descriptors: Scoring, College Entrance Examinations, Test Validity, Evaluation Methods
Baraldi Cunha, Andrea; Babik, Iryna; Koziol, Natalie A.; Hsu, Lin-Ya; Nord, Jayden; Harbourne, Regina T.; Westcott-McCoy, Sarah; Dusing, Stacey C.; Bovaird, James A.; Lobo, Michele A. – Grantee Submission, 2021
Purpose: To evaluate the validity, reliability, and sensitivity of the novel Means-End Problem-Solving Assessment Tool (MEPSAT). Methods: Children with typical development and those with motor delay were assessed throughout the first 2 years of life using the MEPSAT. MEPSAT scores were validated against the cognitive and motor subscales of the…
Descriptors: Problem Solving, Early Intervention, Evaluation Methods, Motor Development
Lynch, Sarah – Practical Assessment, Research & Evaluation, 2022
In today's digital age, tests are increasingly being delivered on computers. Many of these computer-based tests (CBTs) have been adapted from paper-based tests (PBTs). However, this change in mode of test administration has the potential to introduce construct-irrelevant variance, affecting the validity of score interpretations. Because of this,…
Descriptors: Computer Assisted Testing, Tests, Scores, Scoring
Interaction, Change, and the Role of the Historical in Validation: The Case of L2 Dynamic Assessment
Poehner, Matthew E.; van Compernolle, Rémi A. – Journal of Cognitive Education and Psychology, 2018
This article examines the implications of argument-based validity for the continued development of dynamic assessment (DA) research and practice. We propose that the move toward validation as a process of interpretation and evidence-based argument is commensurable with DA but that fundamental ontological differences with conventional approaches to…
Descriptors: Alternative Assessment, Evaluation Methods, Second Language Learning, Interaction
Reardon, Sean F.; Ho, Andrew D.; Kalogrides, Demetra – Stanford Center for Education Policy Analysis, 2019
Linking score scales across different tests is considered speculative and fraught, even at the aggregate level (Feuer et al., 1999; Thissen, 2007). We introduce and illustrate validation methods for aggregate linkages, using the challenge of linking U.S. school district average test scores across states as a motivating example. We show that…
Descriptors: Test Validity, Evaluation Methods, School Districts, Scores
Morphew, Jason W.; Mestre, Jose P.; Kang, Hyeon-Ah; Chang, Hua-Hua; Fabry, Gregory – Physical Review Physics Education Research, 2018
Prior research has established that students often underprepare for midterm examinations yet remain overconfident in their proficiency. Research concerning the testing effect has demonstrated that utilizing testing as a study strategy leads to higher performance and more accurate confidence compared to more common study strategies such as…
Descriptors: Computer Assisted Testing, Physics, Science Instruction, Introductory Courses
Wakabayashi, Tomoko; Claxton, Jill; Smith, Everett V., Jr. – Journal of Psychoeducational Assessment, 2019
The Child Observation Record (COR), initially developed in 1993 by HighScope Educational Research Foundation, is an observation-based instrument that provides systematic assessment of young children's knowledge and abilities in all major areas of development. Teachers or caregivers spend a few minutes each day writing brief notes or…
Descriptors: Observation, Evaluation Methods, Early Childhood Education, Kindergarten
Daniel Rodriguez-Segura; Beth E. Schueler – Annenberg Institute for School Reform at Brown University, 2022
School closures induced by COVID-19 placed heightened emphasis on alternative ways to measure student learning besides in-person exams. We leverage the administration of phone-based assessments (PBAs) measuring numeracy and literacy for primary school children in Kenya, along with in-person standardized tests administered to the same students…
Descriptors: Foreign Countries, School Closing, COVID-19, Pandemics