Publication Date
  In 2025: 0
  Since 2024: 1
  Since 2021 (last 5 years): 2
  Since 2016 (last 10 years): 3
  Since 2006 (last 20 years): 22
Descriptor
  Educational Testing: 40
  Error of Measurement: 40
  Scores: 13
  Correlation: 9
  Item Response Theory: 8
  Measurement Techniques: 8
  Models: 8
  Statistical Analysis: 8
  Achievement Tests: 7
  Test Items: 7
  Test Reliability: 7
Education Level
  Elementary Secondary Education: 7
  Elementary Education: 3
  Grade 4: 3
  Grade 3: 2
  Grade 5: 2
  Grade 8: 2
  Secondary Education: 2
  Grade 6: 1
  Grade 7: 1
  Higher Education: 1
  Intermediate Grades: 1
Audience
  Researchers: 1
Location
  California: 3
  New York: 3
  Arizona: 1
  Germany: 1
  Illinois: 1
  Ireland: 1
  Missouri: 1
  New Jersey: 1
  North Carolina: 1
  Tennessee: 1
  Texas: 1
Laws, Policies, & Programs
  No Child Left Behind Act 2001: 1
Assessments and Surveys
  National Assessment of…: 2
  ACT Assessment: 1
  Iowa Tests of Basic Skills: 1
  Measures of Academic Progress: 1
  Sequential Tests of…: 1
  Stanford Achievement Tests: 1
Stefanie A. Wind; Yangmeng Xu – Educational Assessment, 2024
We explored three approaches to resolving or re-scoring constructed-response items in mixed-format assessments: rater agreement, person fit, and targeted double scoring (TDS). We used a simulation study to consider how the three approaches impact the psychometric properties of student achievement estimates, with an emphasis on person fit. We found…
Descriptors: Interrater Reliability, Error of Measurement, Evaluation Methods, Examiners
Hong, Seong Eun; Monroe, Scott; Falk, Carl F. – Journal of Educational Measurement, 2020
In educational and psychological measurement, a person-fit statistic (PFS) is designed to identify aberrant response patterns. For parametric PFSs, valid inference depends on several assumptions, one of which is that the item response theory (IRT) model is correctly specified. Previous studies have used empirical data sets to explore the effects…
Descriptors: Educational Testing, Psychological Testing, Goodness of Fit, Error of Measurement
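As background for parametric person-fit statistics (PFSs) like those examined here, a common example is the standardized log-likelihood statistic l_z. The following Python sketch assumes a 2PL IRT model and made-up item parameters; it illustrates the general idea, not the authors' specific procedure.

```python
import numpy as np

def lz_statistic(responses, theta, a, b):
    """Standardized log-likelihood person-fit statistic (l_z) under a 2PL model.

    responses : 0/1 item scores for one examinee
    theta     : the examinee's ability estimate
    a, b      : item discrimination and difficulty parameters
    Large negative values flag aberrant response patterns.
    """
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # 2PL response probabilities
    q = 1.0 - p
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(q))
    mean = np.sum(p * np.log(p) + q * np.log(q))
    var = np.sum(p * q * np.log(p / q) ** 2)
    return (l0 - mean) / np.sqrt(var)

# Hypothetical five-item example
u = np.array([1, 0, 1, 1, 0])
a = np.array([1.2, 0.8, 1.0, 1.5, 0.9])
b = np.array([-0.5, 0.0, 0.3, 0.8, 1.2])
print(lz_statistic(u, theta=0.4, a=a, b=b))
```

Note that l_z's null distribution is only approximately standard normal when theta is estimated, which is one reason model misspecification (the paper's focus) matters for valid inference.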
Reardon, Sean F.; Kalogrides, Demetra; Ho, Andrew D. – Journal of Educational and Behavioral Statistics, 2021
Linking score scales across different tests is considered speculative and fraught, even at the aggregate level. We introduce and illustrate validation methods for aggregate linkages, using the challenge of linking U.S. school district average test scores across states as a motivating example. We show that aggregate linkages can be validated both…
Descriptors: Equated Scores, Validity, Methods, School Districts
Socha, Alan; DeMars, Christine E.; Zilberberg, Anna; Phan, Ha – International Journal of Testing, 2015
The Mantel-Haenszel (MH) procedure is commonly used to detect items that function differentially for groups of examinees from various demographic and linguistic backgrounds--for example, in international assessments. As in some other DIF methods, the total score is used to match examinees on ability. In thin matching, each of the total score…
Descriptors: Test Items, Educational Testing, Evaluation Methods, Ability Grouping
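For reference, the MH common odds-ratio estimator with thin matching (each total score forms its own stratum) can be sketched briefly. This is a generic illustration with hypothetical inputs, not the authors' implementation; thick matching would instead pool adjacent score levels into wider strata.

```python
import numpy as np

def mh_ddif(correct, group, total_score):
    """Mantel-Haenszel DIF with thin matching: one stratum per total score.

    correct     : 0/1 responses to the studied item
    group       : 0 = reference group, 1 = focal group
    total_score : matching variable (examinee total test score)
    Returns MH D-DIF = -2.35 * ln(alpha_MH), the ETS delta metric.
    """
    num = den = 0.0
    for s in np.unique(total_score):
        k = total_score == s
        A = np.sum((group[k] == 0) & (correct[k] == 1))  # reference, right
        B = np.sum((group[k] == 0) & (correct[k] == 0))  # reference, wrong
        C = np.sum((group[k] == 1) & (correct[k] == 1))  # focal, right
        D = np.sum((group[k] == 1) & (correct[k] == 0))  # focal, wrong
        n = A + B + C + D
        if n > 0:
            num += A * D / n
            den += B * C / n
    return -2.35 * np.log(num / den)  # assumes den > 0 (non-degenerate data)
```

Under the ETS convention, negative MH D-DIF values indicate DIF favoring the reference group.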
Woodruff, David; Traynor, Anne; Cui, Zhongmin; Fang, Yu – ACT, Inc., 2013
Professional standards for educational testing recommend that both the overall standard error of measurement and the conditional standard error of measurement (CSEM) be computed on the score scale used to report scores to examinees. Several methods have been developed to compute scale score CSEMs. This paper compares three methods, based on…
Descriptors: Comparative Analysis, Error of Measurement, Scores, Scaling
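As a point of reference for the quantity being computed (this raw-score form is background, not one of the three scale-score methods the paper compares), Lord's binomial-error model gives the conditional standard error of measurement for a number-correct score x on an n-item test as:

```latex
% Lord's binomial-error CSEM for raw score x on an n-item test
\mathrm{CSEM}(x) = \sqrt{\frac{x\,(n - x)}{n - 1}}
```

Scale-score CSEM methods push this kind of raw-score error through the raw-to-scale-score conversion, which is where methods such as those compared in the paper differ.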
Topczewski, Anna Marie – ProQuest LLC, 2013
Developmental score scales represent the performance of students along a continuum: as students learn more, they move higher along that continuum. Unidimensional item response theory (UIRT) vertical scaling has become a commonly used method to create developmental score scales. Research has shown that UIRT vertical scaling methods can be…
Descriptors: Item Response Theory, Scaling, Scores, Student Development
Gorard, Stephen; Hordosy, Rita; Siddiqui, Nadia – International Education Studies, 2013
This paper re-considers the widespread use of value-added approaches to estimate school "effects", and shows the results to be very unstable over time. The paper uses as an example the contextualised value-added scores of all secondary schools in England. The study asks how many schools with at least 99% of their pupils included in the…
Descriptors: Foreign Countries, Outcomes of Education, Secondary Education, Educational Testing
Zwick, Rebecca – ETS Research Report Series, 2012
Differential item functioning (DIF) analysis is a key component in the evaluation of the fairness and validity of educational tests. The goal of this project was to review the status of ETS DIF analysis procedures, focusing on three aspects: (a) the nature and stringency of the statistical rules used to flag items, (b) the minimum sample size…
Descriptors: Test Bias, Sample Size, Bayesian Statistics, Evaluation Methods
Loeb, Susanna; Candelaria, Christopher A. – Carnegie Foundation for the Advancement of Teaching, 2012
Value-added models measure teacher performance by the test score gains of their students, adjusted for a variety of factors such as the performance of students when they enter the class. The measures are based on desired student outcomes such as math and reading scores, but they have a number of potential drawbacks. One of them is the…
Descriptors: Academic Achievement, Teacher Effectiveness, Scores, Peer Influence
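As a schematic of how such models work (heavily simplified relative to operational value-added models, and with invented data): regress current test scores on entering scores plus teacher indicators, and read the teacher coefficients as value-added estimates.

```python
import numpy as np

# Hypothetical data: post-test score, prior score, and teacher assignment
post = np.array([72.0, 68.0, 90.0, 85.0, 60.0, 75.0])
prior = np.array([70.0, 65.0, 80.0, 78.0, 62.0, 71.0])
teacher = np.array(["A", "A", "B", "B", "C", "C"])

teachers = np.unique(teacher)
# Design matrix: intercept, prior score, teacher dummies
# (first teacher omitted as the reference category)
X = np.column_stack(
    [np.ones_like(prior), prior]
    + [(teacher == t).astype(float) for t in teachers[1:]]
)
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
# beta[2:] are the estimated teacher effects relative to teacher "A"
print(dict(zip(teachers[1:], np.round(beta[2:], 2))))
```

Operational models add many more adjustments (student demographics, classroom composition, shrinkage of noisy estimates), which is precisely where the drawbacks discussed in this piece arise.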
Han, Kyung T. – Practical Assessment, Research & Evaluation, 2012
For several decades, the "three-parameter logistic model" (3PLM) has been the dominant choice for practitioners in the field of educational measurement for modeling examinees' response data from multiple-choice (MC) items. Past studies, however, have pointed out that the c-parameter of 3PLM should not be interpreted as a guessing…
Descriptors: Statistical Analysis, Models, Multiple Choice Tests, Guessing (Tests)
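For readers new to the model under discussion: in the 3PLM, the probability of a correct response has a lower asymptote c, which is why c is loosely read as a "guessing" parameter, the interpretation Han questions. A minimal sketch with made-up parameter values:

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model: probability of a correct response.

    a : discrimination, b : difficulty, c : lower asymptote
    (commonly called the "guessing" parameter, though that reading
    is contested).
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Even a very low-ability examinee keeps a success probability near c:
print(p_3pl(theta=-4.0, a=1.0, b=0.0, c=0.2))  # ~0.21
```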
Boyd, Donald; Lankford, Hamilton; Loeb, Susanna; Wyckoff, James – Journal of Educational and Behavioral Statistics, 2013
Test-based accountability as well as value-added assessments and much experimental and quasi-experimental research in education rely on achievement tests to measure student skills and knowledge. Yet, we know little regarding fundamental properties of these tests, an important example being the extent of measurement error and its implications for…
Descriptors: Accountability, Educational Research, Educational Testing, Error of Measurement
Popham, W. James – Educational Leadership, 2009
If a person were to ask an educator to identify the two most important attributes of an educational test, the response almost certainly would be "validity and reliability." These two tightly wedded concepts have become icons in the field of educational assessment. As far as validity is concerned, the term doesn't refer to the accuracy of a test. Rather,…
Descriptors: Educational Testing, Educational Assessment, Student Evaluation, Test Reliability
Chang, Yuan-chin Ivan; Lu, Hung-Yi – Psychometrika, 2010
Item calibration is an essential issue in modern item response theory (IRT)-based psychological or educational testing. Due to the popularity of computerized adaptive testing, methods to efficiently calibrate new items have become more important than in the era when paper-and-pencil test administration was the norm. There are many calibration…
Descriptors: Test Items, Educational Testing, Adaptive Testing, Measurement
Olsen, Robert B.; Unlu, Fatih; Price, Cristofer; Jaciw, Andrew P. – National Center for Education Evaluation and Regional Assistance, 2011
This report examines the differences in impact estimates and standard errors that arise when these are derived using state achievement tests only (as pre-tests and post-tests), study-administered tests only, or some combination of state- and study-administered tests. State tests may yield different evaluation results relative to a test that is…
Descriptors: Achievement Tests, Standardized Tests, State Standards, Reading Achievement
Haberman, Shelby J. – Journal of Educational and Behavioral Statistics, 2008
In educational tests, subscores are often generated from a portion of the items in a larger test. Guidelines based on mean squared error are proposed to indicate whether subscores are worth reporting. Alternatives considered are direct reports of subscores, estimates of subscores based on total score, combined estimates based on subscores and…
Descriptors: Testing Programs, Regression (Statistics), Scores, Student Evaluation
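The mean-squared-error guideline can be paraphrased as a comparison of proportional reductions in MSE (PRMSE). A simplified sketch of that comparison, assuming the usual classical-test-theory quantities are already estimated (the function name and inputs are illustrative, not Haberman's notation):

```python
def subscore_adds_value(rel_sub, corr_true_sub_total):
    """Haberman-style check: is a subscore worth reporting on its own?

    rel_sub             : reliability of the observed subscore
    corr_true_sub_total : correlation between the true subscore and
                          the observed total score
    The subscore adds value when it predicts its own true score better
    than the total score does, i.e. when its PRMSE is higher.
    """
    prmse_sub = rel_sub                     # PRMSE of the observed subscore
    prmse_total = corr_true_sub_total ** 2  # PRMSE of a total-score estimate
    return prmse_sub > prmse_total

# Hypothetical values: a reliable subscore that is nearly redundant with
# the total score would fail this check.
print(subscore_adds_value(rel_sub=0.75, corr_true_sub_total=0.90))  # False
```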