Publication Date
In 2025 | 0 |
Since 2024 | 1 |
Since 2021 (last 5 years) | 2 |
Since 2016 (last 10 years) | 8 |
Since 2006 (last 20 years) | 22 |
Descriptor
Reliability | 50 |
Validity | 17 |
Scores | 15 |
Test Construction | 11 |
Error of Measurement | 10 |
Generalizability Theory | 10 |
Test Items | 9 |
Correlation | 8 |
Item Response Theory | 8 |
Comparative Analysis | 6 |
Computation | 6 |
More ▼ |
Source
Applied Measurement in… | 50 |
Author
Feldt, Leonard S. | 6 |
Bandalos, Deborah L. | 2 |
Enders, Craig K. | 2 |
Gao, Xiaohong | 2 |
Kane, Michael | 2 |
Qualls, Audrey L. | 2 |
Shavelson, Richard J. | 2 |
Wise, Steven L. | 2 |
Yen, Wendy M. | 2 |
Bell, Robert M. | 1 |
Bosker, Roel J. | 1 |
More ▼ |
Publication Type
Journal Articles | 50 |
Reports - Research | 31 |
Reports - Evaluative | 16 |
Reports - Descriptive | 3 |
Speeches/Meeting Papers | 2 |
Education Level
Grade 3 | 3 |
Grade 5 | 3 |
Early Childhood Education | 2 |
Elementary Education | 2 |
Elementary Secondary Education | 2 |
Grade 4 | 2 |
Grade 8 | 2 |
Primary Education | 2 |
Grade 1 | 1 |
Grade 12 | 1 |
Grade 2 | 1 |
More ▼ |
Audience
Laws, Policies, & Programs
Assessments and Surveys
Advanced Placement… | 1 |
Iowa Tests of Basic Skills | 1 |
National Assessment of… | 1 |
Program for International… | 1 |
Stanford Achievement Tests | 1 |
Texas Assessment of Academic… | 1 |
United States Medical… | 1 |
What Works Clearinghouse Rating
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring constructed response items (i.e. rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
DeMars, Christine E. – Applied Measurement in Education, 2021
Estimation of parameters for the many-facets Rasch model requires that conditional on the values of the facets, such as person ability, item difficulty, and rater severity, the observed responses within each facet are independent. This requirement has often been discussed for the Rasch models and 2PL and 3PL models, but it becomes more complex…
Descriptors: Item Response Theory, Test Items, Ability, Scores
Wise, Steven L. – Applied Measurement in Education, 2019
The identification of rapid guessing is important to promote the validity of achievement test scores, particularly with low-stakes tests. Effective methods for identifying rapid guesses require reliable threshold methods that are also aligned with test taker behavior. Although several common threshold methods are based on rapid guessing response…
Descriptors: Guessing (Tests), Identification, Reaction Time, Reliability
Wells, Craig S.; Sireci, Stephen G. – Applied Measurement in Education, 2020
Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in…
Descriptors: Growth Models, Reliability, Scores, Error Patterns
Kim, Stella Y.; Lee, Won-Chan – Applied Measurement in Education, 2019
This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study…
Descriptors: Classification, Reliability, Accuracy, Test Format
DeMars, Christine – Applied Measurement in Education, 2015
In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are…
Descriptors: Educational Assessment, Measurement, Data, Generalizability Theory
Schmidgall, Jonathan – Applied Measurement in Education, 2017
This study utilizes an argument-based approach to validation to examine the implications of reliability in order to further differentiate the concepts of score and decision consistency. In a methodological example, the framework of generalizability theory was used to estimate appropriate indices of score consistency and evaluations of the…
Descriptors: Scores, Reliability, Validity, Generalizability Theory
Steedle, Jeffrey T.; Ferrara, Steve – Applied Measurement in Education, 2016
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation.…
Descriptors: Essays, Scoring, Comparative Analysis, Evaluators
Hawley, Leslie R.; Bovaird, James A.; Wu, ChaoRong – Applied Measurement in Education, 2017
Value-added assessment methods have been criticized by researchers and policy makers for a number of reasons. One issue includes the sensitivity of model results across different outcome measures. This study examined the utility of incorporating multivariate latent variable approaches within a traditional value-added framework. We evaluated the…
Descriptors: Value Added Models, Reliability, Multivariate Analysis, Scaling
Clauser, Jerome C.; Clauser, Brian E.; Hambleton, Ronald K. – Applied Measurement in Education, 2014
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a…
Descriptors: Standard Setting (Scoring), Validity, Reliability, Correlation
Kannan, Priya; Sgammato, Adrienne; Tannenbaum, Richard J.; Katz, Irvin R. – Applied Measurement in Education, 2015
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e.,…
Descriptors: Reliability, Standard Setting (Scoring), Cutting Scores, Test Items
Schweig, Jonathan David – Applied Measurement in Education, 2014
Developing indicators that reflect important aspects of school and classroom environments has become central in a nationwide effort to develop comprehensive programs that measure teacher quality and effectiveness. Formulating teacher evaluation policy necessitates accurate and reliable methods for measuring these environmental variables. This…
Descriptors: Error of Measurement, Educational Environment, Classroom Environment, Surveys
Deunk, Marjolein I.; van Kuijk, Mechteld F.; Bosker, Roel J. – Applied Measurement in Education, 2014
Standard setting methods, like the Bookmark procedure, are used to assist education experts in formulating performance standards. Small group discussion is meant to help these experts in setting more reliable and valid cutoff scores. This study is an analysis of 15 small group discussions during two standards setting trajectories and their effect…
Descriptors: Cutting Scores, Standard Setting, Group Discussion, Reading Tests
Oliveri, Maria E.; Ercikan, Kadriye – Applied Measurement in Education, 2011
In this study, we examine the degree of construct comparability and possible sources of incomparability of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test- (examination of…
Descriptors: Foreign Countries, English, French, Tests
Leighton, Jacqueline P. – Applied Measurement in Education, 2013
The Standards for Educational and Psychological Testing indicate that multiple sources of validity evidence should be used to support the interpretation of test scores. In the past decade, examinee response processes, as a source of validity evidence, have received increased attention. However, there have been relatively few methodological studies…
Descriptors: Psychological Testing, Standards, Interviews, Protocol Analysis