ERIC - Search Results

Publication Date

In 2025	0
Since 2024	1
Since 2021 (last 5 years)	2
Since 2016 (last 10 years)	8
Since 2006 (last 20 years)	22

Descriptor

Reliability	50
Validity	17
Scores	15
Test Construction	11
Error of Measurement	10
Generalizability Theory	10
Test Items	9
Correlation	8
Item Response Theory	8
Comparative Analysis	6
Computation	6
Elementary Secondary Education	5
Licensing Examinations…	5
Performance Based Assessment	5
Sampling	5
Scoring	5
Accuracy	4
Educational Assessment	4
Elementary School Students	4
Mathematics Tests	4
Monte Carlo Methods	4
Scaling	4
Standard Setting (Scoring)	4
Test Length	4
Test Use	4

Source

Applied Measurement in…

Publication Type

Journal Articles	50
Reports - Research	31
Reports - Evaluative	16
Reports - Descriptive	3
Speeches/Meeting Papers	2

Education Level

Grade 3	3
Grade 5	3
Early Childhood Education	2
Elementary Education	2
Elementary Secondary Education	2
Grade 4	2
Grade 8	2
Primary Education	2
Grade 1	1
Grade 12	1
Grade 2	1
Grade 6	1
High Schools	1
Higher Education	1
Intermediate Grades	1
Middle Schools	1

Location

California	2
Canada	2
California (Los Angeles)	1
Louisiana	1
Netherlands	1
North Carolina	1
Tennessee	1

Assessments and Surveys

Advanced Placement…	1
Iowa Tests of Basic Skills	1
National Assessment of…	1
Program for International…	1
Stanford Achievement Tests	1
Texas Assessment of Academic…	1
United States Medical…	1

Showing 1 to 15 of 50 results

New Tests of Rater Drift in Trend Scoring

Peer reviewed

Donoghue, John R.; Eckerly, Carol – Applied Measurement in Education, 2024

Trend scoring constructed response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…

Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics

Violation of Conditional Independence in the Many-Facets Rasch Model

Peer reviewed

DeMars, Christine E. – Applied Measurement in Education, 2021

Estimation of parameters for the many-facets Rasch model requires that conditional on the values of the facets, such as person ability, item difficulty, and rater severity, the observed responses within each facet are independent. This requirement has often been discussed for the Rasch models and 2PL and 3PL models, but it becomes more complex…

Descriptors: Item Response Theory, Test Items, Ability, Scores

An Information-Based Approach to Identifying Rapid-Guessing Thresholds

Peer reviewed

Wise, Steven L. – Applied Measurement in Education, 2019

The identification of rapid guessing is important to promote the validity of achievement test scores, particularly with low-stakes tests. Effective methods for identifying rapid guesses require reliable threshold methods that are also aligned with test taker behavior. Although several common threshold methods are based on rapid guessing response…

Descriptors: Guessing (Tests), Identification, Reaction Time, Reliability

Evaluating Random and Systematic Error in Student Growth Percentiles

Peer reviewed

Wells, Craig S.; Sireci, Stephen G. – Applied Measurement in Education, 2020

Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in…

Descriptors: Growth Models, Reliability, Scores, Error Patterns

Classification Consistency and Accuracy for Mixed-Format Tests

Peer reviewed

Kim, Stella Y.; Lee, Won-Chan – Applied Measurement in Education, 2019

This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study…

Descriptors: Classification, Reliability, Accuracy, Test Format

Estimating Variance Components from Sparse Data Matrices in Large-Scale Educational Assessments

Peer reviewed

DeMars, Christine – Applied Measurement in Education, 2015

In generalizability theory studies in large-scale testing contexts, sometimes a facet is very sparsely crossed with the object of measurement. For example, when assessments are scored by human raters, it may not be practical to have every rater score all students. Sometimes the scoring is systematically designed such that the raters are…

Descriptors: Educational Assessment, Measurement, Data, Generalizability Theory

Evaluating Score and Decision Consistency across Claims in a Validation Argument

Peer reviewed

Schmidgall, Jonathan – Applied Measurement in Education, 2017

This study utilizes an argument-based approach to validation to examine the implications of reliability in order to further differentiate the concepts of score and decision consistency. In a methodological example, the framework of generalizability theory was used to estimate appropriate indices of score consistency and evaluations of the…

Descriptors: Scores, Reliability, Validity, Generalizability Theory

Evaluating Comparative Judgment as an Approach to Essay Scoring

Peer reviewed

Steedle, Jeffrey T.; Ferrara, Steve – Applied Measurement in Education, 2016

As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation.…

Descriptors: Essays, Scoring, Comparative Analysis, Evaluators

Stability of Teacher Value-Added Rankings across Measurement Model and Scaling Conditions

Peer reviewed

Hawley, Leslie R.; Bovaird, James A.; Wu, ChaoRong – Applied Measurement in Education, 2017

Value-added assessment methods have been criticized by researchers and policy makers for a number of reasons. One issue includes the sensitivity of model results across different outcome measures. This study examined the utility of incorporating multivariate latent variable approaches within a traditional value-added framework. We evaluated the…

Descriptors: Value Added Models, Reliability, Multivariate Analysis, Scaling

Increasing the Validity of Angoff Standards through Analysis of Judge-Level Internal Consistency

Peer reviewed

Clauser, Jerome C.; Clauser, Brian E.; Hambleton, Ronald K. – Applied Measurement in Education, 2014

The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a…

Descriptors: Standard Setting (Scoring), Validity, Reliability, Correlation

Evaluating the Consistency of Angoff-Based Cut Scores Using Subsets of Items within a Generalizability Theory Framework

Peer reviewed

Kannan, Priya; Sgammato, Adrienne; Tannenbaum, Richard J.; Katz, Irvin R. – Applied Measurement in Education, 2015

The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e.,…

Descriptors: Reliability, Standard Setting (Scoring), Cutting Scores, Test Items

Quantifying Error in Survey Measures of School and Classroom Environments

Peer reviewed

Schweig, Jonathan David – Applied Measurement in Education, 2014

Developing indicators that reflect important aspects of school and classroom environments has become central in a nationwide effort to develop comprehensive programs that measure teacher quality and effectiveness. Formulating teacher evaluation policy necessitates accurate and reliable methods for measuring these environmental variables. This…

Descriptors: Error of Measurement, Educational Environment, Classroom Environment, Surveys

The Effect of Small Group Discussion on Cutoff Scores during Standard Setting

Peer reviewed

Deunk, Marjolein I.; van Kuijk, Mechteld F.; Bosker, Roel J. – Applied Measurement in Education, 2014

Standard setting methods, like the Bookmark procedure, are used to assist education experts in formulating performance standards. Small group discussion is meant to help these experts in setting more reliable and valid cutoff scores. This study is an analysis of 15 small group discussions during two standards setting trajectories and their effect…

Descriptors: Cutting Scores, Standard Setting, Group Discussion, Reading Tests

Do Different Approaches to Examining Construct Comparability in Multilanguage Assessments Lead to Similar Conclusions?

Peer reviewed

Oliveri, Maria E.; Ercikan, Kadriye – Applied Measurement in Education, 2011

In this study, we examine the degree of construct comparability and possible sources of incomparability of the English and French versions of the Programme for International Student Assessment (PISA) 2003 problem-solving measure administered in Canada. Several approaches were used to examine construct comparability at the test- (examination of…

Descriptors: Foreign Countries, English, French, Tests

Item Difficulty and Interviewer Knowledge Effects on the Accuracy and Consistency of Examinee Response Processes in Verbal Reports

Peer reviewed

Leighton, Jacqueline P. – Applied Measurement in Education, 2013

The Standards for Educational and Psychological Testing indicate that multiple sources of validity evidence should be used to support the interpretation of test scores. In the past decade, examinee response processes, as a source of validity evidence, have received increased attention. However, there have been relatively few methodological studies…

Descriptors: Psychological Testing, Standards, Interviews, Protocol Analysis

Author

Feldt, Leonard S.	6
Bandalos, Deborah L.	2
Enders, Craig K.	2
Gao, Xiaohong	2
Kane, Michael	2
Qualls, Audrey L.	2
Shavelson, Richard J.	2
Wise, Steven L.	2
Yen, Wendy M.	2
Bell, Robert M.	1
Bosker, Roel J.	1
Bovaird, James A.	1
Brennan, Robert L.	1
Bush, M. Joan	1
Calfee, Robert	1
Candell, Gregory L.	1
Carol Eckerly	1
Case, Susan M.	1
Chinn, Roberta N.	1
Clauser, Brian E.	1
Clauser, Jerome C.	1
Comfort, Kathy	1
Crone, Linda J.	1
DeMars, Christine	1