Publication Date
In 2025: 0
Since 2024: 1
Since 2021 (last 5 years): 1
Since 2016 (last 10 years): 6
Since 2006 (last 20 years): 7
Descriptor
Statistical Analysis: 8
Interrater Reliability: 5
Scoring: 4
Correlation: 3
Evaluators: 3
Reliability: 3
Automation: 2
Comparative Analysis: 2
Error of Measurement: 2
Essay Tests: 2
Essays: 2
Source
Applied Measurement in Education: 8
Author
Ben-Simon, Anat: 1
Bovaird, James A.: 1
Boyer, Michelle: 1
Cohen, Allan: 1
Cohen, Yoav: 1
Donoghue, John R.: 1
Eckerly, Carol: 1
Ferrara, Steve: 1
Hambleton, Ronald K.: 1
Hawley, Leslie R.: 1
Kieftenbeld, Vincent: 1
Publication Type
Journal Articles: 8
Reports - Research: 6
Reports - Evaluative: 2
Education Level
Elementary Education: 2
Early Childhood Education: 1
Elementary Secondary Education: 1
Grade 1: 1
Grade 10: 1
Grade 2: 1
Grade 3: 1
Grade 5: 1
Grade 7: 1
Grade 8: 1
High Schools: 1
Assessments and Surveys
Stanford Achievement Tests: 1
Donoghue, John R.; Eckerly, Carol – Applied Measurement in Education, 2024
Trend scoring of constructed-response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that this difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
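The sampling distinction in this abstract is easy to see in simulation. Below is a minimal sketch (not the authors' code; the category counts and rescore probabilities are made up) contrasting a product multinomial, where each Time A score category is rescored independently with its count fixed, with a single multinomial over all cells of the two-way table.

```python
# Hypothetical illustration: product multinomial vs. ordinary multinomial
# sampling for a Time A x Time B trend-scoring table.
import numpy as np

rng = np.random.default_rng(0)
K = 4                                          # score categories (assumed)
row_totals = np.array([120, 300, 240, 90])     # fixed Time A counts (made up)
P = np.full((K, K), 0.1) + np.eye(K) * 0.6     # row-conditional rescore probs
P /= P.sum(axis=1, keepdims=True)

# Product multinomial: each Time A row is an independent multinomial draw.
table_pm = np.stack([rng.multinomial(n, P[i]) for i, n in enumerate(row_totals)])

# Ordinary multinomial: one draw over all K*K cells with joint probabilities.
joint = (row_totals / row_totals.sum())[:, None] * P
table_mn = rng.multinomial(row_totals.sum(), joint.ravel()).reshape(K, K)

# Under the product multinomial the row margins are fixed by design;
# under the multinomial they vary from sample to sample.
print("product-multinomial row margins:", table_pm.sum(axis=1))
print("multinomial row margins:       ", table_mn.sum(axis=1))
```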
Kieftenbeld, Vincent; Boyer, Michelle – Applied Measurement in Education, 2017
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to…
Descriptors: Automation, Scoring, Comparative Analysis, Test Items
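A toy illustration of the comparison problem this abstract raises: with several automated raters scored against humans on several items, two defensible ranking procedures (pooled mean agreement vs. mean per-item rank) need not order the raters the same way. Everything below is simulated and hypothetical, not the paper's method.

```python
# Hypothetical simulation: rank 3 automated raters against human scores
# on 5 items using two different ranking procedures.
import numpy as np

rng = np.random.default_rng(1)
n_raters, n_items, n_resp = 3, 5, 200
human = rng.integers(0, 4, size=(n_items, n_resp))        # human scores, 0-3

# Each automated rater copies the human score with rater- and item-specific
# hit rates, so agreement varies across items.
hit_rate = 0.6 + 0.3 * rng.random((n_raters, n_items))
hit = rng.random((n_raters, n_items, n_resp)) < hit_rate[..., None]
auto = np.where(hit, human, rng.integers(0, 4, size=(n_raters, n_items, n_resp)))

agree = (auto == human).mean(axis=2)                      # raters x items

ranks = agree.argsort(axis=0).argsort(axis=0)             # per-item rater ranks
order_pooled = np.argsort(-agree.mean(axis=1))            # procedure 1
order_ranked = np.argsort(-ranks.mean(axis=1))            # procedure 2
print("order by pooled mean agreement:", order_pooled)
print("order by mean per-item rank:   ", order_ranked)
```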
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Cohen, Yoav; Levi, Effi; Ben-Simon, Anat – Applied Measurement in Education, 2018
In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By…
Descriptors: Test Validity, Automation, Scoring, Computer Assisted Testing
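A minimal sketch of the design as described in the abstract, with hypothetical features and a stand-in ridge model rather than the study's actual AES system: many raters per essay are averaged to approximate the true score, and the automated model is evaluated out-of-sample via cross-validation against that criterion.

```python
# Hypothetical stand-in for the study's design: multi-rater criterion plus
# cross-validated machine scores. Features and model are invented.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n_essays, n_raters, n_feats = 250, 15, 20
true_score = rng.normal(0, 1, n_essays)
ratings = true_score[:, None] + rng.normal(0, 0.7, (n_essays, n_raters))
criterion = ratings.mean(axis=1)          # approximation to the true score

# Stand-in essay features correlated with quality (purely illustrative).
X = true_score[:, None] * rng.normal(1, 0.2, (1, n_feats)) \
    + rng.normal(0, 1.0, (n_essays, n_feats))

aes_scores = cross_val_predict(Ridge(alpha=1.0), X, criterion, cv=5)
print("AES vs. multi-rater criterion r =",
      np.corrcoef(aes_scores, criterion)[0, 1].round(3))
```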
Steedle, Jeffrey T.; Ferrara, Steve – Applied Measurement in Education, 2016
As an alternative to rubric scoring, comparative judgment generates essay scores by aggregating decisions about the relative quality of the essays. Comparative judgment eliminates certain scorer biases and potentially reduces training requirements, thereby allowing a large number of judges, including teachers, to participate in essay evaluation.…
Descriptors: Essays, Scoring, Comparative Analysis, Evaluators
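One common way to turn comparative judgments into essay scores, sketched below on simulated data, is a Bradley-Terry model fitted with Hunter's MM updates. The abstract does not specify the paper's exact aggregation procedure, so this is an assumed stand-in.

```python
# Hypothetical comparative-judgment pipeline: simulate pairwise decisions,
# then recover essay scores with Bradley-Terry (Hunter's MM algorithm).
import numpy as np

rng = np.random.default_rng(3)
n_essays, n_pairs = 20, 600
quality = rng.normal(0, 1, n_essays)               # latent essay quality

i = rng.integers(0, n_essays, n_pairs)
j = rng.integers(0, n_essays, n_pairs)
i, j = i[i != j], j[i != j]
p_win = 1 / (1 + np.exp(-(quality[i] - quality[j])))
i_wins = rng.random(i.size) < p_win                # judge picks the better essay

wins = np.zeros(n_essays)
np.add.at(wins, i[i_wins], 1)
np.add.at(wins, j[~i_wins], 1)
wins = np.maximum(wins, 0.5)                       # continuity guard for zero-win essays
n_meet = np.zeros((n_essays, n_essays))
np.add.at(n_meet, (i, j), 1)
np.add.at(n_meet, (j, i), 1)

theta = np.ones(n_essays)                          # MM updates for BT strengths
for _ in range(200):
    denom = n_meet / (theta[:, None] + theta[None, :])
    np.fill_diagonal(denom, 0)
    theta = wins / denom.sum(axis=1)
    theta /= theta.mean()

print("correlation of estimated scores with true quality:",
      np.corrcoef(np.log(theta), quality)[0, 1].round(3))
```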
Hawley, Leslie R.; Bovaird, James A.; Wu, ChaoRong – Applied Measurement in Education, 2017
Value-added assessment methods have been criticized by researchers and policy makers for a number of reasons. One issue is the sensitivity of model results across different outcome measures. This study examined the utility of incorporating multivariate latent variable approaches within a traditional value-added framework. We evaluated the…
Descriptors: Value Added Models, Reliability, Multivariate Analysis, Scaling
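The sensitivity issue the abstract describes can be illustrated with a deliberately simple value-added model (ordinary least squares on simulated data, not the study's latent-variable approach): two parallel outcome measures of the same construct can yield different school rankings.

```python
# Hypothetical illustration: value-added as school-mean residuals from a
# pretest regression, computed for two parallel outcome measures.
import numpy as np

rng = np.random.default_rng(4)
n_schools, n_per = 30, 50
school = np.repeat(np.arange(n_schools), n_per)
effect = rng.normal(0, 0.3, n_schools)             # true school effects
pretest = rng.normal(0, 1, n_schools * n_per)

def value_added(outcome):
    b = np.polyfit(pretest, outcome, 1)            # regress outcome on pretest
    resid = outcome - np.polyval(b, pretest)
    return np.array([resid[school == s].mean() for s in range(n_schools)])

# Two outcome measures of the same construct, each with its own error.
y1 = 0.5 * pretest + effect[school] + rng.normal(0, 0.8, school.size)
y2 = 0.5 * pretest + effect[school] + rng.normal(0, 0.8, school.size)

va1, va2 = value_added(y1), value_added(y2)
print("agreement between the two school rankings r =",
      np.corrcoef(va1.argsort().argsort(), va2.argsort().argsort())[0, 1].round(2))
```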
Taylor, Melinda Ann; Pastor, Dena A. – Applied Measurement in Education, 2013
Although federal regulations require testing students with severe cognitive disabilities, there is little guidance regarding how technical quality should be established. It is known that challenges exist with documenting the reliability of scores for alternate assessments. Typical measures of reliability do little to model multiple sources…
Descriptors: Generalizability Theory, Alternative Assessment, Test Reliability, Scores
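For readers unfamiliar with generalizability theory, here is a minimal sketch of the idea on simulated persons-by-raters data (a fully crossed design is assumed): score variance is partitioned into person, rater, and residual components, from which a generalizability coefficient follows.

```python
# Hypothetical crossed persons-by-raters G-study: estimate variance
# components from ANOVA mean squares and form a G coefficient.
import numpy as np

rng = np.random.default_rng(5)
n_p, n_r = 100, 4
person = rng.normal(0, 1.0, (n_p, 1))              # universe scores
rater = rng.normal(0, 0.3, (1, n_r))               # rater severity
X = person + rater + rng.normal(0, 0.6, (n_p, n_r))

grand = X.mean()
ss_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum()
ss_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum()
ss_e = ((X - grand) ** 2).sum() - ss_p - ss_r
ms_p, ms_r = ss_p / (n_p - 1), ss_r / (n_r - 1)
ms_e = ss_e / ((n_p - 1) * (n_r - 1))

var_e = ms_e                                       # expected-mean-square solutions
var_p = (ms_p - ms_e) / n_r
var_r = (ms_r - ms_e) / n_p
g_coef = var_p / (var_p + var_e / n_r)             # relative G, mean of n_r raters
print(f"var(person)={var_p:.2f} var(rater)={var_r:.2f} "
      f"var(resid)={var_e:.2f} G={g_coef:.2f}")
```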

Hambleton, Ronald K.; Plake, Barbara S. – Applied Measurement in Education, 1995
Several extensions to the Angoff method of standard setting are described that can accommodate characteristics of performance-based assessment. A study involving 12 panelists supported the effectiveness of the new approach but suggested that panelists preferred an approach that was at least partially conjunctive. (SLD)
Descriptors: Educational Assessment, Evaluation Methods, Evaluators, Interrater Reliability
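For context, here is the classic Angoff logic plus a partially conjunctive variant like the one the panelists reportedly preferred, sketched on hypothetical ratings: each panelist estimates the probability that a minimally competent examinee succeeds on each item, the cut score is the panel mean of each panelist's summed estimates, and the conjunctive rule adds per-section floors.

```python
# Hypothetical Angoff standard setting with a partially conjunctive rule.
# All ratings, sections, and the slack parameter are invented.
import numpy as np

rng = np.random.default_rng(6)
n_panelists, n_items = 12, 40
section = np.repeat(np.arange(4), n_items // 4)    # four sections (assumed)
ratings = np.clip(rng.normal(0.6, 0.15, (n_panelists, n_items)), 0, 1)

overall_cut = ratings.sum(axis=1).mean()           # classic Angoff cut score
section_cuts = np.array([ratings[:, section == s].sum(axis=1).mean()
                         for s in range(4)])

def passes(item_scores, slack=2.0):
    """Compensatory overall rule plus a conjunctive per-section floor."""
    by_section = np.array([item_scores[section == s].sum() for s in range(4)])
    return item_scores.sum() >= overall_cut and (by_section >= section_cuts - slack).all()

examinee = np.clip(rng.normal(0.65, 0.3, n_items), 0, 1)   # simulated item scores
print("overall cut:", overall_cut.round(1), "section cuts:", section_cuts.round(1))
print("passes under partially conjunctive rule:", passes(examinee))
```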