Showing 1 to 15 of 28 results
Peer reviewed
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring of constructed-response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product-multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
Peer reviewed
Hogan, Thomas; DeStefano, Marissa; Gilby, Caitlin; Kosman, Dana; Peri, Joshua – Applied Measurement in Education, 2021
Buros' "Mental Measurements Yearbook (MMY)" has provided professional reviews of commercially published psychological and educational tests for over 80 years. It serves as a kind of conscience for the testing industry. For a random sample of 50 entries in the "19th MMY" (a total of 100 separate reviews), this study determined…
Descriptors: Test Reviews, Interrater Reliability, Psychological Testing, Educational Testing
Peer reviewed
Bimpeh, Yaw; Pointer, William; Smith, Ben Alexander; Harrison, Liz – Applied Measurement in Education, 2020
Many high-stakes examinations in the United Kingdom (UK) use both constructed-response items and selected-response items. We need to evaluate the inter-rater reliability for constructed-response items that are scored by humans. While there are a variety of methods for evaluating rater consistency across ratings in the psychometric literature, we…
Descriptors: Scoring, Generalizability Theory, Interrater Reliability, Foreign Countries
Peer reviewed
Myers, Aaron J.; Ames, Allison J.; Leventhal, Brian C.; Holzman, Madison A. – Applied Measurement in Education, 2020
When rating performance assessments, raters may ascribe different scores to the same performance when rubric application does not align with the intended application of the scoring criteria. Given that performance assessment score interpretation assumes raters apply rubrics as rubric developers intended, misalignment between raters' scoring processes…
Descriptors: Scoring Rubrics, Validity, Item Response Theory, Interrater Reliability
Peer reviewed
Kieftenbeld, Vincent; Boyer, Michelle – Applied Measurement in Education, 2017
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to…
Descriptors: Automation, Scoring, Comparative Analysis, Test Items
Peer reviewed
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Peer reviewed
Cohen, Yoav; Levi, Effi; Ben-Simon, Anat – Applied Measurement in Education, 2018
In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By…
Descriptors: Test Validity, Automation, Scoring, Computer Assisted Testing
Peer reviewed
Wyse, Adam E. – Applied Measurement in Education, 2018
This article discusses regression effects that are commonly observed in Angoff ratings, where panelists tend to judge hard items as easier, and easy items as more difficult, than the estimated item difficulties indicate. Analyses of data from two credentialing exams illustrate these regression effects and the…
Descriptors: Regression (Statistics), Test Items, Difficulty Level, Licensing Examinations (Professions)
Peer reviewed
Kahraman, Nilufer; Brown, Crystal B. – Applied Measurement in Education, 2015
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess the measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests to test measurement invariance…
Descriptors: Factor Analysis, Structural Equation Models, Medical Students, Performance Based Assessment
Peer reviewed
McGrane, Joshua Aaron; Humphry, Stephen Mark; Heldsinger, Sandra – Applied Measurement in Education, 2018
National standardized assessment programs have increasingly included extended written performances, amplifying the need for reliable, valid, and efficient methods of assessment. This article examines a two-stage method using comparative judgments and calibrated exemplars as a complement and alternative to existing methods of assessing writing.…
Descriptors: Standardized Tests, Foreign Countries, Writing Tests, Writing Evaluation
Peer reviewed
Powers, Donald E.; Escoffery, David S.; Duchnowski, Matthew P. – Applied Measurement in Education, 2015
By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the "gold standard." Our objective was to refine this model and apply it to…
Descriptors: Essays, Test Scoring Machines, Program Validation, Criterion Referenced Tests
Peer reviewed
Sinha, Ruchi; Oswald, Frederick; Imus, Anna; Schmitt, Neal – Applied Measurement in Education, 2011
The current study examines how using a multidimensional battery of predictors (high-school grade point average (GPA), SAT/ACT, and biodata), and weighting the predictors based on the different values institutions place on various student performance dimensions (college GPA, organizational citizenship behaviors (OCBs), and behaviorally anchored…
Descriptors: Grade Point Average, Interrater Reliability, Rating Scales, College Admission
Peer reviewed
Yin, Yue; Shavelson, Richard J. – Applied Measurement in Education, 2008
In the first part of this article, the use of Generalizability (G) theory in examining the dependability of concept map assessment scores and designing a concept map assessment for a particular practical application is discussed. In the second part, the application of G theory is demonstrated by comparing the technical qualities of two frequently…
Descriptors: Generalizability Theory, Concept Mapping, Validity, Reliability
Peer reviewed
Hogan, Thomas P.; Murphy, Gavin – Applied Measurement in Education, 2007
We determined the recommendations for preparing and scoring constructed-response (CR) test items in 25 sources (textbooks and chapters) on educational and psychological measurement. The project was similar to Haladyna's (2004) analysis for multiple-choice items. We identified 12 recommendations for preparing CR items given by multiple sources,…
Descriptors: Test Items, Scoring, Test Construction, Educational Indicators
Peer reviewed
Webb, Norman L. – Applied Measurement in Education, 2007
A process for judging the alignment between curriculum standards and assessments developed by the author is presented. This process produces information on the relationship of standards and assessments on four alignment criteria: Categorical Concurrence, Depth of Knowledge Consistency, Range of Knowledge Correspondence, and Balance of…
Descriptors: Educational Assessment, Academic Standards, Item Analysis, Interrater Reliability