Showing 1 to 15 of 18 results
Peer reviewed
PDF on ERIC Download full text
Doewes, Afrizal; Kurdhi, Nughthoh Arfawi; Saxena, Akrati – International Educational Data Mining Society, 2023
Automated Essay Scoring (AES) tools aim to improve the efficiency and consistency of essay scoring by using machine learning algorithms. In the existing research work on this topic, most researchers agree that human-automated score agreement remains the benchmark for assessing the accuracy of machine-generated scores. To measure the performance of…
Descriptors: Essays, Writing Evaluation, Evaluators, Accuracy
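A note on the agreement benchmark this abstract refers to: quadratic weighted kappa (QWK) is the statistic most often reported for human-automated score agreement in the AES literature. The sketch below is a minimal NumPy implementation under that assumption; the function name and toy scores are illustrative, not taken from the paper.

```python
import numpy as np

def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """Quadratic weighted kappa between two integer score vectors.

    kappa = 1 - sum(W * O) / sum(W * E), where O is the observed
    confusion matrix, E the expected matrix under independent
    marginals, and W[i, j] = ((i - j) / (k - 1)) ** 2.
    """
    human = np.asarray(human) - min_rating
    machine = np.asarray(machine) - min_rating
    k = max_rating - min_rating + 1

    # Observed confusion matrix, normalized to proportions.
    O = np.zeros((k, k))
    for h, m in zip(human, machine):
        O[h, m] += 1
    O /= O.sum()

    # Expected matrix from the outer product of the marginals.
    E = np.outer(O.sum(axis=1), O.sum(axis=0))

    # Quadratic disagreement weights: distant disagreements cost more.
    i, j = np.indices((k, k))
    W = ((i - j) / (k - 1)) ** 2

    return 1.0 - (W * O).sum() / (W * E).sum()

# Toy example: essay scores on a 1-6 rubric.
human_scores   = [3, 4, 4, 2, 5, 3, 4]
machine_scores = [3, 4, 3, 2, 5, 4, 4]
print(quadratic_weighted_kappa(human_scores, machine_scores, 1, 6))
```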
Yarbro, Jeffrey T.; Olney, Andrew M. – Grantee Submission, 2021
This paper explores the concept of dynamically generating definitions using a deep-learning model. We do this by creating a dataset that contains definition entries and contexts associated with each definition. We then fine-tune a GPT-2 based model on the dataset to allow the model to generate contextual definitions. We evaluate our model with…
Descriptors: Definitions, Learning Processes, Models, Context Effect
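For readers who want a concrete picture of the fine-tuning step described above, here is a minimal sketch using the Hugging Face transformers library. The record format, separator string, and hyperparameters are assumptions for illustration; the paper's actual dataset construction and training setup may differ.

```python
# Sketch: fine-tuning GPT-2 on (context, term, definition) records.
# The record layout and "<DEF>" separator are invented for illustration.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          Trainer, TrainingArguments)
import torch

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

records = [  # hypothetical training records
    {"context": "The cell membrane regulates transport.",
     "term": "membrane",
     "definition": "a thin layer enclosing a cell or organelle"},
]

class DefinitionDataset(torch.utils.data.Dataset):
    def __init__(self, records):
        # One training string per record: context + term + target definition.
        texts = [f"{r['context']} <DEF> {r['term']}: {r['definition']}"
                 + tokenizer.eos_token for r in records]
        self.enc = tokenizer(texts, truncation=True, max_length=128,
                             padding="max_length", return_tensors="pt")

    def __len__(self):
        return self.enc["input_ids"].size(0)

    def __getitem__(self, i):
        ids = self.enc["input_ids"][i]
        # Causal LM objective: labels mirror inputs. A fuller setup would
        # mask pad positions to -100 so they are excluded from the loss.
        return {"input_ids": ids,
                "attention_mask": self.enc["attention_mask"][i],
                "labels": ids.clone()}

args = TrainingArguments(output_dir="gpt2-definitions",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args,
        train_dataset=DefinitionDataset(records)).train()
```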
Peer reviewed
Direct link
Conger, Anthony J. – Educational and Psychological Measurement, 2017
Drawing parallels to classical test theory, this article clarifies the difference between rater accuracy and reliability and demonstrates how category marginal frequencies affect rater agreement and Cohen's kappa. Category assignment paradigms are developed: comparing raters to a standard (index) versus comparing two raters to one another…
Descriptors: Interrater Reliability, Evaluators, Accuracy, Statistical Analysis
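The distinction this abstract draws can be made concrete. Cohen's kappa is κ = (p_o − p_e)/(1 − p_e), where the chance-agreement term p_e is computed from the category marginals, so two rater pairs with identical observed agreement can receive quite different kappas. A small sketch with invented confusion matrices:

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a rater-by-rater category confusion matrix."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    p_obs = np.trace(p)                    # observed agreement
    p_exp = p.sum(axis=1) @ p.sum(axis=0)  # chance agreement from marginals
    return (p_obs - p_exp) / (1 - p_exp)

# Both tables show 80% observed agreement; only the marginals differ.
balanced = [[40, 10],
            [10, 40]]
skewed   = [[75,  5],
            [15,  5]]
print(cohens_kappa(balanced))  # ~0.60
print(cohens_kappa(skewed))    # ~0.23
```

The skewed marginals raise p_e from .50 to .74, dropping kappa from .60 to about .23 even though raw agreement is unchanged.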
Peer reviewed
Direct link
Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018
The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…
Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators
Peer reviewed
PDF on ERIC Download full text
Rupp, André A.; Casabianca, Jodi M.; Krüger, Maleika; Keller, Stefan; Köller, Olaf – ETS Research Report Series, 2019
In this research report, we describe the design and empirical findings for a large-scale study of essay writing ability with approximately 2,500 high school students in Germany and Switzerland on the basis of 2 tasks with 2 associated prompts, each from a standardized writing assessment whose scoring involved both human and automated components.…
Descriptors: Automation, Foreign Countries, English (Second Language), Language Tests
Peer reviewed
Direct link
Wang, Wen-Chung; Su, Chi-Ming; Qiu, Xue-Lan – Journal of Educational Measurement, 2014
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning…
Descriptors: Item Response Theory, Interrater Reliability, Models, Correlation
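The local dependence described above can be illustrated by simulation: when two ratings of the same response share a response-level effect (as when raters confer before scoring), inter-rater correlation rises even though neither rating tracks the true quality any better. A toy NumPy simulation, with all parameter values invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                               # number of item responses

truth = rng.normal(0, 1, n)            # true quality of each response
bundle = rng.normal(0, 0.7, n)         # shared response-level effect
                                       # (e.g., raters conferring)

def two_ratings(shared):
    """Two ratings per response; `shared` switches the bundle effect on."""
    common = bundle if shared else 0.0
    r1 = truth + common + rng.normal(0, 1, n)
    r2 = truth + common + rng.normal(0, 1, n)
    return r1, r2

for shared in (False, True):
    r1, r2 = two_ratings(shared)
    # Inter-rater correlation is often read as reliability; the shared
    # bundle effect inflates it without adding information about `truth`.
    print(shared,
          np.corrcoef(r1, r2)[0, 1].round(2),     # rater 1 vs. rater 2
          np.corrcoef(r1, truth)[0, 1].round(2))  # rater 1 vs. truth
```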
Peer reviewed
PDF on ERIC Download full text
Kelcey, Ben; Wang, Shanshan; Cox, Kyle – Society for Research on Educational Effectiveness, 2016
Valid and reliable measurement of unobserved latent variables is essential to understanding and improving education. A common and persistent approach to assessing latent constructs in education is the use of rater inferential judgment. The purpose of this study is to develop high-dimensional explanatory random item effects models designed for…
Descriptors: Test Items, Models, Evaluators, Longitudinal Studies
Peer reviewed
Direct link
Wind, Stefanie A.; Engelhard, George, Jr.; Wesolowski, Brian – Educational Assessment, 2016
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to…
Descriptors: Evaluators, Interrater Reliability, Academic Achievement, Music Education
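As background to the entry above: under the MFR model the adjacent-category log-odds decompose additively, ln(P_k / P_{k−1}) = θ_n − δ_i − λ_j − τ_k for student n, item i, rater j, and threshold k. The sketch below evaluates category probabilities at given parameter values; the values are invented for illustration, and operational MFR estimation (e.g., in Facets) is iterative rather than closed-form.

```python
import numpy as np

def mfr_category_probs(theta, delta, lam, taus):
    """Rating-category probabilities under a Many-Facet Rasch model.

    ln(P_k / P_{k-1}) = theta - delta - lam - taus[k-1], so P_k is
    proportional to exp(sum of the first k adjacent-category logits).
    """
    logits = theta - delta - lam - np.asarray(taus)
    # Cumulative sums give unnormalized log-probabilities; category 0
    # has cumulative sum 0 by convention.
    log_num = np.concatenate(([0.0], np.cumsum(logits)))
    probs = np.exp(log_num - log_num.max())  # stabilize before normalizing
    return probs / probs.sum()

# Invented example: able student, middling item, severe rater,
# four categories (0-3) with three thresholds.
probs = mfr_category_probs(theta=1.0, delta=0.2, lam=0.5,
                           taus=[-1.0, 0.0, 1.0])
print(probs.round(3), "expected score:",
      (probs * np.arange(4)).sum().round(2))
```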
Goe, Laura; Holdheide, Lynn; Miller, Tricia – Center on Great Teachers and Leaders, 2014
Across the nation, states and districts are in the process of building better teacher evaluation systems that not only identify highly effective teachers but also systematically provide data and feedback that can be used to improve teacher practice. The "Practical Guide to Designing Comprehensive Teacher Evaluation Systems" is a tool…
Descriptors: Teacher Evaluation, Evaluators, Educational Change, Accountability
Peer reviewed
Direct link
Raymond, Mark R.; Harik, Polina; Clauser, Brian E. – Applied Psychological Measurement, 2011
Prior research indicates that the overall reliability of performance ratings can be improved by using ordinary least squares (OLS) regression to adjust for rater effects. The present investigation extends previous work by evaluating the impact of OLS adjustment on standard errors of measurement ("SEM") at specific score levels. In…
Descriptors: Performance Based Assessment, Licensing Examinations (Professions), Least Squares Statistics, Item Response Theory
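The OLS adjustment this study extends can be sketched as a dummy-variable regression: regress observed ratings on examinee and rater indicators, read the rater coefficients as severity/leniency effects, and subtract each rater's estimated effect from the ratings that rater gave. The design below is a simplified illustration, not the article's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
n_exam, n_raters = 200, 5

ability = rng.normal(70, 8, n_exam)
severity = np.array([-3.0, -1.0, 0.0, 1.5, 2.5])  # invented rater effects

# Each examinee is scored by two randomly assigned raters.
rows = [(e, r) for e in range(n_exam)
        for r in rng.choice(n_raters, size=2, replace=False)]
exam_idx = np.array([e for e, _ in rows])
rater_idx = np.array([r for _, r in rows])
scores = ability[exam_idx] + severity[rater_idx] + rng.normal(0, 3, len(rows))

# Design matrix: examinee dummies plus rater dummies (rater 0 as the
# reference category to avoid collinearity with the examinee block).
X = np.zeros((len(rows), n_exam + n_raters - 1))
X[np.arange(len(rows)), exam_idx] = 1.0
mask = rater_idx > 0
X[np.where(mask)[0], n_exam + rater_idx[mask] - 1] = 1.0

beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
rater_effects = np.concatenate(([0.0], beta[n_exam:]))
rater_effects -= rater_effects.mean()          # center effects for reporting

adjusted = scores - rater_effects[rater_idx]   # OLS-adjusted ratings
print("estimated rater effects:", rater_effects.round(2))
```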
Peer reviewed
PDF on ERIC Download full text
Zhang, Mo; Breyer, F. Jay; Lorenz, Florian – ETS Research Report Series, 2013
In this research, we investigated the suitability of implementing "e-rater"® automated essay scoring in a high-stakes large-scale English language testing program. We examined the effectiveness of generic scoring and 2 variants of prompt-based scoring approaches. Effectiveness was evaluated on a number of dimensions, including agreement…
Descriptors: Computer Assisted Testing, Computer Software, Scoring, Language Tests
Peer reviewed
Direct link
Cullen, Anne; Coryn, Chris L. S. – Journal of MultiDisciplinary Evaluation, 2011
Background: Since the late 1970s participatory approaches have been widely promoted to evaluate international development programs. However, there is no universal agreement of what is meant by participatory evaluation. For some evaluators, participatory evaluations involve the extensive participation of all stakeholder groups (from donor to…
Descriptors: Economic Development, Global Approach, International Programs, Program Evaluation
Goe, Laura; Holdheide, Lynn; Miller, Tricia – National Comprehensive Center for Teacher Quality, 2011
Across the nation, states and districts are in the process of building better teacher evaluation systems that not only identify highly effective teachers but also systematically provide data and feedback that can be used to improve teacher practice. "A Practical Guide to Designing Comprehensive Teacher Evaluation Systems" is a tool…
Descriptors: Feedback (Response), Teacher Effectiveness, Evaluators, Teacher Evaluation
Linacre, John M. – 1989
An accepted criterion for gauging the fairness of examinees' scores derived from judge-awarded ratings has been the size of the correlation between judges, that is, the inter-rater reliability. Various means of achieving inter-rater reliability are reviewed, and a model for measuring it is put forward. Both theoretical and…
Descriptors: Evaluators, Interrater Reliability, Latent Trait Theory, Licensing Examinations (Professions)
Kenyon, Dorry; Stansfield, Charles W. – 1993
This paper examines whether individuals who train themselves to score a performance assessment will rate acceptably when compared to known standards. Research on the efficacy of rater self-training materials developed by the Center for Applied Linguistics for the Texas Oral Proficiency Test (TOPT) is examined. Rater self-training materials are described…
Descriptors: Bilingual Education, Comparative Analysis, Evaluators, Individual Characteristics