ERIC - Search Results

Showing all 13 results

Exploring Difficult-to-Score Essays with a Hyperbolic Cosine Accuracy Model and Coh-Metrix Indices

Peer reviewed

Wang, Jue; Engelhard, George; Combs, Trenton – Journal of Experimental Education, 2023

Unfolding models are frequently used to develop scales for measuring attitudes. Recently, unfolding models have been applied to examine rater severity and accuracy within the context of rater-mediated assessments. One of the problems in applying unfolding models to rater-mediated assessments is that the substantive interpretations of the latent…

Descriptors: Writing Evaluation, Scoring, Accuracy, Computational Linguistics

Detecting Rater Centrality Effects in Performance Assessments: A Model-Based Comparison of Centrality Indices

Peer reviewed

Jin, Kuan-Yu; Eckes, Thomas – Measurement: Interdisciplinary Research and Perspectives, 2022

Recent research on rater effects in performance assessments has increasingly focused on rater centrality, the tendency to assign scores clustering around the rating scale's middle categories. In the present paper, we adopted Jin and Wang's (2018) extended facets modeling approach and constructed a centrality continuum, ranging from raters…

Descriptors: Performance Based Assessment, Evaluators, Scoring, Sample Size

Scoring Difficulty in Summary Writing Assessment: Toward the Reconstruction of Analytic Rubric

Peer reviewed

Makiko Kato – Journal of Education and Learning, 2025

This study aims to examine whether differences exist in the factors influencing the difficulty of scoring English summaries and determining scores based on the raters' attributes, and to collect candid opinions, considerations, and tentative suggestions for future improvements to the analytic rubric of summary writing for English learners. In this…

Descriptors: Writing Evaluation, Scoring, Writing Skills, English (Second Language)

Operationalizing the Reading-into-Writing Construct in Analytic Rating Scales: Effects of Different Approaches on Rating

Peer reviewed

Lestari, Santi B.; Brunfaut, Tineke – Language Testing, 2023

Assessing integrated reading-into-writing task performances is known to be challenging, and analytic rating scales have been found to better facilitate the scoring of these performances than other common types of rating scales. However, little is known about how specific operationalizations of the reading-into-writing construct in analytic rating…

Descriptors: Reading Writing Relationship, Writing Tests, Rating Scales, Writing Processes

Application of an Automated Essay Scoring Engine to English Writing Assessment Using Many-Facet Rasch Measurement

Peer reviewed

Chan, Kinnie Kin Yee; Bond, Trevor; Yan, Zi – Language Testing, 2023

We investigated the relationship between the scores assigned by an Automated Essay Scoring (AES) system, the Intelligent Essay Assessor (IEA), and grades allocated by trained, professional human raters to English essay writing by instigating two procedures novel to written-language assessment: the logistic transformation of AES raw scores into…

Descriptors: Computer Assisted Testing, Essays, Scoring, Scores

Developing a Comprehensive Decoding Instruction Observation Protocol for Special Education Teachers

Peer reviewed

Moylan, Laura A.; Johnson, Evelyn S.; Zheng, Yuzhu – Reading & Writing Quarterly, 2022

This study describes the development of a special education teacher observation protocol detailing the elements of effective decoding instruction. The psychometric properties of the protocol were investigated through many-facet Rasch measurement (MFRM). Video observations of classroom decoding instruction from 20 special education teachers across…

Descriptors: Decoding (Reading), Special Education Teachers, Psychometrics, Video Technology

Low Inter-Rater Reliability of a High Stakes Performance Assessment of Teacher Candidates

Peer reviewed

Lyness, Scott A.; Peterson, Kent; Yates, Kenneth – Education Sciences, 2021

The Performance Assessment for California Teachers (PACT) is a high stakes summative assessment that was designed to measure pre-service teacher readiness. We examined the inter-rater reliability (IRR) of trained PACT evaluators who rated 19 candidates. As measured by Cohen's weighted kappa, the overall IRR estimate was 0.17 (poor strength of…

Descriptors: High Stakes Tests, Performance Based Assessment, Teacher Effectiveness, Academic Language

The State of the Sunshine State Standards: Florida's B.E.S.T. Edition

Friedberg, Solomon; Shanahan, Tim; Fennell, Francis; Fisher, Douglas; Howe, Roger – Thomas B. Fordham Institute, 2020

A decade ago, states across the nation adopted the Common Core State Standards (CCSS) in an effort to raise the academic bar for their students. This has provoked countless political battles since then--including an especially intense one in Florida. That fight culminated in Governor Ron DeSantis vowing to eliminate and replace the Common Core,…

Descriptors: Common Core State Standards, Academic Achievement, Benchmarking, Academic Standards

Rating Written Performance: What Do Raters Do and Why?

Peer reviewed

Kuiken, Folkert; Vedder, Ineke – Language Testing, 2014

This study investigates the relationship in L2 writing between raters' judgments of communicative adequacy and linguistic complexity by means of six-point Likert scales, and general measures of linguistic performance. The participants were 39 learners of Italian and 32 of Dutch, who wrote two short argumentative essays. The same writing tasks…

Descriptors: Writing Evaluation, Second Language Learning, Evaluators, Native Language

Defining Minimal Competence.

Peer reviewed

Mills, Craig N.; And Others – Educational Measurement: Issues and Practice, 1991

An approach is presented to the definition of minimal competence for judges to use in standard setting. Panelists in standard setting must receive training to ensure that differences in rating result from differences in perceptions of item difficulty, not from differences of opinion about the definition of minimal competence. (SLD)

Descriptors: Cutting Scores, Decision Making, Definitions, Difficulty Level

Training Judges to Generate Standard-Setting Data.

Peer reviewed

Reid, Jerry B. – Educational Measurement: Issues and Practice, 1991

Training judges to generate item ratings in standard setting once the reference group has been defined is discussed. It is proposed that sensitivity to the factors that determine difficulty can be improved through training. Three criteria for determining when training is sufficient are offered. (SLD)

Descriptors: Computer Assisted Instruction, Difficulty Level, Evaluators, Interrater Reliability

Effects of Item Context on Intrajudge Consistency of Expert Judgments via the Nedelsky Standard Setting Method.

Peer reviewed

Plake, Barbara S.; Melican, Gerald J. – Educational and Psychological Measurement, 1989

The impact of overall test length and difficulty on the expert judgments of item performance by the Nedelsky method was studied. Five university-level instructors predicting the performance of minimally competent candidates on a mathematics examination were fairly consistent in their assessments regardless of length or difficulty of the test.…

Descriptors: Difficulty Level, Estimation (Mathematics), Evaluators, Higher Education

The Relationship between Modified Angoff Knowledge Estimation Judgments and Item Difficulty Values for Seven NTE Specialty Area Tests.

Wheeler, Patricia – 1991

The appropriateness of the Angoff method (W. H. Angoff, 1971) for setting standards on tests was studied. Evaluators (judges) from California school districts and teacher training institutions reviewed 15 NTE (National Teacher Examinations) Program Specialty Area Tests published by the Educational Testing Service for their appropriateness in…

Descriptors: Art Education, Biology, Difficulty Level, Elementary Secondary Education
