ERIC - Search Results

Showing all 14 results

Evaluating the Consistency and Reliability of Attribution Methods in Automated Short Answer Grading (ASAG) Systems: Toward an Explainable Scoring System

Peer reviewed

Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025

In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…

Descriptors: Automation, Grading, Computer Assisted Testing, Scoring

A Note on the Use of Categorical Subscores

Peer reviewed

Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025

Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…

Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment

Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment

Peer reviewed

Dorsey, David W.; Michaels, Hillary R. – Journal of Educational Measurement, 2022

We have dramatically advanced our ability to create rich, complex, and effective assessments across a range of uses through technology advancement. Artificial Intelligence (AI) enabled assessments represent one such area of advancement--one that has captured our collective interest and imagination. Scientists and practitioners within the domains…

Descriptors: Validity, Ethics, Artificial Intelligence, Evaluation Methods

Judges' Use of Examinee Performance Data in an Angoff Standard-Setting Exercise for a Medical Licensing Examination: An Experimental Study

Peer reviewed

Clauser, Brian E.; Mee, Janet; Baldwin, Su G.; Margolis, Melissa J.; Dillon, Gerard F. – Journal of Educational Measurement, 2009

Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple-choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide…

Descriptors: Standard Setting (Scoring), Program Effectiveness, Expertise, Health Personnel

Monitoring Rater Performance over Time: A Framework for Detecting Differential Accuracy and Differential Scale Category Use

Peer reviewed

Myford, Carol M.; Wolfe, Edward W. – Journal of Educational Measurement, 2009

In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition…

Descriptors: English Literature, Advanced Placement, Measures (Individuals), Writing (Composition)

Comment on Rowley's Paper, Historical Antecedents of the Standard-Setting Debate: An Inside Account of The Minimal-Beardedness Controversy.

Peer reviewed

Livingston, Samuel A. – Journal of Educational Measurement, 1982

To set a standard on the "beardedness" test (see TM 507 062), the probability that a student with a specific score will be judged as bearded must be estimated for each test score. To get an unbiased estimate of that probability, a representative sample of students at each test score level must be chosen. (BW)

Descriptors: Cutting Scores, Evaluation Methods, Graduation Requirements, Minimum Competency Testing

Response to Livingston's Comment on Minimal Beardedness.

Peer reviewed

Rowley, Glenn L. – Journal of Educational Measurement, 1982

Livingston's (TM 507 218) response to Rowley (TM 507 062) is compared with the original Zieky and Livingston formulation of the Contrasting Groups Method of setting standards. (BW)

Descriptors: Cutting Scores, Evaluation Methods, Graduation Requirements, Minimum Competency Testing

An Application of Score Equity Assessment: Invariance of Linkage of New SAT® to Old SAT across Gender Groups

Peer reviewed

Liu, Jinghua; Cahn, Miriam F.; Dorans, Neil J. – Journal of Educational Measurement, 2006

The College Board's SAT® data are used to illustrate how the score equity assessment (SEA) can help inform the program about equatability. SEA is used to examine whether the content change(s) to the revised new SAT result in differential linking functions across gender groups. Results of population sensitivity analyses are reported on the…

Descriptors: Aptitude Tests, Comparative Analysis, Gender Differences, Scores

A Comparison of Three Methods for Establishing Minimum Standards on the National Teacher Examinations.

Peer reviewed

Cross, Lawrence H.; And Others – Journal of Educational Measurement, 1984

Minimum standards were established for the National Teacher Examinations (NTE) by teacher educators instructed in the use of the Angoff, Nedelsky, or Jaeger procedures. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest the Angoff procedure yields the most defensible standards for the NTE area…

Descriptors: Analysis of Variance, Cutting Scores, Evaluation Methods, Occupational Tests

Evaluation of Procedure-Based Scoring for Hands-On Science Assessment.

Peer reviewed

Baxter, Gail P.; And Others – Journal of Educational Measurement, 1992

A procedure-based observational scoring system and a notebook completed by students were evaluated as science assessments for 41 fifth grade students experienced in hands-on science and 55 fifth grade students inexperienced in hands-on science. Results suggest that notebooks may be a reasonable, although less reliable, surrogate for observed…

Descriptors: Classroom Observation Techniques, Comparative Analysis, Educational Assessment, Elementary School Students

A Comparison of Procedures to Assess Written Language Skills at Grades 4, 7, and 10.

Peer reviewed

Moss, Pamela A.; And Others – Journal of Educational Measurement, 1982

Scores on a multiple-choice language test involving recognition of language errors were related to those on writing samples, scored atomistically for the same language errors and holistically for communicative effectiveness and correctness. Results suggest the need for clear limits in generalizing from one assessment to others. (Author/GK)

Descriptors: Comparative Analysis, Elementary Secondary Education, Evaluation Methods, Grade 10

Establishing Minimum Standards for Essays: Blind Versus Informed Reviews.

Peer reviewed

Cross, Lawrence H.; And Others – Journal of Educational Measurement, 1985

This study evaluated procedures for establishing a minimum performance standard for the essay subtest of the National Teacher Examinations Communications Skills test. Results indicated the preferred procedure for setting standards on essays should involve a blind review followed by an informed review. (Author/DWH)

Descriptors: Beginning Teachers, Cutting Scores, Essay Tests, Evaluation Methods

Equating Minimum-Competency Tests: Comparison of Methods.

Peer reviewed

Hills, John R.; And Others – Journal of Educational Measurement, 1988

Five methods of equating minimum-competency tests were compared using the Florida Statewide Student Assessment Test, Part II, for 1984 and 1986. Four of five methods yielded essentially comparable results for the highest scoring 84% of the students. Different lengths of anchor items were compared, using the concurrent item response theory equating…

Descriptors: Comparative Analysis, Equated Scores, Evaluation Methods, Graduation Requirements

Scoring a Performance-Based Assessment by Modeling the Judgments of Experts.

Peer reviewed

Clauser, Brian E.; And Others – Journal of Educational Measurement, 1995

A scoring algorithm for performance assessments is described that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented for clinicians evaluating skills of 280 medical students. Results demonstrate the usefulness of the algorithm. (SLD)

Descriptors: Algorithms, Clinical Diagnosis, Computer Simulation, Educational Assessment
