ERIC - Search Results

Publication Date

In 2025	0
Since 2024	0
Since 2021 (last 5 years)	0
Since 2016 (last 10 years)	3
Since 2006 (last 20 years)	6

Descriptor

Evaluation Methods	12
Interrater Reliability	7
Test Reliability	5
Performance Based Assessment	4
Student Evaluation	4
Test Validity	4
Educational Assessment	3
Evaluators	3
Item Response Theory	3
Validity	3
Decision Making	2
Grade 7	2
Measurement Techniques	2
Standard Setting (Scoring)	2
Standards	2
Statistical Analysis	2
Teacher Evaluation	2
Test Bias	2
Writing Evaluation	2
Academic Achievement	1
Accuracy	1
Bayesian Statistics	1
College Admission	1
College Entrance Examinations	1
College Students	1
More ▼

Source

Applied Measurement in…

Publication Type

Journal Articles	12
Reports - Evaluative	6
Reports - Research	6

Education Level

Higher Education	2
Elementary Education	1
Grade 7	1
Postsecondary Education	1

Audience

Location

Australia	1
United States	1

Laws, Policies, & Programs

Assessments and Surveys

SAT (College Admission Test)

What Works Clearinghouse Rating

Showing all 12 results Save | Export

Validating Rubric Scoring Processes: An Application of an Item Response Tree Model

Peer reviewed

Direct link

Myers, Aaron J.; Ames, Allison J.; Leventhal, Brian C.; Holzman, Madison A. – Applied Measurement in Education, 2020

When rating performance assessments, raters may ascribe different scores for the same performance when rubric application does not align with the intended application of the scoring criteria. Given performance assessment score interpretation assumes raters apply rubrics as rubric developers intended, misalignment between raters' scoring processes…

Descriptors: Scoring Rubrics, Validity, Item Response Theory, Interrater Reliability

Appraising the Scoring Performance of Automated Essay Scoring Systems--Some Additional Considerations: Which Essays? Which Human Raters? Which Scores?

Peer reviewed

Direct link

Raczynski, Kevin; Cohen, Allan – Applied Measurement in Education, 2018

The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are…

Descriptors: Essay Tests, Test Scoring Machines, Test Validity, Evaluators

Applying a Thurstonian, Two-Stage Method in the Standardized Assessment of Writing

Peer reviewed

Direct link

McGrane, Joshua Aaron; Humphry, Stephen Mark; Heldsinger, Sandra – Applied Measurement in Education, 2018

National standardized assessment programs have increasingly included extended written performances, amplifying the need for reliable, valid, and efficient methods of assessment. This article examines a two-stage method using comparative judgments and calibrated exemplars as a complement and alternative to existing methods of assessing writing.…

Descriptors: Standardized Tests, Foreign Countries, Writing Tests, Writing Evaluation

Using Testlet Response Theory to Examine Local Dependence in C-Tests

Peer reviewed

Direct link

Eckes, Thomas; Baghaei, Purya – Applied Measurement in Education, 2015

C-tests are gap-filling tests widely used to assess general language proficiency for purposes of placement, screening, or provision of feedback to language learners. C-tests consist of several short texts in which parts of words are missing. We addressed the issue of local dependence in C-tests using an explicit modeling approach based on testlet…

Descriptors: Language Proficiency, Language Tests, Item Response Theory, Test Reliability

Impact of Design Effects in Large-Scale District and State Assessments

Peer reviewed

Direct link

Phillips, Gary W. – Applied Measurement in Education, 2015

This article proposes that sampling design effects have potentially huge unrecognized impacts on the results reported by large-scale district and state assessments in the United States. When design effects are unrecognized and unaccounted for they lead to underestimating the sampling error in item and test statistics. Underestimating the sampling…

Descriptors: State Programs, Sampling, Research Design, Error of Measurement

Criterion-Focused Approach to Reducing Adverse Impact in College Admissions

Peer reviewed

Direct link

Sinha, Ruchi; Oswald, Frederick; Imus, Anna; Schmitt, Neal – Applied Measurement in Education, 2011

The current study examines how using a multidimensional battery of predictors (high-school grade point average (GPA), SAT/ACT, and biodata), and weighting the predictors based on the different values institutions place on various student performance dimensions (college GPA, organizational citizenship behaviors (OCBs), and behaviorally anchored…

Descriptors: Grade Point Average, Interrater Reliability, Rating Scales, College Admission

Score Resolution: An Investigation of the Reliability and Validity of Resolved Scores

Peer reviewed

Direct link

Johnson, Robert L.; Penny, Jim; Fisher, Steve; Kuhs, Therese – Applied Measurement in Education, 2003

When raters assign different scores to a performance task, a method for resolving rating differences is required to report a single score to the examinee. Recent studies indicate that decisions about examinees, such as pass/fail decisions, differ across resolution methods. Previous studies also investigated the interrater reliability of…

Descriptors: Test Reliability, Test Validity, Scores, Interrater Reliability

Performance of SIBTEST When the Percentage of DIF Items Is Large

Peer reviewed

Direct link

Gierl, Mark J.; Gotzmann, Andrea; Boughton, Keith A. – Applied Measurement in Education, 2004

Differential item functioning (DIF) analyses are used to identify items that operate differently between two groups, after controlling for ability. The Simultaneous Item Bias Test (SIBTEST) is a popular DIF detection method that matches examinees on a true score estimate of ability. However in some testing situations, like test translation and…

Descriptors: True Scores, Simulation, Test Bias, Student Evaluation

Using an Extended Angoff Procedure to Set Standards on Complex Performance Assessments.

Peer reviewed

Hambleton, Ronald K.; Plake, Barbara S. – Applied Measurement in Education, 1995

Several extensions to the Angoff method of standard setting are described that can accommodate characteristics of performance-based assessment. A study involving 12 panelists supported the effectiveness of the new approach but suggested that panelists preferred an approach that was at least partially conjunctive. (SLD)

Descriptors: Educational Assessment, Evaluation Methods, Evaluators, Interrater Reliability

Setting Performance Standards through Two-Stage Judgmental Policy Capturing.

Peer reviewed

Jaeger, Richard M. – Applied Measurement in Education, 1995

A performance-standard setting procedure termed judgmental policy capturing (JPC) and its application are described. A study involving 12 panelists demonstrated the feasibility of the JPC method for setting performance standards for classroom teachers seeking certification from the National Board for Professional Teaching Standards. (SLD)

Descriptors: Decision Making, Educational Assessment, Evaluation Methods, Evaluators

The Development and Use of Literacy Portfolios for Students, Classes, and Teachers.

Peer reviewed

Valencia, Sheila W.; Calfee, Robert – Applied Measurement in Education, 1991

Using portfolios in assessing literacy is explored, considering student portfolios and the teacher's class portfolio. Portfolio assessment is a valuable complement to externally mandated tests, but technical issues must be addressed if the portfolio movement is to survive. Portfolios must be linked to the broader task of instructional improvement.…

Descriptors: Academic Achievement, Educational Assessment, Educational Improvement, Elementary School Teachers

How to Evaluate the Legal Defensibility of High-Stakes Tests.

Peer reviewed

Mehrens, William A.; Popham, W. James – Applied Measurement in Education, 1992

This paper discusses how to determine whether a test was developed in a legally defensible manner, reviewing general issues, specific cases bearing on different types of test use, some evaluative dimensions, and evidence of test quality. Tests constructed and used according to existing standards will generally stand legal scrutiny. (SLD)

Descriptors: College Entrance Examinations, Compliance (Legal), Constitutional Law, Court Litigation

Ames, Allison J.	1
Baghaei, Purya	1
Boughton, Keith A.	1
Calfee, Robert	1
Cohen, Allan	1
Eckes, Thomas	1
Fisher, Steve	1
Gierl, Mark J.	1
Gotzmann, Andrea	1
Hambleton, Ronald K.	1
Heldsinger, Sandra	1
Holzman, Madison A.	1
Humphry, Stephen Mark	1
Imus, Anna	1
Jaeger, Richard M.	1
Johnson, Robert L.	1
Kuhs, Therese	1
Leventhal, Brian C.	1
McGrane, Joshua Aaron	1
Mehrens, William A.	1
Myers, Aaron J.	1
Oswald, Frederick	1
Penny, Jim	1
Phillips, Gary W.	1
Plake, Barbara S.	1
More ▼