Showing all 14 results
Peer reviewed
Wallace N. Pinto Jr.; Jinnie Shin – Journal of Educational Measurement, 2025
In recent years, the application of explainability techniques to automated essay scoring and automated short-answer grading (ASAG) models, particularly those based on transformer architectures, has gained significant attention. However, the reliability and consistency of these techniques remain underexplored. This study systematically investigates…
Descriptors: Automation, Grading, Computer Assisted Testing, Scoring
Peer reviewed
Kylie Gorney; Sandip Sinharay – Journal of Educational Measurement, 2025
Although there exists an extensive amount of research on subscores and their properties, limited research has been conducted on categorical subscores and their interpretations. In this paper, we focus on the claim of Feinberg and von Davier that categorical subscores are useful for remediation and instructional purposes. We investigate this claim…
Descriptors: Tests, Scores, Test Interpretation, Alternative Assessment
Peer reviewed
Dorsey, David W.; Michaels, Hillary R. – Journal of Educational Measurement, 2022
We have dramatically advanced our ability to create rich, complex, and effective assessments across a range of uses through technology advancement. Artificial Intelligence (AI) enabled assessments represent one such area of advancement--one that has captured our collective interest and imagination. Scientists and practitioners within the domains…
Descriptors: Validity, Ethics, Artificial Intelligence, Evaluation Methods
Peer reviewed
Clauser, Brian E.; Mee, Janet; Baldwin, Su G.; Margolis, Melissa J.; Dillon, Gerard F. – Journal of Educational Measurement, 2009
Although the Angoff procedure is among the most widely used standard setting procedures for tests comprising multiple-choice items, research has shown that subject matter experts have considerable difficulty accurately making the required judgments in the absence of examinee performance data. Some authors have viewed the need to provide…
Descriptors: Standard Setting (Scoring), Program Effectiveness, Expertise, Health Personnel
Peer reviewed
Myford, Carol M.; Wolfe, Edward W. – Journal of Educational Measurement, 2009
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition…
Descriptors: English Literature, Advanced Placement, Measures (Individuals), Writing (Composition)
Peer reviewed
Livingston, Samuel A. – Journal of Educational Measurement, 1982
To set a standard on the "beardedness" test (see TM 507 062) the probability that a student with a specific score will be judged as bearded must be estimated for each test score. To get an unbiased estimate of that probability, a representative sample of students at each test score level must be chosen. (BW)
Descriptors: Cutting Scores, Evaluation Methods, Graduation Requirements, Minimum Competency Testing
Peer reviewed
Rowley, Glenn L. – Journal of Educational Measurement, 1982
Livingston's (TM 507 218) response to Rowley (TM 507 062) is compared with the original Zieky and Livingston formulation of the Contrasting Groups Method of setting standards. (BW)
Descriptors: Cutting Scores, Evaluation Methods, Graduation Requirements, Minimum Competency Testing
Peer reviewed
Liu, Jinghua; Cahn, Miriam F.; Dorans, Neil J. – Journal of Educational Measurement, 2006
The College Board's SAT® data are used to illustrate how the score equity assessment (SEA) can help inform the program about equatability. SEA is used to examine whether the content change(s) to the revised new SAT result in differential linking functions across gender groups. Results of population sensitivity analyses are reported on the…
Descriptors: Aptitude Tests, Comparative Analysis, Gender Differences, Scores
Peer reviewed
Cross, Lawrence H.; And Others – Journal of Educational Measurement, 1984
Minimum standards were established for the National Teacher Examinations (NTE) by teacher educators instructed in the use of the Angoff, Nedelsky, or Jaeger procedures. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest the Angoff procedure yields the most defensible standards for the NTE area…
Descriptors: Analysis of Variance, Cutting Scores, Evaluation Methods, Occupational Tests
Peer reviewed
Baxter, Gail P.; And Others – Journal of Educational Measurement, 1992
A procedure-based observational scoring system and a notebook completed by students were evaluated as science assessments for 41 fifth grade students experienced in hands-on science and 55 fifth grade students inexperienced in hands-on science. Results suggest that notebooks may be a reasonable, although less reliable, surrogate for observed…
Descriptors: Classroom Observation Techniques, Comparative Analysis, Educational Assessment, Elementary School Students
Peer reviewed
Moss, Pamela A.; And Others – Journal of Educational Measurement, 1982
Scores on a multiple-choice language test involving recognition of language errors were related to those on writing samples, scored atomistically for the same language errors and holistically for communicative effectiveness and correctness. Results suggest the need for clear limits in generalizing from one assessment to others. (Author/GK)
Descriptors: Comparative Analysis, Elementary Secondary Education, Evaluation Methods, Grade 10
Peer reviewed
Cross, Lawrence H.; And Others – Journal of Educational Measurement, 1985
This study evaluated procedures for establishing a minimum performance standard for the essay subtest of the National Teacher Examinations Communications Skills test. Results indicated the preferred procedure for setting standards on essays should involve a blind review followed by an informed review. (Author/DWH)
Descriptors: Beginning Teachers, Cutting Scores, Essay Tests, Evaluation Methods
Peer reviewed
Hills, John R.; And Others – Journal of Educational Measurement, 1988
Five methods of equating minimum-competency tests were compared using the Florida Statewide Student Assessment Test, Part II, for 1984 and 1986. Four of five methods yielded essentially comparable results for the highest scoring 84% of the students. Different lengths of anchor items were compared, using the concurrent item response theory equating…
Descriptors: Comparative Analysis, Equated Scores, Evaluation Methods, Graduation Requirements
Peer reviewed
Clauser, Brian E.; And Others – Journal of Educational Measurement, 1995
A scoring algorithm for performance assessments is described that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented for clinicians evaluating skills of 280 medical students. Results demonstrate the usefulness of the algorithm. (SLD)
Descriptors: Algorithms, Clinical Diagnosis, Computer Simulation, Educational Assessment