Showing all 15 results
Peer reviewed
Direct link
Jonas Flodén – British Educational Research Journal, 2025
This study compares how the generative AI (GenAI) large language model (LLM) ChatGPT and human teachers perform when grading university exams. Aspects investigated include consistency, large discrepancies, and length of answer. Implications for higher education, including the role of teachers and ethics, are also discussed. Three…
Descriptors: College Faculty, Artificial Intelligence, Comparative Testing, Scoring
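Flodén's abstract turns on grader consistency and on flagging large discrepancies between ChatGPT and teacher grades. As a purely illustrative sketch (the grades, the 0-4 scale, and the two-step discrepancy threshold below are invented assumptions, not the study's data or method), quadratic-weighted kappa is one standard way to quantify agreement between two graders of ordinal exam marks:

```python
# Purely illustrative: grades, scale, and the discrepancy threshold are
# invented assumptions, not Floden's data or method.
import numpy as np

def quadratic_weighted_kappa(a, b, n_cats):
    """Quadratic-weighted kappa, a standard consistency measure for ordinal grades."""
    obs = np.zeros((n_cats, n_cats))
    for x, y in zip(a, b):
        obs[x, y] += 1
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    w = np.array([[(i - j) ** 2 for j in range(n_cats)] for i in range(n_cats)])
    w = w / w.max()
    return 1 - (w * obs).sum() / (w * exp).sum()

teacher = [0, 1, 2, 2, 3, 4, 3, 1, 2, 4]   # grades on a 0-4 scale (invented)
chatgpt = [0, 1, 2, 3, 3, 4, 1, 1, 2, 3]

print("QWK:", round(quadratic_weighted_kappa(teacher, chatgpt, 5), 3))
# Flag "large discrepancies" -- here, two or more grade steps apart
flags = [i for i, (t, g) in enumerate(zip(teacher, chatgpt)) if abs(t - g) >= 2]
print("Large discrepancies at exams:", flags)
```

Quadratic weighting penalizes a two-grade disagreement four times as heavily as a one-grade disagreement, which suits ordinal exam marks.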
Peer reviewed
Direct link
John R. Donoghue; Carol Eckerly – Applied Measurement in Education, 2024
Trend scoring of constructed-response items (i.e., rescoring Time A responses at Time B) gives rise to two-way data that follow a product multinomial distribution rather than the multinomial distribution that is usually assumed. Recent work has shown that the difference in sampling model can have profound negative effects on statistics usually used to…
Descriptors: Scoring, Error of Measurement, Reliability, Scoring Rubrics
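For orientation, trend scoring yields a two-way table of Time A scores by Time B scores, and the agreement statistics Donoghue and Eckerly critique are usually computed from exactly that table. A minimal sketch with invented scores (the comment on sampling states the abstract's point):

```python
# Invented scores: the two-way table that trend scoring produces, with the
# agreement statistics the abstract alludes to. Under product-multinomial
# sampling the Time A row totals are fixed by the rescoring design, which is
# what undermines the usual multinomial-based inferences.
import numpy as np

time_a = [0, 1, 1, 2, 2, 2, 3, 3, 1, 0]   # original scores (invented)
time_b = [0, 1, 2, 2, 2, 1, 3, 2, 1, 0]   # same responses rescored at Time B

k = 4                                      # score categories 0..3
table = np.zeros((k, k))
for a, b in zip(time_a, time_b):
    table[a, b] += 1

p = table / table.sum()
exact = np.trace(p)                        # proportion exact agreement
chance = (p.sum(axis=1) * p.sum(axis=0)).sum()
kappa = (exact - chance) / (1 - chance)    # Cohen's kappa
print(f"exact agreement = {exact:.2f}, kappa = {kappa:.2f}")
```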
Peer reviewed
Direct link
Lin, Chih-Kai – Language Testing, 2017
Sparse-rated data are common in operational performance-based language tests, as an inevitable result of assigning examinee responses to a fraction of available raters. The current study investigates the precision of two generalizability-theory methods (i.e., the rating method and the subdividing method) specifically designed to accommodate the…
Descriptors: Data Analysis, Language Tests, Generalizability Theory, Accuracy
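Lin's rating and subdividing methods are designed for sparse rater assignments, and this listing does not detail them; the sketch below shows only the fully crossed persons-by-raters baseline that such methods generalize, with invented ratings:

```python
# Invented ratings for a *fully crossed* persons x raters design -- the
# baseline that sparse-data methods like Lin's generalize. Variance components
# come from the usual ANOVA mean squares.
import numpy as np

scores = np.array([[3, 4, 3],             # rows = examinees, columns = raters
                   [2, 2, 3],
                   [5, 4, 4],
                   [1, 2, 1],
                   [4, 4, 5]], dtype=float)
n_p, n_r = scores.shape

grand = scores.mean()
ms_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
resid = scores - scores.mean(axis=1, keepdims=True) - scores.mean(axis=0) + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

var_p = (ms_p - ms_pr) / n_r              # universe-score (person) variance
var_pr = ms_pr                            # person x rater interaction + error
g_coef = var_p / (var_p + var_pr / n_r)   # relative G coefficient, 3 raters
print(f"G coefficient = {g_coef:.3f}")
```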
Peer reviewed
Direct link
Methe, Scott A.; Briesch, Amy M.; Hulac, David – Assessment for Effective Intervention, 2015
At present, it is unclear whether math curriculum-based measurement (M-CBM) procedures provide a dependable measure of student progress in math computation because support for its technical properties is based largely upon a body of correlational research. Recent investigations into the dependability of M-CBM scores have found that evaluating…
Descriptors: Measurement Techniques, Error of Measurement, Mathematics Curriculum, Curriculum Based Assessment
Peer reviewed
Direct link
Yarnell, Jordy B.; Pfeiffer, Steven I. – Journal of Psychoeducational Assessment, 2015
The present study examined the psychometric equivalence of administering a computer-based version of the Gifted Rating Scale (GRS) compared with the traditional paper-and-pencil GRS-School Form (GRS-S). The GRS-S is a teacher-completed rating scale used in gifted assessment. The GRS-Electronic Form provides an alternative method of administering…
Descriptors: Gifted, Psychometrics, Rating Scales, Computer Assisted Testing
Peer reviewed
PDF on ERIC
Liu, Sha; Kunnan, Antony John – CALICO Journal, 2016
This study investigated the application of "WriteToLearn" to Chinese undergraduate English majors' essays in terms of its scoring ability and the accuracy of its error feedback. Participants were 163 second-year English majors from a university located in Sichuan province who wrote 326 essays from two writing prompts. Each paper was…
Descriptors: Foreign Countries, Undergraduate Students, English (Second Language), Second Language Learning
Haertel, Edward H. – Educational Testing Service, 2013
Policymakers and school administrators have embraced value-added models of teacher effectiveness as tools for educational improvement. Teacher value-added estimates may be viewed as complicated scores of a certain kind. This suggests using a test validation model to examine their reliability and validity. Validation begins with an interpretive…
Descriptors: Reliability, Validity, Inferences, Teacher Effectiveness
Peer reviewed
Direct link
Casabianca, Jodi M.; McCaffrey, Daniel F.; Gitomer, Drew H.; Bell, Courtney A.; Hamre, Bridget K.; Pianta, Robert C. – Educational and Psychological Measurement, 2013
Classroom observation of teachers is a significant part of educational measurement; measurements of teacher practice are being used in teacher evaluation systems across the country. This research investigated whether observations made live in the classroom and from video recording of the same lessons yielded similar inferences about teaching.…
Descriptors: Secondary School Mathematics, Mathematics Instruction, Classroom Observation Techniques, Algebra
Peer reviewed
Direct link
Gagnon, Robert; Lubarsky, Stuart; Lambert, Carole; Charlin, Bernard – Advances in Health Sciences Education, 2011
The Script Concordance Test (SCT) uses a panel-based, aggregate scoring method that aims to capture the variability of responses of experienced practitioners to particular clinical situations. The use of this type of scoring method is a key determinant of the tool's discriminatory power, but deviant answers could potentially diminish the…
Descriptors: Expertise, Oncology, Scoring, Error of Measurement
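The SCT's aggregate scoring rule is simple enough to sketch: an examinee earns partial credit in proportion to how many panelists chose the same response, relative to the modal panel response. The panel data below are invented.

```python
# Sketch of the panel-based aggregate scoring rule SCTs use; the panel
# responses below are invented. Credit for a response = (number of panelists
# choosing it) / (count for the modal panel response).
from collections import Counter

def sct_item_key(panel_answers):
    """Map each response option to its partial credit for one item."""
    counts = Counter(panel_answers)
    modal = max(counts.values())
    return {resp: n / modal for resp, n in counts.items()}

panel = [1, 1, 1, 0, 0, -1]         # six panelists on a Likert-type anchor
key = sct_item_key(panel)
print(key)                          # {1: 1.0, 0: ~0.67, -1: ~0.33}
print("credit for 0:", key.get(0, 0.0))
```

The sketch also makes the abstract's concern concrete: a single deviant panelist creates nonzero credit for an otherwise implausible response, which can dilute the tool's discriminatory power.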
Peer reviewed
Direct link
Kachchaf, Rachel; Solano-Flores, Guillermo – Applied Measurement in Education, 2012
We examined how rater language background affects the scoring of short-answer, open-ended test items in the assessment of English language learners (ELLs). Four native English and four native Spanish-speaking certified bilingual teachers scored 107 responses of fourth- and fifth-grade Spanish-speaking ELLs to mathematics items administered in…
Descriptors: Error of Measurement, English Language Learners, Scoring, Bilingual Teachers
Lengh, Carolyn J. – ProQuest LLC, 2010
This study compares the dependability of four classroom assessment scoring methods. Generalizability (G) and decision (D) studies are used to analyze students' classroom assessment scores and to compare the four scoring methods on rater-by-person variance and on the level of the G and D coefficients…
Descriptors: Generalizability Theory, Scoring, Social Studies, Tests
Peer reviewed
PDF on ERIC
Haberman, Shelby J.; Sinharay, Sandip; Puhan, Gautam – ETS Research Report Series, 2006
Recently, there has been an increasing level of interest in reporting subscores. This paper examines the issue of reporting subscores at an aggregate level, especially at the level of the institutions to which the examinees belong. A series of statistical analyses is suggested to determine when subscores at the institutional level have any added value…
Descriptors: Scores, Statistical Analysis, Error of Measurement, Reliability
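The added-value question has a compact later-published form (Haberman's PRMSE criterion): a subscore is worth reporting only if it estimates its own true score more precisely than the total score does. A minimal individual-level sketch under classical test theory assumptions, with invented numbers (the report itself extends this logic to institution-level subscores):

```python
# Invented numbers; the formulation follows Haberman's later published PRMSE
# criterion rather than anything quoted in this abstract. A subscore adds
# value only if it predicts its own true score better than the total does.
def subscore_has_added_value(rel_sub, rel_total, true_corr):
    """rel_sub, rel_total: reliabilities; true_corr: correlation between the
    *true* subscore and the *true* total score (disattenuated)."""
    prmse_from_subscore = rel_sub                      # subscore-based estimate
    prmse_from_total = (true_corr ** 2) * rel_total    # total-based estimate
    return prmse_from_subscore > prmse_from_total

# A short, modestly reliable subscore that correlates .95 with the total in
# the true-score metric adds nothing beyond the total score:
print(subscore_has_added_value(rel_sub=0.70, rel_total=0.92, true_corr=0.95))  # False
```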
Allen, Sally; Sudweeks, Richard R. – 2001
A study was conducted to identify local item dependence (LID) in the context-dependent item sets used in an examination prepared for use in an introductory university physics class and to assess the effects of LID on estimates of the reliability and standard error of measurement. Test scores were obtained for 487 students in the physics class. The…
Descriptors: College Students, Error of Measurement, Higher Education, Physics
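Allen and Sudweeks's concern can be illustrated by simulation: when items within a context-dependent set share passage-specific variance (local item dependence), item-level coefficient alpha is inflated and the standard error of measurement is understated; rescoring each set as a single testlet is the usual corrective. The simulation below is an invented illustration, not the study's data; only the examinee count mirrors the abstract.

```python
# Invented simulation (only the examinee count mirrors the abstract): items
# within a context-dependent set share passage variance, so item-level alpha
# is inflated and the SEM understated relative to testlet-level scoring.
import numpy as np

rng = np.random.default_rng(0)
n, sets, items_per = 487, 6, 4
theta = rng.normal(size=(n, 1))                       # examinee ability
passage = 0.8 * rng.normal(size=(n, sets))            # shared passage effect (LID)
logits = theta + np.repeat(passage, items_per, axis=1)
items = (rng.random((n, sets * items_per)) < 1 / (1 + np.exp(-logits))).astype(float)

def alpha(parts):
    """Cronbach's alpha from part scores (items or testlets)."""
    k = parts.shape[1]
    return k / (k - 1) * (1 - parts.var(axis=0, ddof=1).sum()
                          / parts.sum(axis=1).var(ddof=1))

total_sd = items.sum(axis=1).std(ddof=1)
testlets = items.reshape(n, sets, items_per).sum(axis=2)  # each set as one unit
for label, a in [("item-level", alpha(items)), ("testlet-level", alpha(testlets))]:
    print(f"{label}: alpha = {a:.3f}, SEM = {total_sd * np.sqrt(1 - a):.2f}")
```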
Livingston, Samuel A.; Lewis, Charles – 1993
This paper presents a method for estimating the accuracy and consistency of classifications based on test scores. The scores can be produced by any scoring method, including the formation of a weighted composite. The estimates use data from a single form. The reliability of the score is used to estimate its effective test length in terms of…
Descriptors: Classification, Error of Measurement, Estimation (Mathematics), Reliability
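The first step of the Livingston-Lewis method converts the score's reliability into an effective test length in hypothetical parallel dichotomous items; a beta-binomial true-score model of that length then yields classification accuracy and consistency at each cut score. A sketch using the formula from the published version of the method, with invented inputs:

```python
# Invented inputs; the effective-test-length formula is from the published
# version of the Livingston-Lewis method. A beta-binomial true-score model of
# this length then yields classification accuracy and consistency at each cut.
def effective_test_length(mean, var, x_min, x_max, reliability):
    return (((mean - x_min) * (x_max - mean) - reliability * var)
            / (var * (1 - reliability)))

n_eff = effective_test_length(mean=31.5, var=42.0, x_min=0, x_max=50,
                              reliability=0.88)
print(round(n_eff), "effective dichotomous items")
```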
Longford, Nicholas T. – 1994
A case is presented for adjusting the scores for free-response items in the Advanced Placement (AP) tests. Using information about the rating process from the reliability studies, administrations of the AP test in three subject areas (psychology, computer science, and English language and composition) are analyzed. In the reliability studies, 299…
Descriptors: Advanced Placement, Computer Science, English, Error of Measurement
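As a toy illustration of the kind of adjustment Longford argues for (a crude fixed-effects version with invented ratings, not his variance-component model, and it ignores that raters score different, non-equivalent sets of essays):

```python
# A crude fixed-effects toy, not Longford's variance-component model; it also
# ignores that raters score non-equivalent sets of essays. Ratings invented.
import numpy as np

raters = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])    # which rater scored each essay
scores = np.array([4., 5., 3., 4., 6., 7., 5., 3., 6.])

overall = scores.mean()
severity = np.array([scores[raters == r].mean() for r in np.unique(raters)]) - overall
adjusted = scores - severity[raters]     # remove each rater's leniency/harshness
print("severity:", severity.round(2))
print("adjusted:", adjusted.round(2))
```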