ERIC Number: EJ1488405
Record Type: Journal
Publication Date: 2025
Pages: 26
Abstractor: As Provided
ISBN: N/A
ISSN: ISSN-1560-4292
EISSN: EISSN-1560-4306
Available Date: 2024-10-18
Can LLMs Grade Open Response Reading Comprehension Questions? An Empirical Study Using the ROARs Dataset
Owen Henkel (1); Libby Hills (2); Bill Roberts (3); Joshua McGrane (4)
International Journal of Artificial Intelligence in Education, v35 n2 p651-676 2025
Formative assessment plays a critical role in improving learning outcomes by providing feedback on student mastery. Open-ended questions, which require students to produce multi-word, nontrivial responses, are a popular tool for formative assessment as they provide more specific insights into what students do and do not know. However, grading open-ended questions can be time-consuming and susceptible to errors, leading teachers to resort to simpler question formats or conduct fewer formative assessments. While there has been a longstanding interest in automating the grading of short answer questions, previous approaches have been technically complex, limiting their use in formative assessment contexts. The newest generation of Large Language Models (LLMs) potentially makes grading short answer questions more feasible, as the models are flexible and easier to use. This paper addresses the lack of empirical research on the potential of the newest generation of LLMs for grading short answer questions in two ways. First, it introduces a novel dataset of short answer reading comprehension questions, drawn from a battery of reading assessments conducted with over 150 students in Ghana. This dataset allows for the evaluation of LLMs in a new context, as they are predominantly designed and trained on data from high-income countries. Second, the paper empirically evaluates how well various configurations of generative LLMs can grade student short answer responses compared to expert human raters. The findings show that GPT-4, with minimal prompt engineering, performed extremely well on the novel dataset (QWK 0.91, F1 0.87), reaching near-parity with expert human raters. To our knowledge, this work is the first to empirically evaluate the performance of generative LLMs on short answer reading comprehension questions using real student data. These findings suggest that generative LLMs could be used to grade formative assessment tasks, potentially benefiting real-world educational settings.
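The abstract reports agreement between GPT-4-assigned grades and expert human raters in terms of quadratic weighted kappa (QWK 0.91) and F1 (0.87). The following is a minimal illustrative sketch, not the authors' code, of how such agreement metrics can be computed with scikit-learn; the grade labels and data below are hypothetical placeholders standing in for the ROARs responses.

```python
# Illustrative sketch (not the authors' code): scoring an LLM grader against
# an expert human rater using the two agreement metrics named in the abstract,
# quadratic weighted kappa (QWK) and F1. The grade data here are hypothetical.
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical binary grades (1 = acceptable answer, 0 = not acceptable)
# assigned by an expert human rater and by the LLM for the same responses.
human_grades = [1, 0, 1, 1, 0, 1, 0, 1]
llm_grades   = [1, 0, 1, 0, 0, 1, 0, 1]

# Quadratic weighted kappa: chance-corrected agreement that penalises
# larger disagreements more heavily (relevant for ordinal grade scales).
qwk = cohen_kappa_score(human_grades, llm_grades, weights="quadratic")

# F1: harmonic mean of precision and recall on the positive ("correct") label.
f1 = f1_score(human_grades, llm_grades)

print(f"QWK = {qwk:.2f}, F1 = {f1:.2f}")
```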
Descriptors: Artificial Intelligence, Grading, Reading Comprehension, Natural Language Processing, Questioning Techniques, Formative Evaluation, Foreign Countries, Reading Tests, Test Items
Springer. Available from: Springer Nature. One New York Plaza, Suite 4600, New York, NY 10004. Tel: 800-777-4643; Tel: 212-460-1500; Fax: 212-460-1700; e-mail: customerservice@springernature.com; Web site: https://link.springer.com/
Publication Type: Journal Articles; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Identifiers - Location: Ghana
Grant or Contract Numbers: N/A
Author Affiliations: (1) University of Oxford, Department of Education, Oxford, UK; (2) Jacobs Foundation, Zürich, Switzerland; (3) Legible Labs, New York City, USA; (4) University of Melbourne, Graduate School of Education, Melbourne, Australia

Peer reviewed
