Peer reviewed
ERIC Number: ED674426
Record Type: Non-Journal
Publication Date: 2025-Jul-15
Pages: 15
Abstractor: As Provided
ISBN: N/A
ISSN: N/A
EISSN: N/A
Available Date: N/A
Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
Grantee Submission, Paper presented at the International Conference on Artificial Intelligence in Education (26th, Palermo, Italy, Jul 22-26, 2025)
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS), in which student knowledge and pedagogical strategies are explicitly modeled. We propose a prompt-variation framework to assess the adaptivity and pedagogical soundness of LLM-generated instructional moves across 75 real-world tutoring scenarios drawn from an ITS. We systematically remove key context components (e.g., student errors and knowledge components) from prompts to create variations of each scenario, and three representative LLMs (Llama3-8B, Llama3-70B, and GPT-4o) generate 1,350 instructional moves. We use text embeddings and randomization tests to measure how omitting each context feature shifts the LLMs' outputs (adaptivity), and a validated tutor-training classifier to evaluate response quality (pedagogical soundness). Surprisingly, even the best-performing model only marginally mimics the adaptivity of the ITS; specifically, Llama3-70B demonstrates statistically significant adaptivity to student errors. Although Llama3-8B's recommendations receive higher pedagogical soundness scores than the other models', it struggles with instruction-following behaviors, including output formatting. By contrast, GPT-4o reliably adheres to instructions but tends to provide overly direct feedback, diverging from effective tutoring practices such as prompting learners with open-ended questions to gauge their knowledge. Given these results, we discuss why current LLM-based tutoring is unlikely to produce learning benefits rivaling those of known-to-be-effective ITS tutoring. Through our open-source benchmarking code, we contribute a reproducible method for evaluating LLMs' instructional adaptivity and fidelity.
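The abstract only sketches the measurement pipeline, so the following minimal Python sketch illustrates the general embedding-plus-randomization-test idea: compare outputs generated with and without a context component, and test whether their embedding-space separation exceeds chance. This is not the authors' released benchmarking code; the embedding dimension, sample sizes, permutation count, and mean cross-condition cosine distance statistic are illustrative assumptions, and the stand-in NumPy vectors would in practice be embeddings of the generated instructional moves.

import numpy as np

rng = np.random.default_rng(0)

def cosine_distance(a, b):
    # 1 - cosine similarity between two 1-D vectors.
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def mean_cross_distance(full, ablated):
    # Average cosine distance between every full-context output
    # embedding and every ablated-context output embedding.
    return np.mean([cosine_distance(f, g) for f in full for g in ablated])

def randomization_test(full, ablated, n_perm=2000):
    # Permutation (randomization) test: shuffle condition labels and
    # recompute the statistic to build a null distribution.
    observed = mean_cross_distance(full, ablated)
    pooled = np.vstack([full, ablated])
    n = len(full)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if mean_cross_distance(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)  # smoothed p-value

# Toy data: 10 embeddings per condition (384-dim, an assumed size).
# A mean shift stands in for "omitting the context feature changed
# the model's instructional moves".
full_ctx = rng.normal(0.0, 1.0, size=(10, 384))
ablated_ctx = rng.normal(0.3, 1.0, size=(10, 384))
dist, p = randomization_test(full_ctx, ablated_ctx)
print(f"mean cross-condition distance = {dist:.3f}, p = {p:.4f}")

A small permutation p-value under this kind of test indicates the model's outputs are sensitive to the omitted context component, i.e., adaptive to it; a large p-value suggests the component had no measurable effect on what the model generated.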
Publication Type: Speeches/Meeting Papers; Reports - Research
Education Level: N/A
Audience: N/A
Language: English
Sponsor: Institute of Education Sciences (ED)
Authoring Institution: N/A
IES Funded: Yes
Grant or Contract Numbers: R305A220386
Department of Education Funded: Yes