Peer reviewed
ERIC Number: EJ1473727
Record Type: Journal
Publication Date: 2025-May
Pages: 24
Abstractor: As Provided
ISBN: N/A
ISSN: ISSN-2398-5348
EISSN: EISSN-2398-5356
Available Date: 2025-04-07
De-Identifying Student Personally Identifying Information in Discussion Forum Posts with Large Language Models
Andres Felipe Zambrano [1]; Shreya Singhal [2]; Maciej Pankiewicz [1]; Ryan Shaun Baker [1]; Chelsea Porter [1]; Xiner Liu [1]
Information and Learning Sciences, v126 n5-6 p401-424 2025
Purpose: This study aims to evaluate the effectiveness of three large language models (LLMs), GPT-4o, Llama 3.3 70B and Llama 3.1 8B, in redacting personally identifying information (PII) from forum data in massive open online courses (MOOCs).
Design/methodology/approach: Forum posts from students enrolled in nine MOOCs were redacted by three human reviewers. The GPT and Llama models were then tasked with de-identifying the same data set using standardized prompts. Discrepancies between LLM and human redactions were analyzed to identify patterns in LLM errors.
Findings: All models achieved an average recall above 0.9 in identifying PII and flagged PII instances that the human reviewers had overlooked. However, precision was lower: 0.579 for GPT-4o, 0.506 for Llama 3.3 and 0.262 for Llama 3.1, reflecting a tendency to over-redact non-PII names and locations.
Research limitations/implications: Data from several courses were analyzed to increase the generalizability of the findings, but the models' performance may vary in other contexts. The GPT and Llama models were selected for their availability and cost-effectiveness at the time of the study; newer models may perform better.
Practical implications: Downloadable LLMs enable researchers to de-identify data without training specialized models or involving external companies, keeping student data private.
Originality/value: Previous research on LLM-based text de-identification has largely used proprietary models, which require sharing data containing sensitive PII with third-party companies. This study evaluates two open-weight models that can be deployed locally, eliminating the need to share sensitive data externally.
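The workflow the abstract describes, prompting a locally deployed open-weight model to redact PII and then scoring its output against human redactions with precision and recall, could be sketched roughly as below. This is a minimal illustration assuming a local OpenAI-compatible inference server (e.g. vLLM or Ollama); the endpoint URL, model name, prompt wording and [REDACTED] tag are assumptions for the example, not the authors' exact protocol.

```python
# Sketch only: redact PII with a locally hosted Llama model served through an
# OpenAI-compatible endpoint, then compare flagged spans against a human
# reference. Prompt text, endpoint and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

REDACTION_PROMPT = (
    "Replace every piece of personally identifying information (names, emails, "
    "phone numbers, specific locations) in the following forum post with the "
    "tag [REDACTED]. Return only the redacted post.\n\n{post}"
)

def redact(post: str, model: str = "llama-3.3-70b-instruct") -> str:
    """Ask the local model to return the post with PII replaced by [REDACTED]."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": REDACTION_PROMPT.format(post=post)}],
        temperature=0,
    )
    return response.choices[0].message.content

def precision_recall(llm_flagged: set[str], human_flagged: set[str]) -> tuple[float, float]:
    """Precision and recall of LLM-flagged PII spans against human-flagged spans."""
    true_pos = len(llm_flagged & human_flagged)
    precision = true_pos / len(llm_flagged) if llm_flagged else 0.0
    recall = true_pos / len(human_flagged) if human_flagged else 0.0
    return precision, recall
```

Because the model runs locally, the forum text never leaves the researcher's machine; only the scoring helper touches the human-redacted reference set.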
Emerald Publishing Limited. Howard House, Wagon Lane, Bingley, West Yorkshire, BD16 1WA, UK. Tel: +44-1274-777700; Fax: +44-1274-785201; e-mail: emerald@emeraldinsight.com; Web site: http://www.emerald.com/insight
Publication Type: Journal Articles; Reports - Research
Education Level: Higher Education; Postsecondary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Identifiers - Location: Pennsylvania (Philadelphia)
Grant or Contract Numbers: N/A
Author Affiliations: [1] Graduate School of Education, University of Pennsylvania, Philadelphia, Pennsylvania, USA; [2] Graduate School of Education, Harvard University, Cambridge, Massachusetts, USA