ERIC Number: EJ1473727
Record Type: Journal
Publication Date: 2025-May
Pages: 24
Abstractor: As Provided
ISBN: N/A
ISSN: ISSN-2398-5348
EISSN: EISSN-2398-5356
Available Date: 2025-04-07
De-Identifying Student Personally Identifying Information in Discussion Forum Posts with Large Language Models
Andres Felipe Zambrano (1); Shreya Singhal (2); Maciej Pankiewicz (1); Ryan Shaun Baker (1); Chelsea Porter (1); Xiner Liu (1)
Information and Learning Sciences, v126 n5-6 p401-424 2025
Purpose: This study evaluates the effectiveness of three large language models (LLMs), GPT-4o, Llama 3.3 70B and Llama 3.1 8B, in redacting personally identifying information (PII) from forum data in massive open online courses (MOOCs). Design/methodology/approach: Forum posts from students enrolled in nine MOOCs were redacted by three human reviewers. The GPT and Llama models were then tasked with de-identifying the same data set using standardized prompts. Discrepancies between LLM and human redactions were analyzed to identify patterns in LLM errors. Findings: All models achieved an average recall of over 0.9 in identifying PII and identified PII instances overlooked by humans. However, their precision was lower (0.579 for GPT-4o, 0.506 for Llama 3.3 and 0.262 for Llama 3.1), showing a tendency to over-redact non-PII names and locations. Research limitations/implications: Data from several courses were analyzed to increase the generalizability of the findings, but the models' performance may vary in other contexts. GPT and Llama models were selected because of their availability and cost-effectiveness at the time of the study; newer models may improve performance. Practical implications: The use of downloadable LLMs enables researchers to de-identify data without training specialized models or involving external companies, ensuring that student data remains private. Originality/value: Previous research on LLM text de-identification has largely used proprietary models, which require sharing data containing sensitive PII with third-party companies. This study evaluates the performance of two open-weight models that can be deployed locally, eliminating the need to share sensitive data externally.
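Illustrative note: the recall and precision figures in the abstract compare model redactions against human redactions. The sketch below shows one common way such scores can be computed. It is not the authors' evaluation code. The `Span` type, the exact-match criterion and the toy data are all assumptions made for illustration, and the human reference set is treated as ground truth even though the abstract notes that the models found PII the reviewers missed.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical representation of one redacted text span."""
    post_id: str
    start: int  # character offset where the redaction begins
    end: int    # character offset where the redaction ends


def precision_recall(predicted: set[Span], reference: set[Span]) -> tuple[float, float]:
    """Return (precision, recall) using exact span matching (an assumption)."""
    true_positives = len(predicted & reference)
    # Precision: share of model redactions that the reference also marks as PII.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference PII spans that the model redacted.
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data invented for illustration; not taken from the study.
reference = {Span("p1", 0, 5), Span("p2", 10, 18)}
predicted = {
    Span("p1", 0, 5),
    Span("p2", 10, 18),
    Span("p2", 30, 36),  # over-redaction, e.g. a non-PII name
    Span("p3", 4, 9),    # over-redaction, e.g. a non-PII location
}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints: precision=0.500 recall=1.000
```

In this toy case the model finds every reference span (recall 1.0) but half of its redactions are unnecessary (precision 0.5), the same high-recall, lower-precision pattern the abstract describes as over-redaction.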
Descriptors: Artificial Intelligence, Identification, Privacy, Information Security, Discussion Groups, MOOCs, College Students
Emerald Publishing Limited. Howard House, Wagon Lane, Bingley, West Yorkshire, BD16 1WA, UK. Tel: +44-1274-777700; Fax: +44-1274-785201; e-mail: emerald@emeraldinsight.com; Web site: http://www.emerald.com/insight
Publication Type: Journal Articles; Reports - Research
Education Level: Higher Education; Postsecondary Education
Audience: N/A
Language: English
Sponsor: N/A
Authoring Institution: N/A
Identifiers - Location: Pennsylvania (Philadelphia)
Grant or Contract Numbers: N/A
Author Affiliations: (1) Graduate School of Education, University of Pennsylvania, Philadelphia, Pennsylvania, USA; (2) Graduate School of Education, Harvard University, Cambridge, Massachusetts, USA