Data Augmentation for Sparse Multidimensional Learning Performance Data Using Generative AI

Impact Factor: 2.9 · CAS Tier 3 (Education) · JCR Q2, Computer Science, Interdisciplinary Applications
Liang Zhang;Jionghao Lin;John Sabatini;Conrad Borchers;Daniel Weitekamp;Meng Cao;John Hollander;Xiangen Hu;Arthur C. Graesser
DOI: 10.1109/TLT.2025.3526582
Journal: IEEE Transactions on Learning Technologies, vol. 18, pp. 145–164
Published: 2025-01-07 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10830556/
Citations: 0

Abstract

Learning performance data, such as correct or incorrect answers and problem-solving attempts in intelligent tutoring systems (ITSs), facilitate the assessment of knowledge mastery and the delivery of effective instruction. However, these data tend to be highly sparse (80%–90% missing observations) in most real-world applications. This data sparsity presents challenges to using learner models to effectively predict learners' future performance and explore new hypotheses about learning. This article proposes a systematic framework for augmenting learning performance data to address data sparsity. First, learning performance data can be represented as a 3-D tensor with dimensions corresponding to learners, questions, and attempts, effectively capturing longitudinal knowledge states during learning. Second, a tensor factorization method is used to impute missing values in sparse tensors of collected learner data, thereby grounding the imputation in knowledge tracing (KT) tasks that predict missing performance values based on real observations. Third, data augmentation with generative artificial intelligence models, including generative adversarial networks (GANs, specifically vanilla GANs) and generative pretrained transformers (GPTs, specifically GPT-4o), generates data tailored to individual clusters of learning performance. We tested this systematic framework on adult literacy datasets from AutoTutor lessons developed for adult reading comprehension. We found that tensor factorization outperformed baseline KT techniques in tracing and predicting learning performance, demonstrating higher fidelity in data imputation. The vanilla GAN-based augmentation showed greater overall stability across varying sample sizes, whereas GPT-4o-based augmentation exhibited higher variability, with occasional cases showing closer fidelity to the original data distribution.
This framework facilitates the effective augmentation of learning performance data, enabling a controlled, cost-effective approach to evaluating and optimizing ITS instructional designs in both online and offline environments prior to deployment, and supporting advanced educational data mining and learning analytics.
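As a toy illustration of the vanilla-GAN augmentation step: a generator is trained adversarially against a discriminator to produce samples matching the performance distribution of one learner cluster. The paper's generator and discriminator architectures are not given here, so the one-parameter linear generator and logistic discriminator below are purely illustrative assumptions on synthetic data.

```python
# Minimal vanilla-GAN sketch in NumPy on a 1-D "cluster success rate" distribution.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

real = rng.normal(0.7, 0.1, size=512)        # pretend per-cluster success rates

# Generator g(z) = wg*z + bg ; Discriminator D(x) = sigmoid(wd*x + bd).
wg, bg, wd, bd = 0.1, 0.0, 0.0, 0.0
lr = 0.05
for _ in range(2000):
    z = rng.normal(size=64)
    x_real = rng.choice(real, 64)
    x_fake = wg * z + bg

    # Discriminator ascent step: maximize log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(wd * x_real + bd), sigmoid(wd * x_fake + bd)
    wd += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake))
    bd += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent step (non-saturating): maximize log D(fake).
    d_fake = sigmoid(wd * x_fake + bd)
    g_signal = (1 - d_fake) * wd             # d log D / d x_fake
    wg += lr * np.mean(g_signal * z)
    bg += lr * np.mean(g_signal)

# Draw augmented samples from the trained generator.
samples = wg * rng.normal(size=1000) + bg
print(samples.shape)
```

In practice the per-cluster training data would be the (imputed) performance vectors from the previous step, and convergence would be monitored by comparing generated and real distributions, as GAN training on small samples can be unstable.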
Source journal: IEEE Transactions on Learning Technologies
CiteScore: 7.50
Self-citation rate: 5.40%
Articles per year: 82
Review time: >12 weeks
Journal description: The IEEE Transactions on Learning Technologies covers all advances in learning technologies and their applications, including but not limited to the following topics: innovative online learning systems; intelligent tutors; educational games; simulation systems for education and training; collaborative learning tools; learning with mobile devices; wearable devices and interfaces for learning; personalized and adaptive learning systems; tools for formative and summative assessment; tools for learning analytics and educational data mining; ontologies for learning systems; standards and web services that support learning; authoring tools for learning materials; computer support for peer tutoring; learning via computer-mediated inquiry, field, and lab work; social learning techniques; social networks and infrastructures for learning and knowledge sharing; and creation and management of learning objects.