Generating synthetic data for CALL research with GenAI: A proof-of-concept study

Research Methods in Applied Linguistics, 4(3), Article 100248. Pub Date: 2025-08-26. DOI: 10.1016/j.rmal.2025.100248

Dennis Foung, Lucas Kohnke

URL: https://www.sciencedirect.com/science/article/pii/S2772766125000692

Abstract

Popular tools like ChatGPT have placed generative artificial intelligence (GenAI) in the spotlight in recent years. One use of GenAI tools is to generate simulated data, or synthetic data, when the full scope of the required microdata is unavailable. Despite suggestions that educational researchers use synthetic data, little (if any) computer-assisted language learning (CALL) research has done so thus far. This study addresses this research gap by exploring the possibility of using synthetic datasets in CALL. The publicly available dataset used here resembles a typical small-sample study (n = 55) performed using a CALL platform. Two synthetic datasets are generated from the original dataset in R, one using the synthpop package and one using a generative adversarial network (GAN) via the RGAN package; both are common synthetic data generation methods. This study evaluates the synthetic datasets by (a) comparing the distributions of the synthetic and original datasets, (b) comparing the parameters of linear models rebuilt on the synthetic and original datasets, and (c) examining privacy disclosure metrics. The results suggest that synthpop better represents the original data and preserves privacy, whereas the GAN-generated dataset does not produce satisfactory results. This demonstrates the key challenges of GANs alongside the potential benefits of generating synthetic data with synthpop.
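The three evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use synthpop or RGAN: the synthesizer below is a deliberately simple parametric stand-in (sample a predictor from a fitted normal distribution, then sample the outcome from a fitted line plus noise), and the two-column toy dataset, the variable names, and the helper functions `fit_line`, `synthesize`, and `replicated_records` are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(pairs):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in pairs]
    # Residual SD with n - 2 degrees of freedom.
    sd = (sum(r * r for r in resid) / (len(pairs) - 2)) ** 0.5
    return a, b, sd


def synthesize(pairs, rng):
    """Hypothetical synthesizer: x from a fitted normal, y from the fitted line plus noise."""
    xs = [x for x, _ in pairs]
    a, b, sd = fit_line(pairs)
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    out = []
    for _ in pairs:
        x = rng.gauss(mx, sx)
        out.append((x, a + b * x + rng.gauss(0, sd)))
    return out


def replicated_records(original, synthetic):
    """Count synthetic rows that exactly copy an original row (a crude disclosure check)."""
    seen = set(original)
    return sum(1 for row in synthetic if row in seen)


rng = random.Random(0)
# Toy stand-in for a small CALL dataset: (hours on platform, test score), n = 55.
# These values are simulated for the example, not taken from the study.
original = []
for _ in range(55):
    hours = rng.uniform(1, 20)
    original.append((hours, 50 + 2 * hours + rng.gauss(0, 5)))

synthetic = synthesize(original, rng)

# (a) compare distributions, here only via the mean of each column
for i, name in enumerate(["hours", "score"]):
    print(name, "mean original :", round(statistics.fmean(r[i] for r in original), 2))
    print(name, "mean synthetic:", round(statistics.fmean(r[i] for r in synthetic), 2))

# (b) compare parameters of the linear model rebuilt on each dataset
print("slope original :", round(fit_line(original)[1], 2))
print("slope synthetic:", round(fit_line(synthetic)[1], 2))

# (c) a crude privacy check: exact copies of original records
print("replicated rows:", replicated_records(original, synthetic))
```

A real evaluation would compare full distributions rather than means alone, and would use established disclosure metrics rather than an exact-match count; the sketch only shows how the three checks relate to one another.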
