{"title":"Generating synthetic data for CALL research with GenAI: A proof-of-concept study","authors":"Dennis Foung , Lucas Kohnke","doi":"10.1016/j.rmal.2025.100248","DOIUrl":null,"url":null,"abstract":"<div><div>Popular tools like ChatGPT have placed generative artificial intelligence (GenAI) in the spotlight in recent years. One use of GenAI tools is to generate simulated data—or synthetic data—when the full scope of the required microdata is unavailable. Despite suggestions for educational researchers to use synthetic data, little (if any) computer-assisted language learning (CALL) research has used synthetic data thus far. This study addresses this research gap by exploring the possibility of using synthetic datasets in CALL. The publicly available dataset resembles a typical study with a small sample size (<em>n</em> = 55) performed using a CALL platform. Two synthetic datasets are generated from the original datasets using the <em>synthpop</em> package and generative adversarial networks (GAN) in <em>R</em> (via the <em>RGAN</em> package), which are both common synthetic data generation methods. This study evaluates the synthetic datasets by (a) comparing the distribution between the synthetic and original datasets, (b) examining the model parameters of the rebuilt linear models using the synthetic and original datasets, and (c) examining the privacy disclosure metrics. The results suggest that <em>synthpop</em> better represents the original data and preserves privacy. Notably, the GAN-generated dataset does not produce satisfactory results. This demonstrates GAN’s key challenges alongside the potential benefits of generating synthetic data with <em>synthpop</em>.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100248"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Methods in Applied Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772766125000692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Popular tools like ChatGPT have placed generative artificial intelligence (GenAI) in the spotlight in recent years. One use of GenAI tools is to generate simulated data—or synthetic data—when the full scope of the required microdata is unavailable. Despite suggestions for educational researchers to use synthetic data, little (if any) computer-assisted language learning (CALL) research has used synthetic data thus far. This study addresses this research gap by exploring the possibility of using synthetic datasets in CALL. The publicly available dataset resembles a typical study with a small sample size (n = 55) performed using a CALL platform. Two synthetic datasets are generated from the original datasets using the synthpop package and generative adversarial networks (GAN) in R (via the RGAN package), which are both common synthetic data generation methods. This study evaluates the synthetic datasets by (a) comparing the distribution between the synthetic and original datasets, (b) examining the model parameters of the rebuilt linear models using the synthetic and original datasets, and (c) examining the privacy disclosure metrics. The results suggest that synthpop better represents the original data and preserves privacy. Notably, the GAN-generated dataset does not produce satisfactory results. This demonstrates GAN’s key challenges alongside the potential benefits of generating synthetic data with synthpop.