Exploiting GPT for synthetic data generation: An empirical study

Impact factor: 10 · Zone 1 (Management) · JCR Q1 (Information Science & Library Science)
Tony Busker, Sunil Choenni, Mortaza S. Bargh
Journal: Government Information Quarterly, Volume 42, Issue 1, Article 101988
Publication type: Journal Article
Published: 2024-12-19
DOI: 10.1016/j.giq.2024.101988
URL: https://www.sciencedirect.com/science/article/pii/S0740624X24000807
Citations: 0

Abstract

There are many good reasons to use synthetic data instead of real data for research purposes. These reasons range from the business sensitivity of real data to the increased cost of collecting real data in accordance with GDPR requirements. In this paper, we elaborate upon the potential of the Large Language Model GPT as a tool to generate synthetic data for analytical purposes when no real data is available or accessible. First, we show that by varying the scope of probes appropriately, we can generate data of different granularities. To show this, we generated stereotypical data with three levels of granularity by posing more than 18,500 probes to GPT. In total, we generated stereotypical data for eight different views, which can be categorized into three view types corresponding to the three levels of granularity. Second, we show that by varying the scope of probes one can create meaningful information. To show this, we performed a so-called similarity analysis on the generated stereotypical data. We used data visualizations, e.g. heatmaps, to show which views, and which categories within the views, are similar and which are at odds with each other. We elaborate upon the application areas of the insight gained about such similarities and differences. Furthermore, we discuss several other types of analysis that can be performed on the generated stereotypical data.
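
As a rough illustration of what a similarity analysis over generated data could look like, the sketch below compares the sets of terms returned for each category and builds a pairwise similarity matrix that could then be rendered as a heatmap. The abstract does not specify the similarity measure or the data layout, so the use of Jaccard similarity, the function names, and the placeholder categories and terms are all assumptions made for illustration only.

```python
from itertools import product


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def similarity_matrix(responses: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Pairwise similarity between every pair of categories (including self-pairs)."""
    return {
        (x, y): jaccard(responses[x], responses[y])
        for x, y in product(responses, repeat=2)
    }


# Hypothetical placeholder data: terms a model might return per category.
# These names are illustrative and do not come from the paper.
responses = {
    "category_a": {"trait_1", "trait_2", "trait_3"},
    "category_b": {"trait_2", "trait_3", "trait_4"},
    "category_c": {"trait_5"},
}

matrix = similarity_matrix(responses)

# Print the matrix row by row; each row could be one row of a heatmap.
for row in responses:
    print(row, [round(matrix[(row, col)], 2) for col in responses])
```

Set-based overlap is only one possible choice; a frequency-weighted measure such as cosine similarity would fit equally well if the generated data records how often each term appears.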
Source journal
Government Information Quarterly (Information Science & Library Science)
CiteScore: 15.70
Self-citation rate: 16.70%
Articles published: 106
Journal description: Government Information Quarterly (GIQ) delves into the convergence of policy, information technology, government, and the public. It explores the impact of policies on government information flows, the role of technology in innovative government services, and the dynamic between citizens and governing bodies in the digital age. GIQ serves as a premier journal, disseminating high-quality research and insights that bridge the realms of policy, information technology, government, and public engagement.