Exploiting GPT for synthetic data generation: An empirical study

IF 10 · Zone 1 (Management) · Q1 INFORMATION SCIENCE & LIBRARY SCIENCE

Government Information Quarterly Pub Date : 2024-12-19 DOI:10.1016/j.giq.2024.101988

Tony Busker, Sunil Choenni, Mortaza S. Bargh

Government Information Quarterly, Volume 42, Issue 1, Article 101988. Journal Article. https://www.sciencedirect.com/science/article/pii/S0740624X24000807

Citations: 0

Abstract

There are many good reasons to use synthetic data instead of real data for research purposes. These reasons range from the business sensitivity of real data to the increased cost of collecting real data in accordance with GDPR requirements. In this paper, we elaborate upon the potential of the Large Language Model GPT as a tool to generate synthetic data for analytical purposes when no real data is available or accessible. Primarily, we show that by adequately varying the scope of probes, we can generate data of different granularities. To show this, we generated stereotypical data with three levels of granularity by posing more than 18,500 probes to GPT. In total, we generated stereotypical data for eight different views, which can be categorized into three view types corresponding to the three levels of granularity. Secondarily, we show that by varying the scope of probes one can create meaningful information. To show this, we performed a so-called similarity analysis on the generated stereotypical data. We used data visualizations, e.g. heatmaps, to show which views, and which categories within the views, are similar and which are at odds with each other. We elaborate upon the application areas of the insights gained about such similarities and differences. Furthermore, we discuss several other types of analysis that can be performed on the generated stereotypical data.
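As a rough illustration of the similarity analysis mentioned in the abstract, the sketch below builds a pairwise similarity matrix between views, which is the kind of table a heatmap would visualize. The abstract does not state which similarity measure the authors used; Jaccard similarity over attribute sets is an assumption made here. The view names, attribute names, and the helper functions `jaccard` and `similarity_matrix` are hypothetical placeholders, not data or code from the paper.

```python
# Hypothetical toy data: each view maps to the set of attributes that
# probes returned for it. These are placeholders, not results from the paper.
views = {
    "view_a": {"attr_1", "attr_2", "attr_3"},
    "view_b": {"attr_2", "attr_3", "attr_4"},
    "view_c": {"attr_5"},
}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_matrix(views):
    """Return sorted view names and the square matrix of pairwise similarities."""
    names = sorted(views)
    matrix = [[jaccard(views[r], views[c]) for c in names] for r in names]
    return names, matrix


names, matrix = similarity_matrix(views)

# Print the matrix as plain text; each row could feed a heatmap cell by cell.
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

High values mark views whose generated attributes overlap, and values near zero mark views that are at odds with each other. Any other set or vector similarity measure could be substituted for `jaccard` without changing the rest of the sketch.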


Source journal

Government Information Quarterly (INFORMATION SCIENCE & LIBRARY SCIENCE)

CiteScore: 15.70

Self-citation rate: 16.70%

Articles published: 106

About the journal: Government Information Quarterly (GIQ) examines the intersection of policy, information technology, government, and the public. It explores the impact of policies on government information flows, the role of technology in innovative government services, and the relationship between citizens and governing bodies in the digital age. GIQ publishes high-quality research and insights that bridge policy, information technology, government, and public engagement.