Generative Adversarial Networks for Synthetic Data Generation in Finance: Evaluating Statistical Similarities and Quality Assessment

AI, Pub Date: 2024-05-13, DOI: 10.3390/ai5020035

Faisal Ramzan, Claudio Sartori, Sergio Consoli, Diego Reforgiato Recupero


Abstract

Generating synthetic data is a complex task that requires accurately replicating the statistical and mathematical properties of the original data. In sectors such as finance, using and sharing real data for research or model development can pose substantial privacy risks because the data contain sensitive information. In addition, authentic data may be scarce, particularly in specialized domains where acquiring ample, varied, and high-quality data is difficult or costly. This scarcity can limit the training and testing of machine-learning models. In this paper, we address this challenge: our task is to synthesize a dataset with properties similar to those of an input dataset about the stock market. The input dataset is anonymized, consists of very few columns and rows, contains many inconsistencies such as missing rows and duplicates, and its values are not normalized, scaled, or balanced. We explore the use of generative adversarial networks, a deep-learning technique, to generate synthetic data and evaluate its quality against the input stock dataset. Our contribution is to generate artificial datasets that mimic the statistical properties of the input elements without revealing complete information. For example, synthetic datasets can capture the distribution of stock prices, trading volumes, and market trends observed in the original dataset. The generated datasets cover a wider range of scenarios and variations, enabling researchers and practitioners to explore different market conditions and investment strategies. This diversity can enhance the robustness and generalization of machine-learning models. We evaluate our synthetic data in terms of the mean, similarities, and correlations.
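
The evaluation criteria named in the abstract (mean and correlations) can be illustrated with a minimal sketch. This is not the authors' code: the function name `compare_datasets`, the two-column toy data, and the use of random draws as a stand-in for GAN output are all assumptions made for illustration, and the abstract does not specify which similarity measure the paper uses.

```python
import numpy as np


def compare_datasets(real, synthetic):
    """Compare two 2-D arrays (rows = records, columns = features).

    Hypothetical helper for illustration; not taken from the paper.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    # Per-column absolute difference between the means.
    mean_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))

    # Pearson correlation matrices across columns.
    corr_real = np.corrcoef(real, rowvar=False)
    corr_syn = np.corrcoef(synthetic, rowvar=False)

    # Largest absolute difference between matching correlation entries.
    corr_gap = float(np.max(np.abs(corr_real - corr_syn)))

    return {"mean_gap": mean_gap, "corr_gap": corr_gap}


# Toy data (assumption): two correlated columns standing in for, say,
# price and volume. The "synthetic" sample is drawn from the same
# distribution purely as a placeholder for a generator's output.
rng = np.random.default_rng(0)
cov = [[1.0, 0.6], [0.6, 1.0]]
real = rng.multivariate_normal([100.0, 5.0], cov, size=2000)
synthetic = rng.multivariate_normal([100.0, 5.0], cov, size=2000)

report = compare_datasets(real, synthetic)
print(report)
```

Small values for both gaps suggest that the synthetic sample reproduces the first-order statistics and the linear dependence structure of the input. They say nothing about privacy leakage or about higher-order and temporal structure, which would need separate checks.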
