A distributive and attentive generative model for multi-party data synthesis in highly imbalanced data

IF 6.2 | Zone 2, Computer Science | JCR Q1: COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience | Pub Date: 2025-09-25 | DOI: 10.1016/j.future.2025.108166

Imam Mustafa Kamal, Chastine Fatichah

Volume 176, Article 108166 | Full text: https://www.sciencedirect.com/science/article/pii/S0167739X25004601

Citations: 0

Abstract

In the era of Artificial Intelligence (AI), where data plays a pivotal role, researchers are increasingly leveraging synthetic data to address privacy concerns, mitigate data scarcity, and enhance model robustness. This approach is particularly promising in critical domains such as healthcare, finance, government, and autonomous systems, where diverse and representative datasets are essential for effective AI training. The integration of data from multiple sources or parties in the context of big data can significantly enrich the available information. However, the data contributed by each party often exhibits distinct characteristics, leading to highly imbalanced distributions. This challenge introduces an additional layer of complexity known as the double imbalance problem, characterized by imbalances both within individual parties and across multiple parties. To address these challenges, we propose a novel generative adversarial network (GAN) framework incorporating distributed discriminators and dual attention mechanisms. Our approach utilizes a single generator to synthesize data conditioned on multiple parties, with each party maintaining its own Critic and dataset to ensure privacy preservation. We introduce local and global attention mechanisms, along with gradient-casting techniques during training, to effectively address the dual imbalance issues prevalent in multi-party data synthesis. The local attention mechanism addresses imbalances within individual parties, while the global attention mechanism targets imbalances across parties, resulting in a more stable generative model in the presence of highly imbalanced data distributions. To validate our approach, we conducted empirical experiments using six real-world tabular datasets, deliberately setting up dual imbalance scenarios across various intra- and inter-party contexts. We evaluated the utility of the synthetic data generated by multiple parties by assessing its efficacy in machine learning tasks. The results demonstrate that our distributed GAN with dual attention mechanisms outperforms existing generative models in addressing these challenges.
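To make the double imbalance problem described in the abstract concrete, the following minimal sketch measures both kinds of skew on toy data. Everything in it is an assumption for illustration: the party names, the label counts, and the helper functions `imbalance_ratio` and `inverse_frequency_weights` are hypothetical. The inverse-frequency weights are a simple stand-in for the idea of paying more attention to under-represented classes and parties; they are not the paper's local or global attention mechanism, whose details the abstract does not give.

```python
from collections import Counter


def imbalance_ratio(counts):
    """Largest count divided by smallest count (1.0 means perfectly balanced)."""
    values = list(counts.values())
    return max(values) / min(values)


def inverse_frequency_weights(counts):
    """Weights proportional to 1/count, normalized so they sum to 1."""
    inverse = {key: 1.0 / n for key, n in counts.items()}
    total = sum(inverse.values())
    return {key: w / total for key, w in inverse.items()}


# Hypothetical toy data: class labels held privately by each party.
party_labels = {
    "party_a": [0] * 900 + [1] * 100,
    "party_b": [0] * 45 + [1] * 5,
}

# Intra-party imbalance: class skew inside each party's own dataset.
class_counts = {party: Counter(labels) for party, labels in party_labels.items()}
intra = {party: imbalance_ratio(counts) for party, counts in class_counts.items()}

# Inter-party imbalance: skew in how much data each party contributes.
party_sizes = {party: len(labels) for party, labels in party_labels.items()}
inter = imbalance_ratio(party_sizes)

# Stand-in weights (an assumption, not the paper's method): rare classes and
# small parties receive proportionally larger weights.
local_weights = {
    party: inverse_frequency_weights(counts) for party, counts in class_counts.items()
}
global_weights = inverse_frequency_weights(party_sizes)

print("intra-party imbalance ratios:", intra)
print("inter-party imbalance ratio:", inter)
print("per-class weights within each party:", local_weights)
print("per-party weights:", global_weights)
```

In this toy setup each party is skewed toward class 0 and one party holds far more rows than the other, so both levels of imbalance appear at once, which is the scenario the abstract says the experiments deliberately construct.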


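Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months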

About the journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.