Title: A distributive and attentive generative model for multi-party data synthesis in highly imbalanced data
Authors: Imam Mustafa Kamal, Chastine Fatichah
Journal: Future Generation Computer Systems-The International Journal of Escience, vol. 176, Article 108166
Publication date: 2025-09-25
Publication type: Journal Article
DOI: 10.1016/j.future.2025.108166
URL: https://www.sciencedirect.com/science/article/pii/S0167739X25004601
Impact factor: 6.2 (JCR Q1, Computer Science, Theory & Methods)
Region: 2 (Computer Science)
Open access: No
Citation count: 0
Abstract
In the era of Artificial Intelligence (AI), where data plays a pivotal role, researchers are increasingly leveraging synthetic data to address privacy concerns, mitigate data scarcity, and enhance model robustness. This approach is particularly promising in critical domains such as healthcare, finance, government, and autonomous systems, where diverse and representative datasets are essential for effective AI training. The integration of data from multiple sources or parties in the context of big data can significantly enrich the available information. However, the data contributed by each party often exhibits distinct characteristics, leading to highly imbalanced distributions. This challenge introduces an additional layer of complexity known as the double imbalance problem, characterized by imbalances both within individual parties and across multiple parties. To address these challenges, we propose a novel generative adversarial network (GAN) framework incorporating distributed discriminators and dual attention mechanisms. Our approach utilizes a single generator to synthesize data conditioned on multiple parties, with each party maintaining its own Critic and dataset to ensure privacy preservation. We introduce local and global attention mechanisms, along with gradient-casting techniques during training, to effectively address the dual imbalance issues prevalent in multi-party data synthesis. The local attention mechanism addresses imbalances within individual parties, while the global attention mechanism targets imbalances across parties, resulting in a more stable generative model in the presence of highly imbalanced data distributions. To validate our approach, we conducted empirical experiments using six real-world tabular datasets, deliberately setting up dual imbalance scenarios across various intra- and inter-party contexts. We evaluated the utility of the synthetic data generated by multiple parties by assessing its efficacy in machine learning tasks. The results demonstrate that our distributed GAN with dual attention mechanisms outperforms existing generative models in addressing these challenges.
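The double imbalance described in the abstract (imbalance within each party and across parties) can be made concrete with a toy calculation. The sketch below is not taken from the paper: the party names, the label counts, and the max/min ratio used as an imbalance measure are all assumptions chosen purely for illustration.

```python
from collections import Counter


def imbalance_ratio(counts):
    """Ratio of the largest to the smallest count (1.0 means perfectly balanced)."""
    counts = list(counts)
    return max(counts) / min(counts)


# Hypothetical class labels held by three parties (toy data, not from the paper).
parties = {
    "party_a": [0] * 900 + [1] * 100,
    "party_b": [0] * 45 + [1] * 5,
    "party_c": [0] * 10 + [1] * 10,
}

# Intra-party imbalance: majority-to-minority class ratio inside each party.
intra = {
    name: imbalance_ratio(Counter(labels).values())
    for name, labels in parties.items()
}

# Inter-party imbalance: size of the largest party relative to the smallest.
inter = imbalance_ratio(len(labels) for labels in parties.values())

for name, ratio in intra.items():
    print(f"{name}: intra-party class ratio = {ratio:.1f}")
print(f"inter-party size ratio = {inter:.1f}")
```

In this toy setup, two parties are skewed internally while the party sizes themselves also differ widely, which is the combination the abstract refers to as double (or dual) imbalance.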
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.