Research on modeling of the imbalanced fraudulent transaction detection problem based on embedding-aware conditional GAN

IF 4.2 | Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence

Big Data Research | Pub Date: 2025-08-13 | DOI: 10.1016/j.bdr.2025.100557

Luping Zhi, Wanmin Wang

Volume 41, Article 100557. Publisher page: https://www.sciencedirect.com/science/article/pii/S2214579625000528

Citations: 0

Abstract

Detecting fraudulent transactions in structured financial data presents significant challenges due to multimodal, non-Gaussian continuous variables, mixed-type features, and severe class imbalance. To address these issues, we propose an Embedding-Aware Conditional Generative Adversarial Network (EAC-GAN), which incorporates trainable label embeddings into both the generator and discriminator to enable semantically controlled synthesis of minority-class samples. In addition to adversarial training, EAC-GAN introduces an auxiliary classification objective, forming a joint optimization strategy that improves the fidelity and class consistency of generated data, especially for underrepresented classes. Experiments conducted on a real-world credit card dataset demonstrate that EAC-GAN achieves stable convergence even with limited labeled data. When combined with LightGBM classifiers, the synthetic samples generated by EAC-GAN significantly enhance fraud detection performance, yielding a precision of 96.8%, an AUC of 96.38%, an AUPRC of 83.89%, and an MCC of 88.94%. Furthermore, dimensionality reduction using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) reveals that the generated samples closely align with the real data distribution and exhibit clear class separability in the latent space. These results underscore the effectiveness of EAC-GAN in synthesizing high-quality minority-class samples and improving downstream fraud detection, outperforming traditional oversampling techniques and baseline generative models.
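The two ideas the abstract combines, conditioning on a trainable label embedding and adding an auxiliary classification term to the adversarial loss, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say how the embedding is merged with the noise vector, what the layer sizes are, or how the two loss terms are weighted. Concatenation, the dimensions `EMBED_DIM` and `NOISE_DIM`, and the `aux_weight` parameter are all assumptions made for illustration, and the discriminator and classifier outputs are passed in as plain probabilities rather than computed by a network.

```python
import math
import random

random.seed(0)

NUM_CLASSES = 2   # 0 = legitimate, 1 = fraud
EMBED_DIM = 3     # hypothetical embedding width
NOISE_DIM = 4     # hypothetical noise width

# Label-embedding table. In a real model these rows are trainable parameters;
# here they are only randomly initialised and never updated.
label_embedding = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(NUM_CLASSES)
]


def generator_input(label):
    """Build the generator input: noise concatenated with the label's embedding.

    Concatenation is an assumed merge strategy, not one stated in the abstract.
    """
    noise = [random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
    return noise + label_embedding[label]


def bce(p, target):
    """Binary cross-entropy for a single predicted probability p."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def generator_loss(d_real_prob, class_prob_for_label, aux_weight=1.0):
    """Joint generator objective (assumed form).

    d_real_prob:          discriminator's probability that the sample is real
    class_prob_for_label: auxiliary classifier's probability of the requested label
    """
    adversarial = bce(d_real_prob, 1.0)          # reward fooling the discriminator
    auxiliary = bce(class_prob_for_label, 1.0)   # reward matching the requested class
    return adversarial + aux_weight * auxiliary
```

Under this assumed objective, a generated fraud sample that fools the discriminator but is classified as legitimate still incurs a large loss, which is one way a joint objective could push generated minority-class samples toward class consistency.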

Source journal: Big Data Research (Computer Science: Computer Science Applications)

CiteScore: 8.40

Self-citation rate: 3.00%

About the journal: The journal aims to promote and communicate advances in big data research by providing a fast, high-quality forum for researchers, practitioners, and policy makers from the many different communities working on and with this topic. The journal accepts papers on foundational aspects of dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, the journal also considers papers demonstrating applications of big data in domains as diverse as geoscience, the social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, and security and government. Occasionally the journal may publish whitepapers on policies, standards, and best practices.