BalancerGNN: Balancer Graph Neural Networks for imbalanced datasets: A case study on fraud detection

IF 6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Networks | Pub Date: 2024-11-23 | DOI: 10.1016/j.neunet.2024.106926

Mallika Boyapati, Ramazan Aygun

Neural Networks, Volume 182, Article 106926. https://www.sciencedirect.com/science/article/pii/S0893608024008554

Citations: 0

Abstract

Fraud detection for imbalanced datasets is challenging because machine learning models tend to learn the majority class. Imbalance in fraud detection datasets affects how graphs are built, an important step in many Graph Neural Networks (GNNs). In this paper, we introduce our BalancerGNN framework to tackle imbalanced datasets and show its effectiveness on fraud detection. Our framework has three major components: (i) node construction with feature representations, (ii) graph construction using balanced neighbor sampling, and (iii) GNN training using balanced training batches leveraging a custom loss function with multiple components. For node construction, we have introduced (i) Graph-based Variable Clustering (GVC) to optimize feature selection and remove redundancies by analyzing multi-collinearity and (ii) Encoder–Decoder based Dimensionality Reduction (EDDR) using transformer-based techniques to reduce feature dimensions while keeping important information about textual embeddings intact. Our experiments on Medicare, Equifax, IEEE, and auto insurance fraud datasets highlight the importance of node construction with feature representations. BalancerGNN trained with balanced batches consistently outperforms other methods, showing strong abilities in identifying fraud cases, with sensitivity rates ranging from 72.87% to 81.23% across datasets while balancing specificity. Additionally, BalancerGNN achieves accuracy rates ranging from 73.99% to 94.28%. These outcomes underscore the crucial role of graph representation and neighbor sampling techniques in optimizing BalancerGNN for fraud detection models in real-world applications.
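The abstract names balanced training batches and reports sensitivity alongside accuracy, but gives no implementation detail. The following is a minimal sketch of those two ideas only, not the authors' code: the function names `balanced_batches` and `sensitivity_specificity_accuracy`, the 50/50 class split, and sampling with replacement are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def balanced_batches(
    labels: Sequence[int],
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[int]]:
    """Yield index batches with equal numbers of fraud (1) and non-fraud (0).

    Hypothetical helper: both classes are sampled with replacement, so the
    rare fraud class can fill half of every batch even when it is tiny.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50/50 split")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        batch = rng.choices(pos, k=half) + rng.choices(neg, k=half)
        rng.shuffle(batch)  # avoid a fixed fraud-first ordering
        yield batch


def sensitivity_specificity_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # fraud recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # non-fraud recall
    accuracy = (tp + tn) / len(y_true) if len(y_true) else 0.0
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Toy data: 5% fraud, 95% non-fraud (illustrative, not from the paper).
    labels = [1] * 5 + [0] * 95
    for batch in balanced_batches(labels, batch_size=8, num_batches=2):
        print(sorted(labels[i] for i in batch))
    # A model that always predicts "non-fraud" looks accurate but finds nothing.
    print(sensitivity_specificity_accuracy(labels, [0] * len(labels)))
```

On the toy data, a predictor that always outputs "non-fraud" reaches 95% accuracy with zero sensitivity, which illustrates why the abstract reports sensitivity and specificity in addition to accuracy.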


Source journal

Neural Networks (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.90

Self-citation rate: 7.70%

Articles published: 425

Review time: 67 days

About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.