A Novel Approach for Unsupervised Learning of Highly-Imbalanced Data

Robert K. L. Kennedy, Zahra Salekshahrezaee, T. Khoshgoftaar
Published in: 2022 IEEE 4th International Conference on Cognitive Machine Intelligence (CogMI), 2022-12-01
DOI: 10.1109/CogMI56440.2022.00018 (https://doi.org/10.1109/CogMI56440.2022.00018)
Citations: 2

Abstract

Typical fraud datasets lack consistent and accurate labels and are highly imbalanced, with non-fraud examples greatly outnumbering fraudulent ones. This presents significant challenges to machine learning researchers and practitioners. An effective approach to identifying fraudulent data points must therefore handle highly imbalanced datasets and be robust to unreliable class labels. This paper introduces a novel unsupervised procedure that learns from imbalanced datasets without class labels by iteratively cleaning the training dataset. Our methodology uses an autoencoder as the underlying learner. We describe its fraud detection performance and compare it to a baseline unsupervised fraud detection learner. Our results show that our procedure significantly outperforms the baseline in both area under the ROC curve (AUC) and true positive rate (TPR) when tested on a publicly available, highly imbalanced credit card fraud detection dataset.
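
The abstract does not spell out the cleaning procedure, so the following is only a rough sketch of what "iteratively cleaning the training dataset" with an autoencoder could look like, not the authors' method. It assumes a linear autoencoder fitted by SVD (equivalent to PCA) as a stand-in for a neural autoencoder, and it assumes that each round drops the fixed fraction of training points with the highest reconstruction error. The function names and the parameters `n_components`, `drop_fraction`, and `n_rounds` are hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(X, n_components):
    """Fit a linear autoencoder via SVD (a PCA stand-in, not a neural network)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(X, mean, components):
    """Squared reconstruction error per row; higher means more anomalous."""
    codes = (X - mean) @ components.T        # encode
    X_hat = codes @ components + mean        # decode
    return ((X - X_hat) ** 2).sum(axis=1)


def iterative_clean(X, n_components=2, drop_fraction=0.05, n_rounds=3):
    """Assumed cleaning loop: refit, then drop the worst-reconstructed points.

    Returns the final (mean, components) and the indices of the rows kept.
    No class labels are used at any point.
    """
    keep = np.arange(len(X))
    for _ in range(n_rounds):
        mean, comps = fit_linear_autoencoder(X[keep], n_components)
        err = reconstruction_error(X[keep], mean, comps)
        n_drop = int(len(keep) * drop_fraction)
        if n_drop == 0:
            break
        # Keep the points with the lowest reconstruction error.
        keep = keep[np.argsort(err)[: len(keep) - n_drop]]
    mean, comps = fit_linear_autoencoder(X[keep], n_components)
    return mean, comps, keep
```

After cleaning, `reconstruction_error` on held-out data gives an anomaly score per row; ranking rows by that score is what metrics such as AUC and TPR would then evaluate. The fixed drop fraction is a design choice of this sketch: it trades off removing suspected fraud from the training set against discarding legitimate examples.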