SHAP-Based Feature Selection for Enhanced Unsupervised Labeling

Impact Factor 3.6 · CAS Zone 3 (Computer Science) · JCR Q2 (Computer Science, Information Systems)
Mary Anne Walauskis;Taghi M. Khoshgoftaar
Journal: IEEE Access, vol. 13, pp. 130098-130109
Publication date: 2025-07-22
Publication type: Journal Article
DOI: 10.1109/ACCESS.2025.3591554
URL: https://ieeexplore.ieee.org/document/11088106/
Citations: 0

Abstract

Manual dataset labeling is expensive, time-consuming, and susceptible to noise and inaccuracies; it often requires significant financial investment and risks inconsistencies among human annotators. These challenges are compounded in domains such as fraud detection, where manual annotation raises privacy concerns and severe class imbalance degrades machine learning models. Our unsupervised approach integrates SHapley Additive exPlanations (SHAP) for feature selection with our novel unsupervised labeling method, which uses an unsupervised ensemble method in conjunction with a percentile-based threshold technique. We apply it to the widely used Kaggle Credit Card Fraud Detection dataset. Using unsupervised SHAP-based feature selection to identify the most impactful features, we create subsets with three and five features and also use the full-feature dataset. To evaluate, we compare the newly generated binary class labels to the actual labels, which were used only for evaluation, and calculate the Matthews Correlation Coefficient (MCC), Jaccard Index (JI), and Precision. We also compare our method to an unsupervised baseline and show significant improvements. Our empirical results demonstrate that unsupervised SHAP-based feature selection consistently improves the quality of our labels compared to the baseline unsupervised method. Moreover, the SHAP-selected feature subsets yield higher label quality than the full-feature dataset while reducing computational complexity. Our work provides an unsupervised framework capable of addressing the challenges of labeling highly imbalanced and unlabeled data while preserving data privacy, given the unsupervised nature of our methodology and the application of unsupervised SHAP-based feature selection.
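The labeling and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the anomaly scores below are hypothetical stand-ins for the output of an unsupervised ensemble, the 80th-percentile cutoff and the nearest-rank percentile rule are assumptions chosen only for illustration, and the SHAP-based feature selection step is not reproduced. The sketch shows how a percentile threshold turns scores into binary labels and how MCC, JI, and Precision are computed against ground-truth labels held out for evaluation.

```python
import math


def percentile_threshold_labels(scores, percentile):
    """Label an instance 1 if its score is at or above the given percentile
    of all scores, else 0. Uses the nearest-rank percentile definition
    (an assumption; the abstract does not say which definition is used)."""
    ranked = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ranked) / 100))
    cutoff = ranked[rank - 1]
    return [1 if s >= cutoff else 0 for s in scores]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp, fn, tn):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def jaccard_index(tp, fp, fn, tn):
    """Intersection over union of the predicted and actual positive sets."""
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical anomaly scores (higher = more suspicious) and true labels.
scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.25, 0.80, 0.05, 0.35, 0.70]
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

y_pred = percentile_threshold_labels(scores, percentile=80)
counts = confusion_counts(y_true, y_pred)

print("labels:   ", y_pred)
print("precision:", round(precision(*counts), 4))
print("jaccard:  ", round(jaccard_index(*counts), 4))
print("mcc:      ", round(matthews_corrcoef(*counts), 4))
```

MCC is a natural choice for this setting because it accounts for all four confusion-matrix cells, so a labeler cannot score well on a heavily imbalanced dataset simply by predicting the majority class.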
Source journal: IEEE Access
Journal categories: Computer Science, Information Systems; Engineering, Electrical & Electronic
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
About the journal: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.