SHAP-Based Feature Selection for Enhanced Unsupervised Labeling

IF 3.6 | Zone 3, Computer Science | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Access Pub Date : 2025-07-22 DOI:10.1109/ACCESS.2025.3591554

Mary Anne Walauskis;Taghi M. Khoshgoftaar

IEEE Access, vol. 13, pp. 130098-130109. Journal Article. https://ieeexplore.ieee.org/document/11088106/

Citations: 0

Abstract

Manual dataset labeling is expensive, time-consuming, and susceptible to noise and inaccuracies; it often requires significant financial investment and risks inconsistencies among human annotators. These challenges are compounded in domains such as fraud detection, where manual annotation raises privacy concerns and severe class imbalance degrades machine learning models. Our unsupervised approach integrates SHapley Additive exPlanations (SHAP) for feature selection with our novel unsupervised labeling method, which uses an unsupervised ensemble method in conjunction with a percentile-based threshold technique, and we apply it to the widely used Kaggle Credit Card Fraud Detection dataset. We create subsets with three and five features using unsupervised SHAP-based feature selection to determine the most impactful features, and we also use the full-feature dataset. To evaluate, we compare the newly generated binary class labels to the actual labels, which are used only for evaluation, and calculate the Matthews Correlation Coefficient (MCC), Jaccard Index (JI), and Precision. Furthermore, we compare our method to an unsupervised baseline and show significant improvements. Our empirical results demonstrate that unsupervised SHAP-based feature selection consistently improves the quality of our labels compared to the baseline unsupervised method. Lastly, the SHAP-selected feature subsets yield higher label quality than the full-feature dataset while reducing computational complexity. Our work provides an unsupervised framework capable of addressing the challenges of labeling highly imbalanced, unlabeled data while preserving data privacy, given the unsupervised nature of our methodology and its use of unsupervised SHAP-based feature selection.
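
The labeling and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ensemble, the SHAP procedure, or the percentile used, so the sketch assumes anomaly scores have already been produced by some unsupervised model. The function names `percentile_threshold_labels` and `evaluate_labels`, the 80th-percentile cutoff, and the toy data are all hypothetical. It shows only how a percentile cutoff turns scores into binary labels and how MCC, the Jaccard Index, and Precision are computed from the resulting confusion counts.

```python
from math import ceil, sqrt


def percentile_threshold_labels(scores, percentile):
    """Label as positive (1) every score strictly above the nearest-rank percentile.

    `percentile` is an assumed, user-chosen value in (0, 100].
    """
    ordered = sorted(scores)
    # Nearest-rank percentile: smallest value with at least `percentile`%
    # of the scores at or below it.
    rank = max(1, ceil(percentile * len(ordered) / 100))
    cutoff = ordered[rank - 1]
    return [1 if s > cutoff else 0 for s in scores]


def evaluate_labels(y_true, y_pred):
    """Compute MCC, Jaccard Index, and Precision for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"mcc": mcc, "jaccard": jaccard, "precision": precision}


# Toy data (hypothetical): anomaly scores and ground-truth labels that are
# used only for evaluation, never for generating the labels.
scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.05, 0.80, 0.25, 0.12, 0.22]
y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

y_pred = percentile_threshold_labels(scores, percentile=80)
metrics = evaluate_labels(y_true, y_pred)
print(y_pred)
print(metrics)
```

On this toy input the two highest scores are labeled positive, giving one true positive, one false positive, and one false negative. With severe class imbalance, metrics such as MCC and the Jaccard Index are less flattering than accuracy because neither is inflated by the large number of true negatives.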




Source journal

IEEE Access (COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks

Journal description: IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest. IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on: Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals. Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering. Development of new or improved fabrication or manufacturing techniques. Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.