SHAP-Based Feature Selection for Enhanced Unsupervised Labeling
Authors: Mary Anne Walauskis; Taghi M. Khoshgoftaar
Journal: IEEE Access, vol. 13, pp. 130098-130109
Publication date: 2025-07-22
Publication type: Journal Article
DOI: 10.1109/ACCESS.2025.3591554
URL: https://ieeexplore.ieee.org/document/11088106/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11088106
Impact factor: 3.6
JCR: Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Citations: 0
Abstract
Manual dataset labeling is expensive, time-consuming, and susceptible to noise and inaccuracies; it often requires significant financial investment and risks inconsistencies across human annotators. These challenges are compounded in domains such as fraud detection, where manual annotation raises privacy concerns and severe class imbalance negatively impacts machine learning models. Our unsupervised approach integrates SHapley Additive exPlanations (SHAP) for feature selection with our novel unsupervised labeling method, which uses an ensemble unsupervised method in conjunction with a percentile-based threshold technique, and we apply it to the widely used Kaggle Credit Card Fraud Detection dataset. We create subsets with three and five features using unsupervised SHAP-based feature selection to determine the most impactful features, and we also use the full-feature dataset. To evaluate, we compare the newly generated binary class labels to the actual labels, which were used only for evaluation, and calculate the Matthews Correlation Coefficient (MCC), Jaccard Index (JI), and Precision. Furthermore, we compare our method to an unsupervised baseline and show significant improvements. Our empirical results demonstrate that unsupervised SHAP-based feature selection consistently improves the quality of our labels compared to the baseline unsupervised method. Lastly, unsupervised SHAP-based feature selection improves label quality relative to the full-feature dataset while reducing computational complexity. Our work provides an unsupervised framework capable of addressing the challenges of labeling highly imbalanced, unlabeled data while preserving data privacy, given the unsupervised nature of our methodology and its application of unsupervised SHAP-based feature selection.
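The labeling and evaluation steps named in the abstract can be illustrated with a minimal sketch. It assumes that an ensemble of unsupervised detectors has already produced one anomaly score per instance; the abstract does not specify the ensemble, the percentile used, or the exact thresholding rule, so the function names, the nearest-rank percentile, the 80th-percentile cutoff, and the toy scores below are all hypothetical and are not results from the paper. The MCC, Jaccard Index, and Precision formulas are the standard confusion-matrix definitions.

```python
import math

def percentile_threshold_labels(scores, percentile):
    """Assign label 1 to scores strictly above the given percentile, else 0.

    Uses the nearest-rank percentile (an assumption; the paper's exact
    thresholding rule is not given in the abstract).
    """
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    cutoff = ordered[rank - 1]
    return [1 if s > cutoff else 0 for s in scores]

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def jaccard_index(tp, fp, fn):
    return tp / (tp + fp + fn) if (tp + fp + fn) else 0.0

def matthews_corrcoef(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data for illustration only (hypothetical, not from the paper).
# `scores` stands in for ensemble anomaly scores; `actual` stands in for
# ground-truth labels, which are consulted only at evaluation time.
scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.25, 0.80, 0.05, 0.12, 0.22]
actual = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

generated = percentile_threshold_labels(scores, percentile=80)
tp, fp, fn, tn = confusion_counts(actual, generated)

print("generated labels:", generated)
print("Precision:", precision(tp, fp))
print("Jaccard Index:", jaccard_index(tp, fp, fn))
print("MCC:", matthews_corrcoef(tp, fp, fn, tn))
```

In this sketch the ground-truth labels never influence the threshold, which mirrors the abstract's statement that actual labels were used only for evaluation. All three metrics can be computed without the true-negative count dominating the result, except MCC, which uses all four confusion-matrix cells and is therefore often preferred under severe class imbalance.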
IEEE Access
Categories: COMPUTER SCIENCE, INFORMATION SYSTEMS; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 9.80
Self-citation rate: 7.70%
Articles published: 6673
Review time: 6 weeks
Journal description:
IEEE Access® is a multidisciplinary, open access (OA), applications-oriented, all-electronic archival journal that continuously presents the results of original research or development across all of IEEE's fields of interest.
IEEE Access will publish articles that are of high interest to readers, original, technically correct, and clearly presented. Supported by author publication charges (APC), its hallmarks are a rapid peer review and publication process with open access to all readers. Unlike IEEE's traditional Transactions or Journals, reviews are "binary", in that reviewers will either Accept or Reject an article in the form it is submitted in order to achieve rapid turnaround. Especially encouraged are submissions on:
Multidisciplinary topics, or applications-oriented articles and negative results that do not fit within the scope of IEEE's traditional journals.
Practical articles discussing new experiments or measurement techniques, interesting solutions to engineering.
Development of new or improved fabrication or manufacturing techniques.
Reviews or survey articles of new or evolving fields oriented to assist others in understanding the new area.