DIR-SMOTE: a density-influence resampling framework for imbalanced code smell detection

IF 3.1, Zone 2 (Computer Science), JCR Q3 (Computer Science, Software Engineering)

Automated Software Engineering Pub Date : 2026-05-04 DOI:10.1007/s10515-026-00624-x

Ruchika Malhotra, Bhawna Jain, Marouane Kessentini

Volume 33, issue 3. Full text: https://link.springer.com/article/10.1007/s10515-026-00624-x

Citations: 0

Abstract

Code smell detection is vital for ensuring software quality, but the imbalance between smelly and non-smelly code instances impairs detection, especially for minority smells such as Data Class and Feature Envy. Existing oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), Borderline-SMOTE (BL-SMOTE), and Adaptive Synthetic sampling (ADASYN), attempt to mitigate this issue but often introduce noise or semantically irrelevant samples. This study proposes DIR-SMOTE (Density and Influence-based Resampling using SMOTE), a density- and explanation-guided resampling framework that integrates local density estimation with SHapley Additive exPlanations (SHAP)-based feature importance to improve the quality of synthetic minority samples. First, DIR-SMOTE filters out noisy or isolated minority instances using density metrics. It then employs SHAP to identify the most influential features per instance. Finally, synthetic samples are generated by interpolating between dense neighbors while perturbing only top-ranked SHAP features, thereby preserving semantic integrity. DIR-SMOTE is evaluated on five benchmark datasets (Apache, jEdit, EDTForCSD, DesigniteJava, and MLCQ) across multiple smells, including Long Method, Feature Envy, and Data Class. Compared with nine standard resampling methods, DIR-SMOTE achieves up to 6.7% improvement in F1-score and 5.1% in precision, consistently enhancing smelly-code detection in both binary and multiclass settings. Rather than relying on complex generative models, DIR-SMOTE focuses on explanation-guided and density-aware sample generation that remains transparent and computationally efficient. Overall, it offers a lightweight and robust solution that can be integrated into practical quality assurance workflows, including automated smell detection tools and IDE-based analyzers.
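The three steps named in the abstract (density filtering, per-instance feature ranking, and interpolation restricted to top-ranked features) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation; the abstract does not give the density metric, the thresholds, or the exact perturbation rule. Everything specific here is an assumption made for illustration: density is taken as the inverse mean distance to the k nearest minority neighbors, the `importance` argument is a precomputed table standing in for per-instance SHAP values, and the names `dir_smote_sketch`, `keep_frac`, and `top_m` are hypothetical.

```python
import math
import random


def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], excluding i itself."""
    dists = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]


def local_density(points, i, k):
    """Assumed density metric: inverse mean distance to the k nearest neighbours."""
    nbrs = knn(points, i, k)
    mean_d = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / (mean_d + 1e-12)


def dir_smote_sketch(minority, importance, n_new, k=3, top_m=2, keep_frac=0.8, seed=0):
    """Generate n_new synthetic minority samples (illustrative only).

    minority   : list of feature vectors for the minority class
    importance : per-instance feature scores, standing in for SHAP values
    """
    if len(minority) <= k:
        raise ValueError("need more than k minority samples")
    rng = random.Random(seed)

    # Step 1: drop the sparsest (noisy or isolated) minority instances.
    dens = [local_density(minority, i, k) for i in range(len(minority))]
    n_keep = max(k + 1, int(keep_frac * len(minority)))
    kept = sorted(range(len(minority)), key=lambda i: dens[i], reverse=True)[:n_keep]
    dense = [minority[i] for i in kept]
    dense_imp = [importance[i] for i in kept]

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(dense))      # base instance
        j = rng.choice(knn(dense, i, k))   # one of its dense neighbours
        # Step 2: rank the base instance's features by absolute importance.
        n_feat = len(dense[i])
        top = sorted(range(n_feat), key=lambda f: abs(dense_imp[i][f]), reverse=True)[:top_m]
        # Step 3: interpolate only the top-ranked features; copy the rest unchanged.
        gap = rng.random()
        new = list(dense[i])
        for f in top:
            new[f] = dense[i][f] + gap * (dense[j][f] - dense[i][f])
        synthetic.append(new)
    return synthetic


# Toy data: five clustered minority points and one isolated outlier.
minority = [
    [1.0, 1.0, 5.0], [1.2, 0.9, 5.0], [0.9, 1.1, 5.0],
    [1.1, 1.2, 5.0], [1.0, 0.8, 5.0], [9.0, 9.0, 50.0],
]
# Hypothetical importance scores: features 0 and 1 dominate for every instance.
importance = [[0.5, 0.3, 0.01] for _ in minority]
synthetic = dir_smote_sketch(minority, importance, n_new=10)
```

In this toy run the isolated point is removed by the density filter, so every synthetic sample is built from the cluster, and the low-importance third feature is copied from the base instance rather than interpolated. In a real pipeline the importance table would come from an explainer fitted on a trained classifier, which is outside the scope of this sketch.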

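Source journal

Automated Software Engineering (Engineering & Technology, Computer Science: Software Engineering)

CiteScore

4.80

Self-citation rate

11.80%

Articles published

Review time

>12 weeks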

About the journal: This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.