From code to insight: studying code representation techniques for ML-based God class detection to support intelligent IDEs

IF 3.1; Zone 2, Computer Science; Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Automated Software Engineering. Pub Date: 2025-07-25. DOI: 10.1007/s10515-025-00534-4

Elmohanad Haroon, Khaled Tawfik Wassif, Lamia Abo Zaid

Automated Software Engineering, volume 32, issue 2. Journal article; not open access. Full text: https://link.springer.com/article/10.1007/s10515-025-00534-4 (PDF: https://link.springer.com/content/pdf/10.1007/s10515-025-00534-4.pdf)

Citations: 0

Abstract

In software development, detecting code smells is critical to maintaining code quality. The God class smell in particular is highly subjective, because identifying it depends on judging the levels of coupling and cohesion in a class. Automated code smell detection techniques aim to resolve this subjectivity. Machine learning (ML) techniques have shown promising results, tending to improve accuracy and reduce the bias associated with other God class identification techniques. Their pattern recognition capabilities make them more objective in identifying patterns that indicate code smells. However, current results still need improvement in both accuracy and generalizability. The challenge in using ML lies not only in selecting the most appropriate technique but also in effectively representing source code as input patterns for the ML classifier(s). Code representation plays a pivotal role in encoding source code for ML algorithms. This study aims to improve the accuracy and generalizability of God class detection by exploring how various code representation techniques (tree-based, metric-based, code embedding, and token-based) affect ML detection results. The study is conducted on the MLCQ dataset and applies several ML algorithms: Logistic Regression, Random Forest, SVM, Decision Tree, Naive Bayes, Gradient Boosting, and XGBoost. The evaluation results show how different code representation techniques influence ML detection outcomes, as well as the comparative performance of the ML algorithms. The F1-score achieved outperforms prior studies on the MLCQ dataset, indicating the effectiveness of the proposed approach. The results also show that the choice of code representation technique has a significant impact on the ML classifier results. This paves the way for developing intelligent IDE plugins for just-in-time detection of God class and other code smells.
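To make the idea of alternative code representations concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a toy Python class as input (the MLCQ dataset itself consists of Java code) and uses only the Python standard library. The names `SAMPLE`, `tree_features`, `metric_features`, `token_features`, and `f1_score` are hypothetical, and the chosen metrics are illustrative rather than those used in the study. Code embeddings are omitted because they require a pretrained model.

```python
import ast
import io
import tokenize
from collections import Counter

# Hypothetical toy input; the study itself works on Java classes from MLCQ.
SAMPLE = '''
class OrderManager:
    def __init__(self):
        self.orders = []
        self.log = []

    def add(self, order):
        self.orders.append(order)
        self.log.append("add")

    def total(self):
        return sum(o for o in self.orders)
'''


def tree_features(source: str) -> Counter:
    """Tree-based view (simplified): counts of AST node types."""
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))


def metric_features(source: str) -> dict:
    """Metric-based view: a few illustrative size metrics for the first class."""
    tree = ast.parse(source)
    cls = next(n for n in ast.walk(tree) if isinstance(n, ast.ClassDef))
    methods = [n for n in cls.body if isinstance(n, ast.FunctionDef)]
    # Instance attributes are approximated as names assigned through `self.<name> = ...`.
    attributes = {
        n.attr
        for n in ast.walk(cls)
        if isinstance(n, ast.Attribute)
        and isinstance(n.ctx, ast.Store)
        and isinstance(n.value, ast.Name)
        and n.value.id == "self"
    }
    loc = sum(1 for line in source.splitlines() if line.strip())
    return {"methods": len(methods), "attributes": len(attributes), "loc": loc}


def token_features(source: str) -> Counter:
    """Token-based view: bag of identifier and keyword tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return Counter(t.string for t in tokens if t.type == tokenize.NAME)


def f1_score(y_true, y_pred) -> float:
    """F1 for the positive class (1 = God class), the measure named in the abstract."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(metric_features(SAMPLE))
    print(tree_features(SAMPLE).most_common(3))
    print(token_features(SAMPLE).most_common(3))
```

Each function turns the same class into a different feature set. In a setup like the one the abstract describes, such features would be vectorized and passed to a classifier such as Logistic Regression or Random Forest, and the resulting F1-scores compared across representations.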


Source journal

Automated Software Engineering (Engineering and Technology; Computer Science: Software Engineering)

CiteScore

4.80

Self-citation rate

11.80%

Articles published

Review time

>12 weeks

Journal description: This journal details research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.