Title: From code to insight: studying code representation techniques for ML-based God class detection to support intelligent IDEs
Authors: Elmohanad Haroon, Khaled Tawfik Wassif, Lamia Abo Zaid
Journal: Automated Software Engineering, 32 (2)
Publication date: 2025-07-25
DOI: 10.1007/s10515-025-00534-4
Article page: https://link.springer.com/article/10.1007/s10515-025-00534-4
PDF: https://link.springer.com/content/pdf/10.1007/s10515-025-00534-4.pdf
Publication type: Journal Article
Open access: no
Impact factor: 3.1 (JCR Q3, Computer Science, Software Engineering)
Citations: 0
Abstract
In software development, detecting code smells is critical to maintaining code quality. The God class smell in particular is highly subjective to identify, because it depends on judgments about the levels of coupling and cohesion in a class. Automated code smell detection techniques aim to resolve this subjectivity. Machine learning (ML) techniques have shown promising results, tending to improve accuracy and reduce the bias associated with other techniques for God class identification. This is attributed to their pattern recognition capabilities, which make them more objective in identifying patterns that indicate code smells. However, current results still need improvement in both accuracy and generalizability. The challenge in using ML lies not only in selecting the most appropriate technique but also in effectively representing source code as the input patterns fed to ML classifiers. Code representation plays a pivotal role in encoding source code for ML algorithms. This study aims to improve the accuracy and generalizability of God class detection by exploring how different code representation techniques, namely tree-based, metric-based, code embedding, and token-based representations, affect ML detection results. The study is conducted on the MLCQ dataset and applies several ML algorithms: Logistic Regression, Random Forest, SVM, Decision Tree, Naive Bayes, Gradient Boosting, and XGBoost. The evaluation results show how different code representation techniques influence ML detection outcomes, and how the ML algorithms compare in performance. The F1-score achieved outperforms prior studies on the MLCQ dataset, indicating the effectiveness of the proposed approach. The results also show that the choice of code representation technique has a significant impact on ML classifier results. This paves the way for intelligent IDE plugins for just-in-time detection of God class and other code smells.
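To make the idea of "code representation" concrete, the following minimal Python sketch turns one toy Java class into two of the representation families named in the abstract, a token-based bag of tokens and a small metric-based vector, and then computes an F1-score for a made-up set of predictions. It is an illustration only and not the authors' pipeline: the snippet, the function names, the regular expressions, and the three metrics are all assumptions chosen for brevity, and no MLCQ data is used.

```python
import re
from collections import Counter

# Hypothetical toy input; NOT taken from the MLCQ dataset.
JAVA_SNIPPET = """
public class OrderManager {
    private int count;
    private String name;
    public void add(int x) { count += x; }
    public int total() { return count; }
}
"""


def token_representation(source: str) -> Counter:
    """Token-based view: count every identifier/keyword, ignoring punctuation."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source))


def metric_representation(source: str) -> dict:
    """Metric-based view: a few crude size metrics approximated with regexes.

    Assumption: members are declared as '<modifier> <type> <name>' on one line.
    Real metric tools parse the code instead of pattern-matching it.
    """
    non_blank = [ln for ln in source.splitlines() if ln.strip()]
    modifier = r"\b(?:public|private|protected)"
    methods = re.findall(modifier + r"\s+\w+\s+\w+\s*\(", source)
    fields = re.findall(modifier + r"\s+\w+\s+\w+\s*;", source)
    return {"loc": len(non_blank), "methods": len(methods), "fields": len(fields)}


def f1_score(y_true, y_pred) -> float:
    """F1 for the positive class (1 = God class), from true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_representation(JAVA_SNIPPET).most_common(3))
    print(metric_representation(JAVA_SNIPPET))
    # Made-up labels, only to exercise the metric.
    print(f1_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

The same class yields very different inputs for a classifier depending on the representation: the token view keeps vocabulary but discards structure, while the metric view keeps a handful of size indicators and discards names. Differences of this kind are what the study compares across classifiers.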
About the journal:
This journal publishes research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems, as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences, and workshops.