Deep Multimodal Architecture for Detection of Long Parameter List and Switch Statements using DistilBERT

2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM). Pub Date: 2022-10-01. DOI: 10.1109/SCAM55253.2022.00018

Anushka Bhave, Roopak Sinha


Citations: 1

Abstract

Code smell detection and refactoring are crucial to sustaining quality, reducing complexity, and increasing the efficiency of a software application. Code smells are observable patterns in the source code of a program that indicate deeper structural issues. Most traditional methods for code smell classification rely exclusively on structural object-oriented metrics and manually designed heuristics. We propose a novel multimodal deep learning approach that combines structural and semantic information to detect two commonly encountered code smells: Long Parameter Lists and Switch Statements. The presented architecture applies transfer learning on DistilBERT to generate vector embeddings representing classes and methods. These embeddings are concatenated with numerical metrics for joint feature extraction using a CNN, which builds a complex mapping between the features and predicts the output as smelly or non-smelly. For a holistic comparative analysis, we also implement two multimodal machine learning pipelines: the first employs a scikit-learn TF-IDF vectorizer with a Random Forest classifier, and the second merges a CNN with a Bi-LSTM. Our approach achieves an accuracy of 91.2% in experimental evaluation, outperforming state-of-the-art techniques.
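The fusion step described in the abstract, a text-derived vector concatenated with numerical metrics, can be illustrated with a minimal sketch. This is not the authors' implementation: a hand-rolled TF-IDF stands in for the DistilBERT embeddings so that the example needs only the Python standard library, and the two metrics (parameter count and number of `case` labels), the helper names, and the sample snippets are all hypothetical choices made for illustration. No classifier is trained here; the sketch stops at the fused feature vector that a CNN or Random Forest would consume.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(code):
    """Split a code snippet into lowercase identifier/keyword tokens."""
    return [t.lower() for t in TOKEN_RE.findall(code)]


def tfidf_vectors(snippets):
    """Return (vocabulary, one TF-IDF vector per snippet), using a smoothed IDF.

    Assumption: this replaces the DistilBERT embedding step from the abstract.
    """
    docs = [Counter(tokenize(s)) for s in snippets]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    idf = {
        t: math.log((1 + n) / (1 + sum(t in d for d in docs))) + 1.0
        for t in vocab
    }
    vectors = []
    for d in docs:
        total = sum(d.values()) or 1
        vectors.append([d[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def structural_metrics(code):
    """Two hypothetical numeric metrics: the number of parameters in the
    first parenthesised list, and the number of `case` labels."""
    m = re.search(r"\(([^)]*)\)", code)
    params = [p for p in m.group(1).split(",") if p.strip()] if m else []
    cases = len(re.findall(r"\bcase\b", code))
    return [float(len(params)), float(cases)]


def fuse(snippets):
    """Concatenate each snippet's text vector with its numeric metrics."""
    _, text_vecs = tfidf_vectors(snippets)
    return [tv + structural_metrics(s) for tv, s in zip(text_vecs, snippets)]


# Illustrative inputs only; these are not drawn from the paper's dataset.
SNIPPETS = [
    "void draw(int x, int y, int w, int h, int color, int style) { }",
    "int f(int k) { switch (k) { case 1: return 1; case 2: return 2; "
    "default: return 0; } }",
]

vocab, _ = tfidf_vectors(SNIPPETS)
fused = fuse(SNIPPETS)
```

Each row of `fused` holds the semantic features first and the structural metrics in the last two positions, so a downstream model sees both modalities in a single input.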


