BashExplainer: Retrieval-Augmented Bash Code Comment Generation based on Fine-tuned CodeBERT

2022 IEEE International Conference on Software Maintenance and Evolution (ICSME). Pub Date: 2022-06-27. DOI: 10.1109/ICSME55016.2022.00016

Chi Yu, Guang Yang, Xiang Chen, Ke Liu, Yanlin Zhou


Citations: 11

Abstract

Developers use shell commands for many tasks, such as file system management, network control, and process management. Bash is one of the most commonly used shells and plays an important role in Linux system development and maintenance. Because Bash is a flexible language, developers who are not familiar with it often have difficulty understanding the purpose and functionality of Bash code. In this study, we investigate the Bash code comment generation problem and propose an automatic method, BASHEXPLAINER, based on a two-stage training strategy. In the first stage, we train a Bash encoder by fine-tuning CodeBERT on our constructed Bash code corpus. In the second stage, we first retrieve the code most similar to the target code from the code repository, based on semantic and lexical similarity. Then we use the trained Bash encoder to generate two vector representations. Finally, we fuse these two vector representations via the fusion layer and generate the code comment through the decoder. To show the competitiveness of our proposed method, we construct a high-quality corpus by combining the corpus shared in the previous NL2Bash study and the corpus shared in the NLC2CMD competition. This corpus contains 10,592 Bash code snippets and their corresponding comments. We then select ten baselines from previous studies on automatic code comment generation, covering information retrieval methods, deep learning methods, and hybrid methods. The experimental results show that, in terms of the performance measures BLEU-3, BLEU-4, METEOR, and ROUGE-L, BASHEXPLAINER outperforms all baselines by at least 8.75%, 9.29%, 4.77%, and 3.86%, respectively. We then design ablation experiments to show the rationality of the component settings of BASHEXPLAINER. Later, we conduct a human study to further show the competitiveness of BASHEXPLAINER. Finally, we develop a browser plug-in based on BASHEXPLAINER to help developers understand Bash code.
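
The retrieve-then-fuse idea in the second stage can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which similarity measures or fusion layer are used, so everything here is an assumption made for illustration. In particular, `embed` is a bag-of-tokens stand-in for the fine-tuned encoder, semantic similarity is approximated by cosine similarity over those counts, lexical similarity by Jaccard overlap of tokens, and `fuse` is a plain weighted average. The names `retrieve`, `fuse`, `alpha`, and the toy `REPOSITORY` are hypothetical.

```python
import math
from collections import Counter

# Toy code repository of (Bash code, comment) pairs; purely illustrative data.
REPOSITORY = [
    ("find . -name '*.txt'", "Find all .txt files under the current directory"),
    ("ls -la /tmp", "List all files in /tmp in long format"),
    ("grep -r TODO src", "Recursively search for TODO in src"),
]


def tokenize(code):
    """Split a command on whitespace (a deliberately naive tokenizer)."""
    return code.split()


def embed(code):
    """Stand-in for the trained encoder: a sparse vector of token counts."""
    return Counter(tokenize(code))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Lexical similarity: overlap of the two token sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def retrieve(target, repository, alpha=0.5):
    """Return the (code, comment) pair with the highest combined score.

    alpha weights the semantic score against the lexical score; the
    linear combination is an assumption, not taken from the paper.
    """
    def score(pair):
        code, _ = pair
        semantic = cosine(embed(target), embed(code))
        lexical = jaccard(tokenize(target), tokenize(code))
        return alpha * semantic + (1 - alpha) * lexical

    return max(repository, key=score)


def fuse(u, v, weight=0.5):
    """Fuse two equal-length dense vectors by weighted averaging
    (a simple stand-in for a learned fusion layer)."""
    return [weight * x + (1 - weight) * y for x, y in zip(u, v)]


if __name__ == "__main__":
    similar_code, similar_comment = retrieve("find . -name '*.log'", REPOSITORY)
    print(similar_code, "->", similar_comment)
    print(fuse([1.0, 0.0], [0.0, 1.0]))
```

In a real system the two representations would come from the trained encoder and the fused vector would be passed to the decoder that generates the comment; the sketch only shows how retrieval and fusion fit together.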

