Automatic Code Summarization Using Abbreviation Expansion and Subword Segmentation

IF 3 · Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems Pub Date : 2025-01-08 DOI:10.1111/exsy.13835

Yu-Guo Liang, Gui-Sheng Fan, Hui-Qun Yu, Ming-Chen Li, Zi-Jie Huang

Journal: Expert Systems, volume 42, issue 2. Publication type: Journal Article. Publisher page: https://onlinelibrary.wiley.com/doi/10.1111/exsy.13835

Citations: 0

Abstract

Automatic code summarization refers to generating concise natural language descriptions for code snippets. It is vital for improving the efficiency of program understanding among software developers and maintainers. Despite the impressive strides made by deep learning-based methods, limitations still exist in their ability to understand and model semantic information due to the unique nature of programming languages. We propose two methods to boost code summarization models: context-based abbreviation expansion and unigram language model-based subword segmentation. We use heuristics to expand abbreviations within identifiers, reducing semantic ambiguity and improving the language alignment of code summarization models. Furthermore, we leverage subword segmentation to tokenize code into finer subword sequences, providing more semantic information during training and inference, thereby enhancing program understanding. These methods are model-agnostic and can be readily integrated into existing automatic code summarization approaches. Experiments conducted on two widely used Java code summarization datasets demonstrated the effectiveness of our approach. Specifically, by fusing original and modified code representations into the Transformer model, our Semantic Enhanced Transformer for Code Summarization (SETCS) serves as a robust semantic-level baseline. By simply modifying the datasets, our methods achieved performance improvements of up to 7.3%, 10.0%, 6.7%, and 3.2% for representative code summarization models in terms of BLEU-4, METEOR, ROUGE-L, and SIDE, respectively.
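The two preprocessing ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their heuristics, dictionary, or vocabulary, so the abbreviation table, the "prefer a candidate that appears in the surrounding context" rule, the toy unigram probabilities, and all function names below are assumptions made purely for illustration. The segmenter uses a standard Viterbi search over a unigram vocabulary.

```python
import math
import re

# Hypothetical abbreviation table (assumption; the paper's own
# dictionary and heuristics are not described in the abstract).
ABBREVIATIONS = {
    "msg": ["message"],
    "cnt": ["count", "content"],
    "idx": ["index"],
    "buf": ["buffer"],
}

def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]

def expand_abbreviations(words, context):
    """Expand known abbreviations, preferring a candidate seen in the context."""
    expanded = []
    for word in words:
        candidates = ABBREVIATIONS.get(word)
        if not candidates:
            expanded.append(word)  # not a known abbreviation: keep as-is
            continue
        in_context = [c for c in candidates if c in context]
        expanded.append(in_context[0] if in_context else candidates[0])
    return expanded

# Toy unigram vocabulary with made-up probabilities (assumption).
UNIGRAM_LOGP = {piece: math.log(p) for piece, p in {
    "message": 0.05, "mess": 0.02, "age": 0.02, "count": 0.05,
    "er": 0.03, "counter": 0.01, "s": 0.04,
}.items()}
UNKNOWN_LOGP = math.log(1e-6)  # fallback score for a single unknown character

def segment(word):
    """Return the most probable segmentation of word under the unigram model."""
    n = len(word)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in UNIGRAM_LOGP:
                logp = UNIGRAM_LOGP[piece]
            elif end - start == 1:
                logp = UNKNOWN_LOGP
            else:
                continue
            score = best[start] + logp
            if score > best[end]:
                best[end] = score
                back[end] = start
    pieces = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the word
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]
```

For example, expand_abbreviations(split_identifier("msgCnt"), {"content"}) picks "content" over "count" for "cnt" because only "content" occurs in the supplied context, and segment("counters") splits into "counter" and "s" because that path has the highest total probability under the toy vocabulary. In practice a unigram vocabulary would be learned from the corpus rather than written by hand.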




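Source journal

Expert Systems (Engineering & Technology; Computer Science: Theory & Methods)

CiteScore: 7.40

Self-citation rate: 6.10%

Articles published: 266

Review time: 24 months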

About the journal: Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper. As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.