Esale: Enhancing Code-Summary Alignment Learning for Source Code Summarization

IF 5.6 · Zone 1 (Computer Science) · JCR Q1: COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering · Pub Date: 2024-07-03 · DOI: 10.1109/TSE.2024.3422274

Chunrong Fang; Weisong Sun; Yuchen Chen; Xiao Chen; Zhao Wei; Quanjun Zhang; Yudu You; Bin Luo; Yang Liu; Zhenyu Chen

Volume 50, Issue 8, pp. 2077-2095 · https://ieeexplore.ieee.org/document/10584357/

Citations: 0

Abstract

(Source) code summarization aims to automatically generate succinct natural language summaries for given code snippets. Such summaries play a significant role in helping developers understand and maintain code. Inspired by neural machine translation, deep learning-based code summarization techniques widely adopt an encoder-decoder framework, where the encoder transforms given code snippets into context vectors and the decoder decodes the context vectors into summaries. Recently, large-scale pre-trained models for source code (e.g., CodeBERT and UniXcoder), equipped with encoders capable of producing general context vectors, have achieved substantial improvements on the code summarization task. However, although they are usually trained on code-focused tasks and can capture general code features, they still fall short in capturing the specific features that need to be summarized. In a nutshell, they fail to learn the alignment between code snippets and summaries (code-summary alignment for short). In this paper, we propose a novel approach to improve code summarization based on summary-focused tasks. Specifically, we exploit a multi-task learning paradigm to train the encoder on three summary-focused tasks, namely unidirectional language modeling (ULM), masked language modeling (MLM), and action word prediction (AWP), to enhance its ability to learn code-summary alignment. Unlike pre-trained models that mainly predict masked tokens in code snippets, we design ULM and MLM to predict masked words in summaries. Intuitively, predicting summary words based on given code snippets helps the encoder learn the code-summary alignment. In addition, existing work shows that AWP affects the prediction of the entire summary. Therefore, we further introduce the domain-specific task AWP to enhance the ability of the encoder to learn the alignment between action words and code snippets. We evaluate the effectiveness of our approach, called Esale, by conducting extensive experiments on four datasets: two widely used datasets, JCSD and PCSD; a cross-project Java dataset, CPJD; and a multilingual dataset, CodeSearchNet. Experimental results show that Esale significantly outperforms state-of-the-art baselines on all three widely used metrics: BLEU, METEOR, and ROUGE-L. Moreover, the human evaluation shows that the summaries generated by Esale are more informative and closer to the ground-truth summaries.
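
The three summary-focused tasks named in the abstract can be illustrated with a minimal sketch of how training targets might be derived from one (code, summary) pair. This is not the authors' implementation: the abstract gives no such details, so the function names, the `[MASK]` symbol, the 15% mask rate, the left-to-right reading of ULM, and the choice of the first summary word as the action word are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def ulm_examples(summary: List[str]) -> List[Tuple[List[str], str]]:
    """Unidirectional LM (assumed form): predict each word from its left context."""
    return [(summary[:i], summary[i]) for i in range(len(summary))]


def mlm_example(
    summary: List[str], mask_rate: float = 0.15, rng: Optional[random.Random] = None
) -> Tuple[List[str], Dict[int, str]]:
    """Masked LM: hide a random subset of summary words.

    Returns the masked summary and a {position: original word} mapping.
    The 15% rate is an assumption; at least one word is always masked.
    """
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(summary) * mask_rate))
    positions = sorted(rng.sample(range(len(summary)), n_mask))
    masked = list(summary)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


def awp_label(summary: List[str]) -> str:
    """Action word: ASSUMED here to be the first summary word, lowercased."""
    return summary[0].lower()


def build_tasks(
    code_tokens: List[str], summary: List[str], rng: Optional[random.Random] = None
) -> dict:
    """Bundle the targets of all three tasks; each is conditioned on the same code."""
    if not summary:
        raise ValueError("summary must contain at least one word")
    return {
        "code": code_tokens,
        "ulm": ulm_examples(summary),
        "mlm": mlm_example(summary, rng=rng),
        "awp": awp_label(summary),
    }


if __name__ == "__main__":
    code = "public int add ( int a , int b ) { return a + b ; }".split()
    summary = "Returns the sum of two integers".split()
    tasks = build_tasks(code, summary, rng=random.Random(0))
    print(tasks["awp"])     # action-word target
    print(tasks["mlm"][0])  # summary with one word masked
```

In a multi-task setup, an encoder would see the code tokens together with the partially hidden summary, and the per-task losses would be combined into one training objective; how the tasks are weighted is not stated in the abstract.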


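Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.70
Self-citation rate: 10.80%
Articles published: 724
Review time: 6 months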

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.