Cross-Modal Retrieval-enhanced code Summarization based on joint learning for retrieval and generation

IF 3.8 2区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology Pub Date : 2024-07-18 DOI:10.1016/j.infsof.2024.107527

Lixuan Li , Bin Liang , Lin Chen , Xiaofang Zhang

{"title":"Cross-Modal Retrieval-enhanced code Summarization based on joint learning for retrieval and generation","authors":"Lixuan Li , Bin Liang , Lin Chen , Xiaofang Zhang","doi":"10.1016/j.infsof.2024.107527","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><p>Code summarization refers to a task that automatically generates a natural language description of a code snippet to facilitate code comprehension. Existing methods have achieved satisfactory results by incorporating information retrieval into generative deep-learning models for reusing summaries of existing code. However, most of these existing methods employed non-learnable generic retrieval methods for content-based retrieval, resulting in a lack of diversity in the retrieved results during training, thereby making the model over-reliant on retrieved results and reducing the generative model’s ability to generalize to unknown samples.</p></div><div><h3>Objective:</h3><p>To address this issue, this paper introduces CMR-Sum: a novel Cross-Modal Retrieval-enhanced code Summarization framework based on joint learning for generation and retrieval tasks, where both two tasks are allowed to be optimized simultaneously.</p></div><div><h3>Method:</h3><p>Specifically, we use a cross-modal retrieval module to dynamically alter retrieval results during training, which enhances the diversity of the retrieved results and maintains a relative balance between the two tasks. Furthermore, in the summary generation phase, we employ a cross-attention mechanism to generate code summaries based on the alignment between retrieved and generated summaries. We conducted experiments on three real-world datasets, comparing the performance of our method with baseline models. Additionally, we performed extensive qualitative analysis.</p></div><div><h3>Result:</h3><p>Results from qualitative and quantitative experiments indicate that our approach effectively enhances the performance of code summarization. Our method outperforms both the generation-based and the retrieval-enhanced baselines. Further ablation experiments demonstrate the effectiveness of each component of our method. Results from sensitivity analysis experiments suggest that our approach achieves good performance without requiring extensive hyper-parameter search.</p></div><div><h3>Conclusion:</h3><p>The direction of utilizing retrieval-enhanced generation tasks shows great potential. It is essential to increase the diversity of retrieval results during the training process, which is crucial for improving the generality and the performance of the model.</p></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"175 ","pages":"Article 107527"},"PeriodicalIF":3.8000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950584924001320","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Context:

Code summarization refers to a task that automatically generates a natural language description of a code snippet to facilitate code comprehension. Existing methods have achieved satisfactory results by incorporating information retrieval into generative deep-learning models for reusing summaries of existing code. However, most of these existing methods employed non-learnable generic retrieval methods for content-based retrieval, resulting in a lack of diversity in the retrieved results during training, thereby making the model over-reliant on retrieved results and reducing the generative model’s ability to generalize to unknown samples.

Objective:

To address this issue, this paper introduces CMR-Sum: a novel Cross-Modal Retrieval-enhanced code Summarization framework based on joint learning for generation and retrieval tasks, where both two tasks are allowed to be optimized simultaneously.

Method:

Specifically, we use a cross-modal retrieval module to dynamically alter retrieval results during training, which enhances the diversity of the retrieved results and maintains a relative balance between the two tasks. Furthermore, in the summary generation phase, we employ a cross-attention mechanism to generate code summaries based on the alignment between retrieved and generated summaries. We conducted experiments on three real-world datasets, comparing the performance of our method with baseline models. Additionally, we performed extensive qualitative analysis.

Result:

Results from qualitative and quantitative experiments indicate that our approach effectively enhances the performance of code summarization. Our method outperforms both the generation-based and the retrieval-enhanced baselines. Further ablation experiments demonstrate the effectiveness of each component of our method. Results from sensitivity analysis experiments suggest that our approach achieves good performance without requiring extensive hyper-parameter search.

Conclusion:

The direction of utilizing retrieval-enhanced generation tasks shows great potential. It is essential to increase the diversity of retrieval results during the training process, which is crucial for improving the generality and the performance of the model.

查看原文本刊更多论文

基于检索和生成联合学习的跨模态检索增强代码摘要

背景：代码摘要是指自动生成代码片段的自然语言描述以促进代码理解的任务。现有的方法通过将信息检索纳入生成式深度学习模型来重用现有代码摘要，取得了令人满意的效果。然而，这些现有方法大多采用不可学习的通用检索方法进行基于内容的检索，导致训练过程中检索结果缺乏多样性，从而使模型过度依赖检索结果，降低了生成模型对未知样本的泛化能力。方法：具体来说，我们使用一个跨模态检索模块，在训练过程中动态改变检索结果，从而增强检索结果的多样性，并保持两个任务之间的相对平衡。此外，在摘要生成阶段，我们采用交叉关注机制，根据检索到的摘要和生成的摘要之间的一致性生成代码摘要。我们在三个实际数据集上进行了实验，比较了我们的方法与基准模型的性能。结果：定性和定量实验的结果表明，我们的方法有效地提高了代码摘要的性能。我们的方法优于基于生成和检索增强的基线方法。进一步的消融实验证明了我们方法每个组成部分的有效性。敏感性分析实验的结果表明，我们的方法无需进行大量的超参数搜索即可实现良好的性能。在训练过程中，必须增加检索结果的多样性，这对提高模型的通用性和性能至关重要。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Information and Software Technology 工程技术-计算机：软件工程

CiteScore

9.10

自引率

7.70%

发文量

164

审稿时长

9.6 weeks

期刊介绍： Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal''s scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include: • Software management, quality and metrics, • Software processes, • Software architecture, modelling, specification, design and programming • Functional and non-functional software requirements • Software testing and verification & validation • Empirical studies of all aspects of engineering and managing software development Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.