Multi-Modal Code Summarization with Retrieved Summary

Lile Lin, Zhiqiu Huang, Yaoshen Yu, Ya-Ping Liu
DOI: 10.1109/SCAM55253.2022.00020
Venue: 2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)
Publication date: 2022-10-01
Citations: 0

Abstract

A high-quality code summary concisely describes the functionality and purpose of a code snippet, which is key to program comprehension. Automatic code summarization aims to generate natural language summaries from code snippets automatically, which can save developers time and improve efficiency in development and maintenance. Recently, researchers have mainly used neural machine translation (NMT) based approaches to address this task: they apply a neural model to translate code snippets into natural language summaries. However, the performance of existing NMT-based approaches is limited. Although a summary and a code snippet are semantically related, they may not share common lexical tokens or language structures. Such a semantic gap between code and summaries hinders the effectiveness of NMT-based models, and using code tokens alone to represent a code snippet cannot help them overcome it. To solve this problem, in this paper we propose a code summarization approach that incorporates the lexical, syntactic, and semantic modalities of code. We treat code tokens as the lexical modality and the abstract syntax tree (AST) as the syntactic modality. To obtain the semantic modality, inspired by translation memory (TM) in NMT, we use an information retrieval (IR) technique to retrieve a relevant summary for a code snippet to describe its functionality. We propose a novel approach based on contrastive learning to build a retrieval model that retrieves semantically similar summaries. Our approach learns and fuses these different modalities using a Transformer. We evaluate our approach on a large Java dataset; experimental results show that it outperforms state-of-the-art approaches on the automatic evaluation metrics BLEU, ROUGE, and METEOR by 10%, 8%, and 9%, respectively.
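The retrieval component described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the embeddings are stand-ins for the outputs of trained code and summary encoders, and the `info_nce_loss` and `retrieve_summary` helpers are hypothetical names. It shows the two ingredients the abstract names — a contrastive (InfoNCE-style) training objective that pulls paired code/summary embeddings together, and nearest-neighbor retrieval of a semantically similar summary at inference time.

```python
import numpy as np

def l2_normalize(x):
    """Normalize rows to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce_loss(code_emb, summary_emb, temperature=0.07):
    """Contrastive loss over a batch: the i-th summary is the positive
    for the i-th code snippet; all other summaries act as negatives."""
    c = l2_normalize(code_emb)
    s = l2_normalize(summary_emb)
    logits = c @ s.T / temperature  # pairwise cosine similarities, scaled
    # Row-wise cross-entropy with the diagonal as the target class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -log_probs[idx, idx].mean()

def retrieve_summary(query_emb, corpus_embs, summaries):
    """Return the stored summary whose embedding is closest to the query."""
    sims = l2_normalize(corpus_embs) @ l2_normalize(query_emb)
    return summaries[int(np.argmax(sims))]

# Toy data: each summary embedding is a slightly perturbed copy of its
# paired code embedding, mimicking a well-trained contrastive encoder.
rng = np.random.default_rng(0)
code_emb = rng.normal(size=(4, 8))
summary_emb = code_emb + 0.01 * rng.normal(size=(4, 8))

print(info_nce_loss(code_emb, summary_emb))  # small: pairs are aligned
print(retrieve_summary(code_emb[2], summary_emb, ["a", "b", "c", "d"]))
```

In the paper's pipeline the retrieved summary then serves as the semantic modality and is fused with the token and AST modalities by the Transformer; the sketch stops at retrieval.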