{"title":"Multi-Modal Code Summarization with Retrieved Summary","authors":"Lile Lin, Zhiqiu Huang, Yaoshen Yu, Ya-Ping Liu","doi":"10.1109/SCAM55253.2022.00020","DOIUrl":null,"url":null,"abstract":"A high-quality code summary describes the functionality and purpose of a code snippet concisely, which is key to program comprehension. Automatic code summarization aims to generate natural language summaries from code snippets automatically, which can save developers time and improve efficiency in development and maintenance. Recently, researchers mainly use neural machine translation (NMT) based approaches to fill this task. They apply a neural model to translate code snippets into natural language summaries. However, the performance of existing NMT-based approaches is limited. Although a summary and a code snippet are semantically related, they may not share common lexical tokens or language structures. Such a semantic gap between codes and summaries hinders the effect of NMT-based models. Only using code tokens to represent a code snippet cannot help NMT-based models overcome this gap. To solve this problem, in this paper, we propose a code summarization approach that incorporates lexical, syntactic and semantic modalities of codes. We treat code tokens as the lexical modality and the abstract syntax tree (AST) as the syntactic modality. To obtain the semantic modality, inspired by translation memory (TM) in NMT, we use the information retrieval (IR) technique to retrieve a relevant summary for a code snippet to describe its functionality. We propose a novel approach based on contrastive learning to build a retrieval model to retrieve semantically similar summaries. Our approach learns and fuses those different modalities using Transformer. We evaluate our approach on a large Java dataset, experiment results show that our approach outperforms the state-of-the-art approaches on automatic evaluation metrics BLEU, ROUGE and METEOR by 10%, 8% and 9%.","PeriodicalId":138287,"journal":{"name":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 22nd International Working Conference on Source Code Analysis and Manipulation (SCAM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SCAM55253.2022.00020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
A high-quality code summary concisely describes the functionality and purpose of a code snippet, which is key to program comprehension. Automatic code summarization aims to generate natural language summaries from code snippets automatically, saving developers time and improving efficiency in development and maintenance. Recently, researchers have mainly used neural machine translation (NMT) based approaches for this task, applying a neural model to translate code snippets into natural language summaries. However, the performance of existing NMT-based approaches is limited. Although a summary and a code snippet are semantically related, they may not share common lexical tokens or language structures. This semantic gap between code and summaries hinders NMT-based models, and representing a code snippet by its code tokens alone cannot help them overcome it. To solve this problem, in this paper we propose a code summarization approach that incorporates the lexical, syntactic, and semantic modalities of code. We treat code tokens as the lexical modality and the abstract syntax tree (AST) as the syntactic modality. To obtain the semantic modality, inspired by translation memory (TM) in NMT, we use an information retrieval (IR) technique to retrieve a relevant summary for a code snippet that describes its functionality. We propose a novel approach based on contrastive learning to build a retrieval model that retrieves semantically similar summaries. Our approach learns and fuses these modalities with a Transformer. We evaluate our approach on a large Java dataset; experimental results show that it outperforms state-of-the-art approaches on the automatic evaluation metrics BLEU, ROUGE, and METEOR by 10%, 8%, and 9%, respectively.
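The contrastive retrieval step the abstract describes can be pictured with a small sketch. The following is a minimal, hypothetical illustration, not the authors' implementation: a code encoder and a summary encoder are trained so that matching (code, summary) pairs land close together in a shared embedding space, and at inference time the stored summary nearest to a query snippet is retrieved. The bag-of-tokens featurization, encoder sizes, and InfoNCE-style loss are all illustrative assumptions.

```python
# Hypothetical sketch of contrastive code-to-summary retrieval.
# All sizes, featurizations, and the loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps a bag-of-tokens vector to a unit-normalized embedding."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vocab_size, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(x), dim=-1)

def info_nce(code_emb, summ_emb, temperature=0.07):
    """Symmetric InfoNCE: matching pairs on the diagonal are positives,
    all other in-batch pairs serve as negatives."""
    logits = code_emb @ summ_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

vocab, batch = 1000, 8
code_enc, summ_enc = Encoder(vocab), Encoder(vocab)
opt = torch.optim.Adam(
    list(code_enc.parameters()) + list(summ_enc.parameters()), lr=1e-3
)

# One training step on random features standing in for token counts.
code_feats = torch.rand(batch, vocab)   # placeholder code-token features
summ_feats = torch.rand(batch, vocab)   # placeholder summary-token features
loss = info_nce(code_enc(code_feats), summ_enc(summ_feats))
loss.backward()
opt.step()

# Retrieval: embed a query snippet, return the nearest stored summary.
with torch.no_grad():
    corpus = summ_enc(summ_feats)               # pre-embedded summary corpus
    query = code_enc(code_feats[:1])
    best = (query @ corpus.t()).argmax(dim=-1)  # cosine similarity on unit vectors
```

In-batch negatives keep the contrastive loss cheap to compute; in a realistic setting the two encoders would be Transformer models over code tokens and summary tokens rather than the bag-of-words projections used here for brevity.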