RaxCS: Towards cross-language code summarization with contrastive pre-training and retrieval augmentation

IF 4.3 | Zone 2, Computer Science | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology | Pub Date: 2025-04-10 | DOI: 10.1016/j.infsof.2025.107741

Kaiyuan Yang, Junfeng Wang, Zihua Song

Information and Software Technology, Volume 183, Article 107741 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0950584925000801

Citations: 0

Abstract

Context:

Code summarization is the task of generating a concise natural-language description of a code snippet. Recent work has improved code summarization from various perspectives, e.g., by retrieving external information or introducing large transformer-based models, and has achieved promising performance for a single programming language. When dealing with rapidly expanding cross-language source code datasets, however, existing approaches suffer from two issues: (1) the difficulty of building a universal code representation for multiple languages; and (2) poor performance on low-resource languages.

Objective:

To cope with these issues, we propose a novel code summarization approach named RaxCS, which aims to perform code summarization across multiple languages and improve accuracy for low-resource languages by leveraging cross-language knowledge.

Methods:

We use pre-trained models with a contrastive learning objective to build a unified code representation across multiple languages. To fully mine external knowledge across programming languages, we design a hybrid retrieval module that searches for functionally equivalent code and its corresponding comment to serve as preliminary information. Finally, we employ a decoder-only transformer model to fuse this contextual information, which guides summary generation.
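The abstract does not specify the contrastive loss or what the retrieval module combines, so the following is only a minimal sketch under stated assumptions: an InfoNCE-style loss with in-batch negatives stands in for the contrastive objective, and "hybrid" is assumed to mean a weighted mix of a lexical score (Jaccard token overlap) and a dense score (cosine similarity of embeddings). The names `info_nce_loss`, `hybrid_retrieve`, `alpha`, and the temperature value are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed objective, not confirmed by the paper).

    anchors[i] and positives[i] are embeddings of functionally equivalent code,
    e.g. in two different languages; every other positives[j] acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


def jaccard(tokens_a, tokens_b):
    """Lexical overlap between two token lists, computed on their token sets."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def hybrid_retrieve(query_tokens, query_vec, corpus, alpha=0.5, k=1):
    """Return the k corpus entries with the highest mixed lexical/dense score.

    corpus is a list of dicts with hypothetical keys 'tokens', 'vec', 'comment'.
    alpha weights the lexical score; (1 - alpha) weights the dense score.
    """
    scored = []
    for entry in corpus:
        lexical = jaccard(query_tokens, entry["tokens"])
        dense = cosine(query_vec, entry["vec"])
        scored.append((alpha * lexical + (1 - alpha) * dense, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]
```

In a pipeline of this shape, the retrieved code and its comment would be placed in the decoder-only model's input alongside the query code. How RaxCS actually formats and fuses that context is not described in the abstract.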

Results:

Extensive experiments demonstrate that (1) RaxCS outperforms the state of the art on cross-language code summarization (i.e., RaxCS scores 4.39% higher in BLEU and 8.65% higher in BERTScore), and (2) for low-resource languages, cross-language retrieval lets RaxCS boost code summarization performance by a significant margin (e.g., 6.93% in BLEU for Ruby).
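For readers unfamiliar with BLEU, the sketch below shows its core idea: the geometric mean of modified n-gram precisions between a generated summary and a reference, multiplied by a brevity penalty. It is a simplified, single-reference, sentence-level variant. The add-one smoothing and whitespace tokenization are assumptions, and the evaluation script used in the paper may differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing (an assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm defined when overlap is zero.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A candidate identical to its reference scores 1.0 under this sketch, and a shorter candidate with less n-gram overlap scores lower.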

Conclusion:

This paper introduces a cross-language code summarization model, which utilizes contrastive pre-training and cross-language retrieval. Both are beneficial for incorporating cross-language knowledge to advance code summarization performance. The experimental results demonstrate that RaxCS is effective in generating accurate code summaries, particularly for low-resource languages.




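Source journal

Information and Software Technology (Engineering and Technology - Computer Science: Software Engineering)

CiteScore

9.10

Self-citation rate

7.70%

Articles published

164

Review time

9.6 weeks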

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.