XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training

Zehao Lin, Guodun Li, Jingfeng Zhang, Yue Deng, Xiangji Zeng, Yin Zhang, Yao Wan
{"title":"XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training","authors":"Zehao Lin, Guodun Li, Jingfeng Zhang, Yue Deng, Xiangji Zeng, Yin Zhang, Yao Wan","doi":"10.1145/3506696","DOIUrl":null,"url":null,"abstract":"Source code representation learning is the basis of applying artificial intelligence to many software engineering tasks such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve the performance of source code representation from various perspectives, e.g., introducing the structural information of programs into latent representation. However, when dealing with rapidly expanded unlabeled cross-language source code datasets from the Internet, there are still two issues. Firstly, deep learning models for many code-specific tasks still suffer from the lack of high-quality labels. Secondly, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture. To address these issues, in this article, we propose a novel Cross-language Code representation with a large-scale pre-training (XCode) method. Concretely, we propose to use several abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models trained on about 1.5 million code snippets. To fully utilize the knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture which uses the multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and SED will cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our proposed approach on cross-language code representations. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks in terms of multiple automatic evaluation metrics.","PeriodicalId":7398,"journal":{"name":"ACM Transactions on Software Engineering and Methodology (TOSEM)","volume":"20 1 1","pages":"1 - 44"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Software Engineering and Methodology (TOSEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3506696","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Source code representation learning is the basis for applying artificial intelligence to many software engineering tasks, such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representations from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, when dealing with rapidly expanding unlabeled cross-language source code datasets from the Internet, two issues remain. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it difficult to process multiple languages in a single neural architecture. To address these issues, in this article we propose XCode, a novel method for Cross-language Code representation with large-scale pre-training. Concretely, we use abstract syntax trees and ELMo-enhanced variational autoencoders to obtain multiple pre-trained source code language models, trained on about 1.5 million code snippets. To fully utilize knowledge across programming languages, we further propose a Shared Encoder-Decoder (SED) architecture that uses a multi-teacher single-student method to transfer knowledge from the aforementioned pre-trained models to the distilled SED. The pre-trained models and the SED cooperate to better represent the source code. For evaluation, we examine our approach on three typical downstream cross-language tasks, i.e., source code translation, code clone detection, and code-to-code search, on a real-world dataset composed of programming exercises with multiple solutions. Experimental results demonstrate the effectiveness of our approach for cross-language code representation. Meanwhile, our approach performs significantly better than several code representation baselines on different downstream tasks across multiple automatic evaluation metrics.
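To make the multi-teacher single-student idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the paper's implementation. It assumes per-language teacher models that emit token logits over a shared vocabulary, and distills their averaged softened predictions into a single shared encoder-decoder student via a KL-divergence loss; all names (SharedEncoderDecoder, distill_step, tau) are invented for illustration.

```python
# A minimal sketch (not the authors' code) of multi-teacher single-student
# distillation into a shared encoder-decoder over cross-language code tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderDecoder(nn.Module):
    """A single student shared across programming languages (illustrative)."""
    def __init__(self, vocab_size: int, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                    # (batch, seq, hidden)
        _, state = self.encoder(x)                # encoder summary state
        dec_out, _ = self.decoder(x, state)       # decode from shared state
        return self.proj(dec_out)                 # per-token logits

def distill_step(student, teachers, tokens, optimizer, tau: float = 2.0):
    """One training step: the student matches the averaged soft targets of
    the per-language pre-trained teachers (multi-teacher single-student).
    A real setup might instead weight teachers by the snippet's language."""
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t(tokens) / tau, dim=-1) for t in teachers]
        ).mean(dim=0)                              # averaged soft targets
    student_log_probs = F.log_softmax(student(tokens) / tau, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this reading, each frozen teacher carries one language's pre-trained knowledge, while the distilled student learns a single representation space usable for downstream cross-language tasks such as clone detection and code-to-code search.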