Cross-Language Clone Detection by Learning Over Abstract Syntax Trees

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
Pub Date: 2019-05-01
DOI: 10.1109/MSR.2019.00078

Daniel Perez, S. Chiba

Citations: 36

Abstract

Clone detection across programs written in the same programming language has been studied extensively in the literature. In contrast, the task of detecting clones across multiple programming languages has received much less attention, and approaches based on comparison cannot be directly applied. In this paper, we present a clone detection method based on semi-supervised machine learning, designed to detect clones across programming languages with similar syntax. Our method uses an unsupervised learning approach to learn token-level vector representations and an LSTM-based neural network to predict whether two code fragments are clones. To train our network, we present a cross-language code clone dataset (to the best of our knowledge, the first of its kind) containing around 45,000 code fragments written in Java and Python. We evaluate our approach on this dataset and show that our method gives promising results when detecting similarities between code fragments written in Java and Python.
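As a rough illustration of what "token-level" input derived from an abstract syntax tree might look like, the sketch below flattens a Python fragment into a sequence of AST node type names and maps them to integer ids, the kind of sequence an embedding layer and an LSTM could consume. This is not the authors' pipeline: the function names `ast_node_sequence` and `encode`, the depth-first pre-order traversal, and the use of node type names as tokens are all assumptions made for the example, and only the Python side is shown (Java fragments would need a separate parser).

```python
import ast


def ast_node_sequence(source):
    """Return the AST node type names of `source` in depth-first pre-order.

    Hypothetical helper: using node type names as tokens is an assumption,
    not necessarily the representation used in the paper.
    """
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def encode(sequence, vocab):
    """Map each token to an integer id, growing `vocab` as new tokens appear."""
    return [vocab.setdefault(token, len(vocab)) for token in sequence]


if __name__ == "__main__":
    fragment = "def add(a, b):\n    return a + b\n"
    tokens = ast_node_sequence(fragment)
    vocab = {}
    ids = encode(tokens, vocab)
    print(tokens)
    print(ids)
```

Sharing one vocabulary across languages only makes sense if node types from both parsers are mapped to a common set of names, which is an additional design decision this sketch leaves open.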



