利用基于转换编译器的数据增强进行跨语言克隆检测的途径

2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC) Pub Date : 2023-03-02 DOI:10.1109/ICPC58990.2023.00031

Subroto Nag Pinku, Debajyoti Mondal, C. Roy

{"title":"利用基于转换编译器的数据增强进行跨语言克隆检测的途径","authors":"Subroto Nag Pinku, Debajyoti Mondal, C. Roy","doi":"10.1109/ICPC58990.2023.00031","DOIUrl":null,"url":null,"abstract":"Software clones are often introduced when developers reuse code fragments to implement similar functionalities in the same or different software systems resulting in duplicated fragments or code clones in those systems. Due to the adverse effect of clones on software maintenance, a great many tools and techniques and techniques have appeared in the literature to detect clones. Many high-performing clone detection tools today are based on deep learning techniques and are mostly used for detecting clones written in the same programming language, whereas clone detection tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies that boost the performances of deep learning models could be further leveraged to improve the clone detection tools. In this paper, we investigate such a strategy, data augmentation, which has not yet been explored for cross-language clone detection as opposed to single language clone detection. We show how the existing knowledge on transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models to create cross-language clone detection pipelines. To demonstrate the performance boost for cross-language clone detection through data augmentation, we exploit Transcoder, which is a pre-trained source-to-source translator. To show how to extend single-language models for cross-language clone detection, we extend a popular single-language model, Graph Matching Network (GMN), in a combination with the transcompilers and code parsers (srcML). We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for the cutting-edge cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built leveraging data augmentation outperformed the baseline with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.","PeriodicalId":376593,"journal":{"name":"2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection\",\"authors\":\"Subroto Nag Pinku, Debajyoti Mondal, C. Roy\",\"doi\":\"10.1109/ICPC58990.2023.00031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Software clones are often introduced when developers reuse code fragments to implement similar functionalities in the same or different software systems resulting in duplicated fragments or code clones in those systems. Due to the adverse effect of clones on software maintenance, a great many tools and techniques and techniques have appeared in the literature to detect clones. Many high-performing clone detection tools today are based on deep learning techniques and are mostly used for detecting clones written in the same programming language, whereas clone detection tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies that boost the performances of deep learning models could be further leveraged to improve the clone detection tools. In this paper, we investigate such a strategy, data augmentation, which has not yet been explored for cross-language clone detection as opposed to single language clone detection. We show how the existing knowledge on transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models to create cross-language clone detection pipelines. To demonstrate the performance boost for cross-language clone detection through data augmentation, we exploit Transcoder, which is a pre-trained source-to-source translator. To show how to extend single-language models for cross-language clone detection, we extend a popular single-language model, Graph Matching Network (GMN), in a combination with the transcompilers and code parsers (srcML). We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for the cutting-edge cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built leveraging data augmentation outperformed the baseline with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.\",\"PeriodicalId\":376593,\"journal\":{\"name\":\"2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPC58990.2023.00031\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPC58990.2023.00031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

当开发人员在相同或不同的软件系统中重用代码片段来实现类似的功能，导致这些系统中出现重复的代码片段或代码克隆时，通常会引入软件克隆。由于克隆对软件维护的不利影响，文献中出现了大量检测克隆的工具和技术。目前，许多高性能克隆检测工具都是基于深度学习技术，主要用于检测用同一种编程语言编写的克隆，而用于检测跨语言克隆的克隆检测工具也在迅速涌现。基于深度学习的克隆检测工具的普及为研究如何进一步利用提高深度学习模型性能的已知策略来改进克隆检测工具创造了机会。在本文中，我们研究了这样一种策略，即数据增强，它尚未被用于跨语言克隆检测而不是单语言克隆检测。我们展示了如何将编译器(源到源翻译器)的现有知识用于数据增强，以提高跨语言克隆检测模型的性能，以及如何调整单语言克隆检测模型以创建跨语言克隆检测管道。为了演示通过数据增强实现跨语言克隆检测的性能提升，我们利用了Transcoder，这是一个预训练的源到源翻译器。为了展示如何扩展单语言模型以进行跨语言克隆检测，我们扩展了一个流行的单语言模型，图匹配网络(GMN)，并结合了编译器和代码解析器(srcML)。我们在流行的基准数据集上评估了我们的模型。我们的实验结果表明，对于尖端的跨语言克隆检测模型，F1分数有所提高(有时高达3%)。即使在将GMN扩展到跨语言克隆检测时，利用数据增强构建的模型在精度、召回率和F1得分方面的表现也优于基线，分别为0.90、0.92和0.91。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection

Software clones are often introduced when developers reuse code fragments to implement similar functionalities in the same or different software systems resulting in duplicated fragments or code clones in those systems. Due to the adverse effect of clones on software maintenance, a great many tools and techniques and techniques have appeared in the literature to detect clones. Many high-performing clone detection tools today are based on deep learning techniques and are mostly used for detecting clones written in the same programming language, whereas clone detection tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies that boost the performances of deep learning models could be further leveraged to improve the clone detection tools. In this paper, we investigate such a strategy, data augmentation, which has not yet been explored for cross-language clone detection as opposed to single language clone detection. We show how the existing knowledge on transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models to create cross-language clone detection pipelines. To demonstrate the performance boost for cross-language clone detection through data augmentation, we exploit Transcoder, which is a pre-trained source-to-source translator. To show how to extend single-language models for cross-language clone detection, we extend a popular single-language model, Graph Matching Network (GMN), in a combination with the transcompilers and code parsers (srcML). We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 scores (sometimes up to 3%) for the cutting-edge cross-language clone detection models. Even when extending GMN for cross-language clone detection, the models built leveraging data augmentation outperformed the baseline with scores of 0.90, 0.92, and 0.91 for precision, recall, and F1 score, respectively.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC)

自引率

0.00%

发文量