{"title":"I2R:用于代码搜索的模态内和模态间表示学习","authors":"Xu Zhang, Yanzheng Xiang, Zejie Liu, Xiaoyu Hu, Deyu Zhou","doi":"10.3233/ida-230082","DOIUrl":null,"url":null,"abstract":"Code search, which locates code snippets in large code repositories based on natural language queries entered by developers, has become increasingly popular in the software development process. It has the potential to improve the efficiency of software developers. Recent studies have demonstrated the effectiveness of using deep learning techniques to represent queries and codes accurately for code search. In specific, pre-trained models of programming languages have recently achieved significant progress in code searching. However, we argue that aligning programming and natural languages are crucial as there are two different modalities. Existing pre-train models based approaches for code search do not effectively consider implicit alignments of representations across modalities (inter-modal representation). Moreover, the existing methods do not take into account the consistency constraint of intra-modal representations, making the model ineffective. As a result, we propose a novel code search method that optimizes both intra-modal and inter-modal representation learning. The alignment of the representation between the two modalities is achieved by introducing contrastive learning. Furthermore, the consistency of intra-modal feature representation is constrained by KL-divergence. Our experimental results confirm the model’s effectiveness on seven different test datasets. This paper proposes a code search method that significantly improves existing methods. Our source code is publicly available on GitHub.1","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":" ","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2023-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"I2R: Intra and inter-modal representation learning for code search\",\"authors\":\"Xu Zhang, Yanzheng Xiang, Zejie Liu, Xiaoyu Hu, Deyu Zhou\",\"doi\":\"10.3233/ida-230082\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Code search, which locates code snippets in large code repositories based on natural language queries entered by developers, has become increasingly popular in the software development process. It has the potential to improve the efficiency of software developers. Recent studies have demonstrated the effectiveness of using deep learning techniques to represent queries and codes accurately for code search. In specific, pre-trained models of programming languages have recently achieved significant progress in code searching. However, we argue that aligning programming and natural languages are crucial as there are two different modalities. Existing pre-train models based approaches for code search do not effectively consider implicit alignments of representations across modalities (inter-modal representation). Moreover, the existing methods do not take into account the consistency constraint of intra-modal representations, making the model ineffective. As a result, we propose a novel code search method that optimizes both intra-modal and inter-modal representation learning. The alignment of the representation between the two modalities is achieved by introducing contrastive learning. Furthermore, the consistency of intra-modal feature representation is constrained by KL-divergence. 
Our experimental results confirm the model’s effectiveness on seven different test datasets. This paper proposes a code search method that significantly improves existing methods. Our source code is publicly available on GitHub.1\",\"PeriodicalId\":50355,\"journal\":{\"name\":\"Intelligent Data Analysis\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.8000,\"publicationDate\":\"2023-08-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligent Data Analysis\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.3233/ida-230082\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.3233/ida-230082","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
I2R: Intra and inter-modal representation learning for code search
Code search, which locates code snippets in large code repositories based on natural language queries entered by developers, has become increasingly popular in the software development process and has the potential to improve developer efficiency. Recent studies have demonstrated the effectiveness of deep learning techniques for accurately representing queries and code for code search. In particular, pre-trained models of programming languages have recently achieved significant progress in code search. However, we argue that aligning programming and natural languages is crucial, as they are two different modalities. Existing pre-trained model based approaches to code search do not effectively consider the implicit alignment of representations across modalities (inter-modal representation). Moreover, existing methods do not take into account the consistency constraint on intra-modal representations, which limits model effectiveness. We therefore propose a novel code search method that optimizes both intra-modal and inter-modal representation learning. Alignment of representations between the two modalities is achieved by introducing contrastive learning, and the consistency of intra-modal feature representations is constrained with KL-divergence. Our experimental results confirm the model's effectiveness on seven different test datasets, showing significant improvements over existing methods. Our source code is publicly available on GitHub.1
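The abstract names two training signals, contrastive learning for inter-modal alignment and KL-divergence for intra-modal consistency, but gives no implementation details. The sketch below is an assumption-laden illustration of how such a combined objective is commonly formulated, not the authors' actual method: the InfoNCE-style loss, the temperature of 0.07, the use of two views of the same modality (e.g. from two dropout passes of one encoder), and the weights alpha and beta are all hypothetical.

```python
# Illustrative sketch only; the paper's exact losses are not given in the
# abstract, so this formulation and all names/values here are assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, code_emb, temperature=0.07):
    """Inter-modal contrastive loss: matched (query, code) pairs along the
    batch diagonal are positives; all other pairs in the batch are negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = query_emb @ code_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric: align queries to codes and codes to queries.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def intra_modal_kl(emb_a, emb_b):
    """Intra-modal consistency: penalize divergence between two representations
    of the same input, via a symmetric KL over their softmax distributions."""
    p = F.log_softmax(emb_a, dim=-1)
    q = F.log_softmax(emb_b, dim=-1)
    return (F.kl_div(p, q, log_target=True, reduction="batchmean") +
            F.kl_div(q, p, log_target=True, reduction="batchmean")) / 2

def combined_loss(q1, q2, c1, c2, alpha=1.0, beta=0.1):
    """Hypothetical joint objective: q1/q2 and c1/c2 are two views of the
    query and code embeddings; the weighting scheme is assumed."""
    inter = info_nce_loss(q1, c1)
    intra = intra_modal_kl(q1, q2) + intra_modal_kl(c1, c2)
    return alpha * inter + beta * intra
```

A symmetric KL is used here so that neither view is treated as a fixed target; a one-directional formulation is equally plausible for the intra-modal constraint the abstract describes.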
Journal Introduction:
Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, preference is given to papers that discuss the development of new AI-related data analysis architectures, methodologies, and techniques and their applications to various domains.