Cross-Modal Contrastive Learning for Code Search

Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu
{"title":"代码搜索的跨模态对比学习","authors":"Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu","doi":"10.1109/ICSME55016.2022.00017","DOIUrl":null,"url":null,"abstract":"Code search aims to retrieve code snippets from natural language queries, which serves as a core technology to improve development efficiency. Previous approaches have achieved promising results to learn code and query representations by using BERT-based pre-trained models which, however, leads to semantic collapse problems, i.e. native representations of code and query clustering in a high similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, to improve the representations of code and query by explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities. To maintain semantic consistency of code snippets with different names of functions and variables, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add comparisons between code and augmented code within modalities. Moreover, in order to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method can significantly improve the effectiveness of pre-trained models for code search.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Cross-Modal Contrastive Learning for Code Search\",\"authors\":\"Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu\",\"doi\":\"10.1109/ICSME55016.2022.00017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Code search aims to retrieve code snippets from natural language queries, which serves as a core technology to improve development efficiency. Previous approaches have achieved promising results to learn code and query representations by using BERT-based pre-trained models which, however, leads to semantic collapse problems, i.e. native representations of code and query clustering in a high similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, to improve the representations of code and query by explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities. To maintain semantic consistency of code snippets with different names of functions and variables, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add comparisons between code and augmented code within modalities. Moreover, in order to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. 
Comprehensive experiments demonstrate that our method can significantly improve the effectiveness of pre-trained models for code search.\",\"PeriodicalId\":300084,\"journal\":{\"name\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"volume\":\"129 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSME55016.2022.00017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSME55016.2022.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Code search aims to retrieve code snippets from natural language queries and serves as a core technology for improving development efficiency. Previous approaches have achieved promising results in learning code and query representations with BERT-based pre-trained models; these models, however, suffer from a semantic collapse problem, i.e., the native representations of code and query cluster within a high-similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, which improves the representations of code and query through explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities but also the similarity within modalities. To maintain the semantic consistency of code snippets under different function and variable names, we use data augmentation that renames functions and variables to meaningless tokens, which allows us to add intra-modal comparisons between code and its augmented counterpart. Moreover, to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method significantly improves the effectiveness of pre-trained models for code search.
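The abstract outlines three techniques. Below is a minimal sketch of the first, the contrastive objective, assuming a PyTorch-style setup with in-batch negatives; the InfoNCE formulation, the temperature value, the function names, and the weight alpha are all illustrative assumptions, since the abstract does not spell out the paper's exact loss.

import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # InfoNCE over in-batch negatives: row i of `anchor` should match
    # row i of `positive`; all other rows in the batch act as negatives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (B, B) cosine-similarity logits
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     aug_code_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Cross-modal term (query vs. code) plus an intra-modal term pulling each
    # code snippet toward its identifier-renamed augmentation.
    # `alpha` is an assumed weighting, not a value taken from the paper.
    cross = info_nce(query_emb, code_emb)      # similarity between modalities
    intra = info_nce(code_emb, aug_code_emb)   # similarity within the code modality
    return cross + alpha * intra

The second technique, renaming functions and variables to meaningless tokens, can be sketched for Python source with the standard ast module. This is a rough stand-in for the augmentation the abstract describes: treating every Name node as a renamable variable (including builtins) is a simplification a real implementation would refine.

import ast

class IdentifierRenamer(ast.NodeTransformer):
    # Rename function and variable identifiers to meaningless tokens
    # (FUNC_i / VAR_i). A real implementation would skip builtins,
    # imported names, and attribute accesses.
    def __init__(self):
        self.mapping = {}

    def _rename(self, name: str, prefix: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, "FUNC")
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id, "VAR")
        return node

snippet = "def add(a, b):\n    total = a + b\n    return total"
tree = IdentifierRenamer().visit(ast.parse(snippet))
print(ast.unparse(tree))  # def FUNC_0(VAR_1, VAR_2): ... (requires Python 3.9+)

Finally, the re-ranking step combines a similarity score with retrieval and classification scores. The abstract does not give the exact weighting scheme, so a simple element-wise product is assumed here.

import numpy as np

def rerank(similarity: np.ndarray, retrieval: np.ndarray,
           classification: np.ndarray) -> np.ndarray:
    # Final score: similarity weighted by retrieval and classification scores.
    # Returns candidate indices sorted best-first.
    final = similarity * retrieval * classification
    return np.argsort(-final)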