Cross-Modal Contrastive Learning for Code Search

Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu
{"title":"代码搜索的跨模态对比学习","authors":"Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu","doi":"10.1109/ICSME55016.2022.00017","DOIUrl":null,"url":null,"abstract":"Code search aims to retrieve code snippets from natural language queries, which serves as a core technology to improve development efficiency. Previous approaches have achieved promising results to learn code and query representations by using BERT-based pre-trained models which, however, leads to semantic collapse problems, i.e. native representations of code and query clustering in a high similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, to improve the representations of code and query by explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities. To maintain semantic consistency of code snippets with different names of functions and variables, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add comparisons between code and augmented code within modalities. Moreover, in order to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method can significantly improve the effectiveness of pre-trained models for code search.","PeriodicalId":300084,"journal":{"name":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"129 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Cross-Modal Contrastive Learning for Code Search\",\"authors\":\"Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, Yangyong Zhu\",\"doi\":\"10.1109/ICSME55016.2022.00017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Code search aims to retrieve code snippets from natural language queries, which serves as a core technology to improve development efficiency. Previous approaches have achieved promising results to learn code and query representations by using BERT-based pre-trained models which, however, leads to semantic collapse problems, i.e. native representations of code and query clustering in a high similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, to improve the representations of code and query by explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities, but also the similarity within modalities. To maintain semantic consistency of code snippets with different names of functions and variables, we use data augmentation to rename functions and variables to meaningless tokens, which enables us to add comparisons between code and augmented code within modalities. Moreover, in order to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. 
Comprehensive experiments demonstrate that our method can significantly improve the effectiveness of pre-trained models for code search.\",\"PeriodicalId\":300084,\"journal\":{\"name\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"volume\":\"129 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSME55016.2022.00017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Software Maintenance and Evolution (ICSME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSME55016.2022.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3

Abstract

Code search aims to retrieve code snippets from natural language queries and serves as a core technology for improving development efficiency. Previous approaches have achieved promising results in learning code and query representations with BERT-based pre-trained models; these models, however, suffer from a semantic collapse problem, i.e., the native representations of code and query cluster within a high-similarity interval. In this paper, we propose CrossCS, a cross-modal contrastive learning method for code search, which improves the representations of code and query through explicit fine-grained contrastive objectives. Specifically, we design a novel and effective contrastive objective that considers not only the similarity between modalities but also the similarity within modalities. To maintain the semantic consistency of code snippets under different function and variable names, we use data augmentation that renames functions and variables to meaningless tokens, which allows us to add intra-modal comparisons between code and its augmented counterpart. Moreover, to further improve the effectiveness of pre-trained models, we rank candidate code snippets using similarity scores weighted by retrieval scores and classification scores. Comprehensive experiments demonstrate that our method significantly improves the effectiveness of pre-trained models for code search.
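The abstract outlines three techniques. Below is a minimal sketch of the first, the contrastive objective, assuming a PyTorch-style setup with in-batch negatives; the InfoNCE formulation, the temperature value, the function names, and the weight alpha are all illustrative assumptions, since the abstract does not spell out the paper's exact loss.

import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # InfoNCE over in-batch negatives: row i of `anchor` should match
    # row i of `positive`; all other rows in the batch act as negatives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (B, B) cosine-similarity logits
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

def contrastive_loss(query_emb: torch.Tensor, code_emb: torch.Tensor,
                     aug_code_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Cross-modal term (query vs. code) plus an intra-modal term pulling each
    # code snippet toward its identifier-renamed augmentation.
    # `alpha` is an assumed weighting, not a value taken from the paper.
    cross = info_nce(query_emb, code_emb)      # similarity between modalities
    intra = info_nce(code_emb, aug_code_emb)   # similarity within the code modality
    return cross + alpha * intra

The second technique, renaming functions and variables to meaningless tokens, can be sketched for Python source with the standard ast module. This is a rough stand-in for the augmentation the abstract describes: treating every Name node as a renamable variable (including builtins) is a simplification a real implementation would refine.

import ast

class IdentifierRenamer(ast.NodeTransformer):
    # Rename function and variable identifiers to meaningless tokens
    # (FUNC_i / VAR_i). A real implementation would skip builtins,
    # imported names, and attribute accesses.
    def __init__(self):
        self.mapping = {}

    def _rename(self, name: str, prefix: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"{prefix}_{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node):
        node.name = self._rename(node.name, "FUNC")
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._rename(node.arg, "VAR")
        return node

    def visit_Name(self, node):
        node.id = self._rename(node.id, "VAR")
        return node

snippet = "def add(a, b):\n    total = a + b\n    return total"
tree = IdentifierRenamer().visit(ast.parse(snippet))
print(ast.unparse(tree))  # def FUNC_0(VAR_1, VAR_2): ... (requires Python 3.9+)

Finally, the re-ranking step combines a similarity score with retrieval and classification scores. The abstract does not give the exact weighting scheme, so a simple element-wise product is assumed here.

import numpy as np

def rerank(similarity: np.ndarray, retrieval: np.ndarray,
           classification: np.ndarray) -> np.ndarray:
    # Final score: similarity weighted by retrieval and classification scores.
    # Returns candidate indices sorted best-first.
    final = similarity * retrieval * classification
    return np.argsort(-final)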