Fine-grained Co-Attentive Representation Learning for Semantic Code Search
Zhongyang Deng, Ling Xu, Chao Liu, Meng Yan, Zhou Xu, Yan Lei
2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), March 2022
DOI: 10.1109/saner53432.2022.00055
Citations: 2
Abstract
Code search aims to find code snippets from large-scale code repositories based on the developer's query intent. A significant challenge for code search is the semantic gap between programming language and natural language. Recent works have indicated that deep learning (DL) techniques can perform well by automatically learning the relationships between query and code. Among these DL-based approaches, the state-of-the-art model is TabCS, a two-stage attention-based model for code search. However, TabCS still has two limitations: semantic loss and semantic confusion. TabCS breaks the structural information of code into token-level words of the abstract syntax tree (AST), which loses the sequential semantics between words in programming statements, and it uses a co-attention mechanism to build the semantic correlation of code and query after fusing all features, which may confuse the correlations between individual code features and the query. In this paper, we propose a code search model named FcarCS (Fine-grained Co-Attentive Representation Learning Model for Semantic Code Search). FcarCS extracts code textual features (i.e., method name, API sequence, and tokens) and structural features that introduce a statement-level code structure. Unlike TabCS, FcarCS splits the AST into a series of subtrees corresponding to code statements and treats each subtree as a whole to preserve the sequential semantics between words in code statements. FcarCS constructs a new fine-grained co-attention mechanism to learn interdependent representations for each code feature and the query, respectively, instead of performing one co-attention process over the fused code features as TabCS does. Generally, this mechanism leverages row/column-wise CNNs to enable our model to focus on the strongly correlated local information between each code feature and the query. We train and evaluate FcarCS on an open Java dataset with 475k and 10k code/query pairs, respectively.
Experimental results show that FcarCS achieves an MRR of 0.613, outperforming three state-of-the-art models, DeepCS, UNIF, and TabCS, by 117.38%, 16.76%, and 12.68%, respectively. We also performed a user study for each model with 50 real-world queries; the results show that FcarCS returns code snippets that are more relevant than those of the baseline models.
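To make the co-attention idea concrete: the paper's fine-grained mechanism applies row/column-wise CNNs over a code-query correlation matrix, per code feature. The sketch below is a deliberately simplified, hypothetical version of generic co-attention pooling (max-pooling stands in for the CNNs; all names and dimensions are illustrative, not the paper's implementation), showing how a correlation matrix yields attention weights for both sides.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def co_attention(code_emb, query_emb):
    """Pool one code feature and a query into aligned fixed-size vectors.

    code_emb:  (m, d) embeddings of one code feature's words
    query_emb: (n, d) embeddings of the query's words
    """
    # Correlation matrix between every code word and every query word.
    A = np.tanh(code_emb @ query_emb.T)      # shape (m, n)
    # Row/column-wise pooling: each word's strongest cross-side match
    # (the paper uses CNNs here; max-pooling is a simpler stand-in).
    code_weights = softmax(A.max(axis=1))    # shape (m,)
    query_weights = softmax(A.max(axis=0))   # shape (n,)
    # Attention-weighted sums give the final representations.
    return code_weights @ code_emb, query_weights @ query_emb

rng = np.random.default_rng(0)
code_vec, query_vec = co_attention(rng.normal(size=(7, 16)),
                                   rng.normal(size=(4, 16)))
print(code_vec.shape, query_vec.shape)  # (16,) (16,)
```

Because the attention weights on the code side depend on the query (and vice versa), the two output vectors are interdependent representations, which is the property FcarCS computes per code feature rather than once over fused features.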
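The reported MRR (mean reciprocal rank) averages, over all queries, the reciprocal of the rank at which the first relevant snippet appears. A minimal sketch (the ranks below are made-up illustrative values, not from the paper's evaluation):

```python
def mean_reciprocal_rank(first_hit_ranks):
    """MRR: average of 1/rank of the first relevant result per query."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)

# Hypothetical example: over three queries the correct snippet is
# ranked 1st, 2nd, and 4th -> (1 + 1/2 + 1/4) / 3
print(round(mean_reciprocal_rank([1, 2, 4]), 4))  # 0.5833
```

An MRR of 0.613 thus means the first relevant snippet tends to appear near the top of the returned list, on average between rank 1 and rank 2.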