How to better utilize code graphs in semantic code search?
Yucen Shi, Ying Yin, Zhengkui Wang, David Lo, Tao Zhang, Xin Xia, Yuhai Zhao, Bowen Xu
DOI: 10.1145/3540250.3549087
Publication date: 2022-11-07
Citations: 4
Abstract
Semantic code search greatly facilitates software reuse by enabling users to find code snippets that closely match user-specified natural language queries. Because of the rich expressive power of code graphs (e.g., the control-flow graph and the program dependency graph), both mainstream lines of research (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs for code modelling. However, they still have some limitations: first, there is still much room for improvement in search effectiveness; second, they have not fully considered the unique features of code graphs. In this paper, we propose a Graph-to-Sequence Converter, namely G2SC. By converting code graphs into lossless sequences, G2SC addresses the problem of learning from small graphs through sequence feature learning and captures both the edge and node attribute information of code graphs, so the effectiveness of code search can be greatly improved. Specifically, G2SC first converts the code graph into a unique corresponding node sequence through a specific graph traversal strategy. It then obtains a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantees that the process is one-to-one and reversible. G2SC captures rich semantic relationships (i.e., control flow, data flow, and node/relationship properties) and provides a learning-model-friendly data transformation. It can be flexibly integrated with existing models to better utilize code graphs. As a proof-of-concept application, we present two G2SC-enabled models: GSMM (a G2SC-enabled multi-modal model) and GSCodeBERT (a G2SC-enabled CodeBERT model). Extensive experimental results on two real-world large-scale datasets demonstrate that GSMM and GSCodeBERT improve over the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and by 63% and 11.5% on MRR, respectively.
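To make the graph-to-sequence idea concrete, the following is a minimal illustrative sketch of converting a small code graph (statement nodes plus labeled control-/data-flow edges) into a token sequence that retains both node and edge information. The abstract does not specify the paper's actual traversal strategies, so the ordering rule, node ids, edge labels, and function names below are assumptions for illustration only, and the node-traversal and statement-replacement steps are folded into one pass here.

```python
# Illustrative sketch only: NOT the G2SC algorithm from the paper.
# It shows the general shape of a lossless graph-to-sequence conversion
# in which edge labels and node/statement information survive in the output.
from collections import defaultdict


def graph_to_sequence(nodes, edges):
    """Convert a small code graph into a token sequence.

    nodes: dict mapping node id -> statement string
    edges: list of (src, dst, label) tuples, e.g. ("n1", "n2", "CFG")
    """
    # Group outgoing edges by source node.
    adj = defaultdict(list)
    for src, dst, label in edges:
        adj[src].append((label, dst))

    # A fixed, deterministic ordering (sorted ids here) so the same graph
    # always yields the same sequence; the paper instead relies on carefully
    # designed traversal strategies to make the mapping one-to-one and reversible.
    sequence = []
    for node_id in sorted(nodes):
        sequence.append(f"[{node_id}]")        # node marker
        sequence.append(nodes[node_id])        # node replaced by its statement
        for label, dst in sorted(adj[node_id]):
            sequence.append(f"<{label}:{dst}>")  # edge label + target kept inline
    return sequence


# Hypothetical toy example: a two-statement snippet with one control-flow
# and one data-flow edge between the statements.
nodes = {"n1": "x = read()", "n2": "print(x)"}
edges = [("n1", "n2", "CFG"), ("n1", "n2", "DFG")]
print(graph_to_sequence(nodes, edges))
# ['[n1]', 'x = read()', '<CFG:n2>', '<DFG:n2>', '[n2]', 'print(x)']
```

Because every node id, statement, and labeled edge reappears in the output, the original toy graph can be reconstructed from the sequence, which is the property the paper emphasizes when feeding code graphs to sequence models such as the multi-modal and CodeBERT-based variants it evaluates.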