Learning Program Semantics with Code Representations: An Empirical Study

2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) | Pub Date: 2022-03-01 | DOI: 10.48550/arXiv.2203.11790

J. Siow, Shangqing Liu, Xiaofei Xie, Guozhu Meng, Yang Liu


Citations: 22

Abstract

Learning program semantics is fundamental to many code intelligence tasks, e.g., vulnerability detection and clone detection. A considerable body of existing work proposes diverse approaches to learning program semantics for different tasks, and these approaches have achieved state-of-the-art performance. However, a comprehensive and systematic study that evaluates different program representation techniques across diverse tasks is still missing. Starting from this gap, we conduct an empirical study to evaluate different program representation techniques. Specifically, we group current mainstream code representation techniques into four categories, i.e., Feature-based, Sequence-based, Tree-based, and Graph-based program representation techniques, and evaluate their performance on three diverse and popular code intelligence tasks, i.e., Code Classification, Vulnerability Detection, and Clone Detection, on publicly released benchmarks. We further design three research questions (RQs) and conduct a comprehensive analysis to investigate the performance. From the extensive experimental results, we conclude that (1) the graph-based representation is superior to the other selected techniques across these tasks; (2) compared with the node type information used in tree-based and graph-based representations, the node textual information is more critical to learning program semantics; and (3) different tasks require task-specific semantics to achieve their highest performance; however, combining program semantics from different dimensions, such as control dependency and data dependency, can still produce promising results.
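To make the four representation categories concrete, the sketch below derives a toy version of each from one small function. It is an illustration only, not the pipeline used in the study: the abstract does not specify the models, languages, or tooling, so the use of Python's standard `ast` and `tokenize` modules, the sample `SOURCE` snippet, and all function names here are assumptions.

```python
import ast
import io
import tokenize

# Toy input, chosen only for illustration (hypothetical, not from the paper).
SOURCE = "def add(a, b):\n    total = a + b\n    return total\n"

# Layout-only tokens that carry no lexical content.
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER}


def token_sequence(source):
    """Sequence-based view: the flat list of lexical tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in _LAYOUT]


def handcrafted_features(source):
    """Feature-based view: a few hand-picked counts (assumes one function)."""
    func = ast.parse(source).body[0]
    return {"n_tokens": len(token_sequence(source)),
            "n_statements": len(func.body)}


def tree_edges(source):
    """Tree-based view: parent-child AST edges labelled by node type."""
    edges = []
    for parent in ast.walk(ast.parse(source)):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


def node_labels(source):
    """Pair each named AST node's type with its text (identifier)."""
    labels = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            labels.append((type(node).__name__, node.id))
        elif isinstance(node, ast.arg):
            labels.append((type(node).__name__, node.arg))
        elif isinstance(node, ast.FunctionDef):
            labels.append((type(node).__name__, node.name))
    return labels


def data_flow_edges(source):
    """Graph-based view: (variable, def line, use line) data-dependency edges.

    Handles only straight-line code inside a single function; real
    graph builders also add control-dependency edges.
    """
    func = ast.parse(source).body[0]
    last_def = {a.arg: func.lineno for a in func.args.args}
    edges = []
    for stmt in func.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Record uses before updating definitions made by this statement.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((n.id, last_def[n.id], n.lineno))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = n.lineno
    return edges


if __name__ == "__main__":
    print(handcrafted_features(SOURCE))
    print(token_sequence(SOURCE))
    print(tree_edges(SOURCE))
    print(node_labels(SOURCE))
    print(data_flow_edges(SOURCE))
```

In this sketch, `tree_edges` keeps only node types, while `node_labels` exposes the identifier text attached to those nodes, which is the distinction finding (2) draws. `data_flow_edges` adds the kind of dependency information that a graph-based representation layers on top of the tree.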
