Path-sensitive code embedding via contrastive learning for software vulnerability detection

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. Pub Date: 2022-07-18. DOI: 10.1145/3533767.3534371

Xiao Cheng, Guanqin Zhang, Haoyu Wang, Yulei Sui


Citations: 26

Abstract

Machine learning and its promising branch, deep learning, have shown success in a wide range of application domains. Recently, much effort has gone into applying deep learning techniques (e.g., graph neural networks) to static vulnerability detection as an alternative to conventional bug detection methods. To capture the structural information of code, current learning approaches typically abstract a program as graphs (e.g., data-flow graphs, abstract syntax trees) and then train a classification model on the (sub)graphs of safe and vulnerable code fragments for vulnerability prediction. However, these models are still insufficient for precise bug detection, because their objective is to produce classification results rather than to comprehend the semantics of vulnerabilities, e.g., to pinpoint bug-triggering paths, which are essential for static bug detection. This paper presents ContraFlow, a selective yet precise contrastive value-flow embedding approach to statically detect software vulnerabilities. The novelty of ContraFlow lies in selecting and preserving feasible value-flow (aka program dependence) paths through a pretrained path embedding model using self-supervised contrastive learning, thus significantly reducing the amount of labeled data required for training expensive downstream models for path-based vulnerability detection. We evaluated ContraFlow on 288 real-world projects, comparing it against eight recent learning-based approaches. ContraFlow outperforms these eight baselines by up to 334.1%, 317.9%, and 58.3% for informedness, markedness, and F1 score, respectively, and achieves up to 450.0%, 192.3%, and 450.0% improvement in mean statement recall, mean statement precision, and mean IoU, respectively, in locating buggy statements.
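The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of informedness (recall + specificity - 1), markedness (precision + negative predictive value - 1), F1, and set-based intersection over union. The function names and the sample counts are hypothetical, and the paper's exact definitions and averaging scheme are assumed rather than confirmed here.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Classification metrics from confusion-matrix counts (standard definitions)."""
    recall = _ratio(tp, tp + fn)          # true positive rate
    specificity = _ratio(tn, tn + fp)     # true negative rate
    precision = _ratio(tp, tp + fp)       # positive predictive value
    npv = _ratio(tn, tn + fn)             # negative predictive value
    return {
        "informedness": recall + specificity - 1,
        "markedness": precision + npv - 1,
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


def statement_localization(predicted, actual):
    """Statement-level recall, precision, and IoU for one sample.

    `predicted` and `actual` are collections of statement identifiers
    (line numbers here, which is an assumption for illustration).
    """
    predicted, actual = set(predicted), set(actual)
    overlap = len(predicted & actual)
    return {
        "recall": _ratio(overlap, len(actual)),
        "precision": _ratio(overlap, len(predicted)),
        "iou": _ratio(overlap, len(predicted | actual)),
    }


if __name__ == "__main__":
    # Hypothetical counts, not results from the paper.
    print(detection_metrics(tp=40, fp=10, tn=45, fn=5))
    print(statement_localization(predicted=[3, 4, 7], actual=[4, 7, 9, 12]))
```

Under this reading, the "mean" statement-level figures would be obtained by averaging the per-sample values over all evaluated samples.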



