Scalable taint specification inference with big code

Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. Pub Date: 2019-06-08. DOI: 10.1145/3314221.3314648

Victor Chibotaru, Benjamin Bichsel, Veselin Raychev, Martin T. Vechev

Citations: 24

Abstract

We present a new scalable, semi-supervised method for inferring taint analysis specifications by learning from a large dataset of programs. Taint specifications capture the role of library APIs (source, sink, sanitizer) and are a critical ingredient of any taint analyzer that aims to detect security violations based on information flow. The core idea of our method is to formulate the taint specification learning problem as a linear optimization task over a large set of information flow constraints. The resulting constraint system can then be efficiently solved with state-of-the-art solvers. Thanks to its scalability, our method can infer many new and interesting taint specifications by simultaneously learning from a large dataset of programs (e.g., as found on GitHub), while requiring few manual annotations. We implemented our method in an end-to-end system, called Seldon, targeting Python, a language where static specification inference is particularly hard due to lack of typing information. We show that Seldon is practically effective: it learned almost 7,000 API roles from over 210,000 candidate APIs with very little supervision (fewer than 300 annotations) and with high estimated precision (67%). Further, using the learned specifications, our taint analyzer flagged more than 20,000 violations in open-source projects, 97% of which were undetectable without the inferred specifications.
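The idea of learning API roles from information flow constraints can be illustrated with a toy sketch. This is not Seldon's actual formulation: the constraint used here (if two positions of an observed source-to-sanitizer-to-sink flow hold their roles, the third should too), the penalty weight `lam`, and every API name other than the hand-labeled seeds are assumptions made for illustration. The objective is piecewise linear, so it could be rewritten as a linear program with slack variables; to stay dependency-free, the sketch minimizes it with projected subgradient descent rather than an LP solver.

```python
from collections import defaultdict

# Position in an observed flow determines which role is scored (toy assumption).
ROLES = ("source", "sanitizer", "sink")


def infer_roles(flows, seeds, lam=0.1, steps=500, lr=0.05):
    """Fit a score in [0, 1] per (api, role) by projected subgradient descent.

    Minimizes sum of hinge violations + lam * sum of non-seed scores.
    Seeds (manual annotations) are pinned to 1.0.
    """
    score = defaultdict(float)  # unknown (api, role) pairs start at 0.0
    for key in seeds:
        score[key] = 1.0
    for _ in range(steps):
        grad = defaultdict(float)
        for flow in flows:
            keys = list(zip(flow, ROLES))
            for i, target in enumerate(keys):
                a, b = [k for j, k in enumerate(keys) if j != i]
                # Hypothetical soft constraint: score[target] >= score[a] + score[b] - 1
                violation = score[a] + score[b] - 1.0 - score[target]
                if violation > 0:
                    grad[target] -= 1.0
                    grad[a] += 1.0
                    grad[b] += 1.0
        for key in list(score):
            if key in seeds:
                continue  # annotations stay fixed
            step = lr * (grad[key] + lam)  # lam penalizes assigning roles without evidence
            score[key] = min(1.0, max(0.0, score[key] - step))  # project onto [0, 1]
    return dict(score)


def learned_roles(score, seeds, threshold=0.5):
    """Return non-seed (api, role) pairs whose score reaches the threshold."""
    return sorted(k for k, v in score.items() if k not in seeds and v >= threshold)


# Three hand-labeled seeds; all other API names below are hypothetical.
seeds = {
    ("request.args.get", "source"),
    ("escape", "sanitizer"),
    ("execute", "sink"),
}
# Each tuple is one observed flow: data passes through the APIs left to right.
flows = [
    ("get_param", "escape", "execute"),
    ("request.args.get", "clean", "execute"),
    ("request.args.get", "escape", "run_query"),
    ("read_config", "format_name", "log"),  # no seeds nearby, so no evidence
]
scores = infer_roles(flows, seeds)
print(learned_roles(scores, seeds))
```

With these toy inputs, the three unlabeled APIs that share a flow with two seeds should end up above the threshold, while the APIs in the last flow stay at zero because no annotated role constrains them. A real system would need far richer constraints and a proper solver; the sketch only shows how a few annotations can propagate through flow constraints.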

