RTFM! Automatic Assumption Discovery and Verification Derivation from Library Document for API Misuse Detection

Tao Lv, Ruishi Li, Yi Yang, Kai Chen, Xiaojing Liao, Xiaofeng Wang, Peiwei Hu, Luyi Xing
{"title":"RTFM! Automatic Assumption Discovery and Verification Derivation from Library Document for API Misuse Detection","authors":"Tao Lv, Ruishi Li, Yi Yang, Kai Chen, Xiaojing Liao, Xiaofeng Wang, Peiwei Hu, Luyi Xing","doi":"10.1145/3372297.3423360","DOIUrl":null,"url":null,"abstract":"To use library APIs, a developer is supposed to follow guidance and respect some constraints, which we call integration assumptions (IAs). Violations of these assumptions can have serious consequences, introducing security-critical flaws such as use-after-free, NULL-dereference, and authentication errors. Analyzing a program for compliance with IAs involves significant effort and needs to be automated. A promising direction is to automatically recover IAs from a library document using Natural Language Processing (NLP) and then verify their consistency with the ways APIs are used in a program through code analysis. However, a practical solution along this line needs to overcome several key challenges, particularly the discovery of IAs from loosely formatted documents and interpretation of their informal descriptions to identify complicated constraints (e.g., data-/control-flow relations between different APIs). In this paper, we present a new technique for automated assumption discovery and verification derivation from library documents. Our approach, called Advance, utilizes a suite of innovations to address those challenges. More specifically, we leverage the observation that IAs tend to express a strong sentiment in emphasizing the importance of a constraint, particularly those security-critical, and utilize a new sentiment analysis model to accurately recover them from loosely formatted documents. These IAs are further processed to identify hidden references to APIs and parameters, through an embedding model, to identify the information-flow relations expected to be followed. 
Then our approach runs frequent subtree mining to discover the grammatical units in IA sentences that tend to indicate some categories of constraints that could have security implications. These components are mapped to verification code snippets organized in line with the IA sentence's grammatical structure, and can be assembled into verification code executed through CodeQL to discover misuses inside a program. We implemented this design and evaluated it on 5 popular libraries (OpenSSL, SQLite, libpcap, libdbus and libxml2) and 39 real-world applications. Our analysis discovered 193 API misuses, including 139 flaws never reported before.","PeriodicalId":20481,"journal":{"name":"Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security","volume":"47 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2020-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3372297.3423360","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16

Abstract

To use library APIs, a developer is supposed to follow guidance and respect some constraints, which we call integration assumptions (IAs). Violations of these assumptions can have serious consequences, introducing security-critical flaws such as use-after-free, NULL-dereference, and authentication errors. Analyzing a program for compliance with IAs involves significant effort and needs to be automated. A promising direction is to automatically recover IAs from a library document using Natural Language Processing (NLP) and then verify their consistency with the ways APIs are used in a program through code analysis. However, a practical solution along this line needs to overcome several key challenges, particularly the discovery of IAs from loosely formatted documents and interpretation of their informal descriptions to identify complicated constraints (e.g., data-/control-flow relations between different APIs). In this paper, we present a new technique for automated assumption discovery and verification derivation from library documents. Our approach, called Advance, utilizes a suite of innovations to address those challenges. More specifically, we leverage the observation that IAs tend to express a strong sentiment in emphasizing the importance of a constraint, particularly security-critical ones, and utilize a new sentiment analysis model to accurately recover them from loosely formatted documents. These IAs are further processed to identify hidden references to APIs and parameters, through an embedding model, and thereby recover the information-flow relations expected to be followed. Then our approach runs frequent subtree mining to discover the grammatical units in IA sentences that tend to indicate some categories of constraints that could have security implications.
These components are mapped to verification code snippets organized in line with the IA sentence's grammatical structure, and can be assembled into verification code executed through CodeQL to discover misuses inside a program. We implemented this design and evaluated it on 5 popular libraries (OpenSSL, SQLite, libpcap, libdbus and libxml2) and 39 real-world applications. Our analysis discovered 193 API misuses, including 139 flaws never reported before.
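As a rough illustration of the first stage of the pipeline described above, the sketch below flags documentation sentences whose strong modal wording suggests an integration assumption. The cue-word list and the simple substring matching are our own illustrative assumptions; the paper's actual approach uses a trained sentiment analysis model over loosely formatted documents, not a keyword filter.

```python
# Hypothetical, simplified stand-in for Advance's IA-discovery step:
# flag documentation sentences whose strong constraint language
# (e.g., "must", "must not", "never") suggests an integration
# assumption. Cue words and scoring are illustrative only.

STRONG_CUES = (
    "must not", "must", "should not", "never", "required",
    "undefined behavior", "caller's responsibility",
)

def looks_like_ia(sentence: str) -> bool:
    """Return True if the sentence carries strong constraint sentiment."""
    s = sentence.lower()
    return any(cue in s for cue in STRONG_CUES)

# Example sentences in the style of library documentation (invented here):
doc_sentences = [
    "The application must finalize all prepared statements before closing the connection.",
    "This function returns the library version string.",
    "The pointer must not be used after the object has been freed.",
]

candidate_ias = [s for s in doc_sentences if looks_like_ia(s)]
# The first and third sentences are flagged as candidate IAs;
# downstream stages would then extract API references, mine
# grammatical units, and emit verification code (e.g., for CodeQL).
```

In the real system, the flagged sentences would then feed the embedding-based API/parameter resolution and frequent-subtree-mining stages; this sketch only shows why strong-sentiment wording is a usable discovery signal.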