Benchmarking the Benchmarks

Marc Miltenberger, Steven Arzt, Philipp Holzinger, Julius Näumann
{"title":"基准测试","authors":"Marc Miltenberger, Steven Arzt, Philipp Holzinger, Julius Näumann","doi":"10.1145/3579856.3582830","DOIUrl":null,"url":null,"abstract":"Over the years, security researchers have developed a broad spectrum of automatic code scanners that aim to find security vulnerabilities in applications. Security benchmarks are commonly used to evaluate novel scanners or program analysis techniques. Each benchmark consists of multiple positive test cases that reflect typical implementations of vulnerabilities, as well as negative test cases, that reflect secure implementations without security flaws. Based on this ground truth, researchers can demonstrate the recall and precision of their novel contributions. However, as we found, existing security benchmarks are often underspecified with respect to their underlying assumptions and threat models. This may lead to misleading evaluation results when testing code scanners, since it requires the scanner to follow unclear and sometimes even contradictory assumptions. To help improve the quality of benchmarks, we propose SecExploitLang, a specification language that allows the authors of benchmarks to specify security assumptions along with their test cases. We further present Exploiter, a tool than can automatically generate exploit code based on a test case and its SecExploitLang specification to demonstrate the correctness of the test case. We created SecExploitLang specifications for two common security benchmarks and used Exploiter to evaluate the adequacy of their test case implementations. Our results show clear shortcomings in both benchmarks, i.e., a significant number of positive test cases turn out to be unexploitable, and even some negative test case implementation turn out to be exploitable. As we explain, the reasons for this include implementation defects, as well as design flaws, which impacts the meaningfulness of evaluations that were based on them. Our work shall highlight the importance of thorough benchmark design and evaluation, and the concepts and tools we propose shall facilitate this task.","PeriodicalId":156082,"journal":{"name":"Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Benchmarking the Benchmarks\",\"authors\":\"Marc Miltenberger, Steven Arzt, Philipp Holzinger, Julius Näumann\",\"doi\":\"10.1145/3579856.3582830\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Over the years, security researchers have developed a broad spectrum of automatic code scanners that aim to find security vulnerabilities in applications. Security benchmarks are commonly used to evaluate novel scanners or program analysis techniques. Each benchmark consists of multiple positive test cases that reflect typical implementations of vulnerabilities, as well as negative test cases, that reflect secure implementations without security flaws. Based on this ground truth, researchers can demonstrate the recall and precision of their novel contributions. However, as we found, existing security benchmarks are often underspecified with respect to their underlying assumptions and threat models. This may lead to misleading evaluation results when testing code scanners, since it requires the scanner to follow unclear and sometimes even contradictory assumptions. 
To help improve the quality of benchmarks, we propose SecExploitLang, a specification language that allows the authors of benchmarks to specify security assumptions along with their test cases. We further present Exploiter, a tool than can automatically generate exploit code based on a test case and its SecExploitLang specification to demonstrate the correctness of the test case. We created SecExploitLang specifications for two common security benchmarks and used Exploiter to evaluate the adequacy of their test case implementations. Our results show clear shortcomings in both benchmarks, i.e., a significant number of positive test cases turn out to be unexploitable, and even some negative test case implementation turn out to be exploitable. As we explain, the reasons for this include implementation defects, as well as design flaws, which impacts the meaningfulness of evaluations that were based on them. Our work shall highlight the importance of thorough benchmark design and evaluation, and the concepts and tools we propose shall facilitate this task.\",\"PeriodicalId\":156082,\"journal\":{\"name\":\"Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security\",\"volume\":\"57 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3579856.3582830\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3579856.3582830","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

Over the years, security researchers have developed a broad spectrum of automatic code scanners that aim to find security vulnerabilities in applications. Security benchmarks are commonly used to evaluate novel scanners or program analysis techniques. Each benchmark consists of multiple positive test cases that reflect typical implementations of vulnerabilities, as well as negative test cases that reflect secure implementations without security flaws. Based on this ground truth, researchers can demonstrate the recall and precision of their novel contributions. However, as we found, existing security benchmarks are often underspecified with respect to their underlying assumptions and threat models. This may lead to misleading evaluation results when testing code scanners, since it requires the scanner to follow unclear and sometimes even contradictory assumptions. To help improve the quality of benchmarks, we propose SecExploitLang, a specification language that allows the authors of benchmarks to specify security assumptions along with their test cases. We further present Exploiter, a tool that can automatically generate exploit code based on a test case and its SecExploitLang specification to demonstrate the correctness of the test case. We created SecExploitLang specifications for two common security benchmarks and used Exploiter to evaluate the adequacy of their test case implementations. Our results show clear shortcomings in both benchmarks: a significant number of positive test cases turn out to be unexploitable, and even some negative test case implementations turn out to be exploitable. As we explain, the reasons for this include implementation defects as well as design flaws, which impact the meaningfulness of evaluations based on them. Our work highlights the importance of thorough benchmark design and evaluation, and the concepts and tools we propose facilitate this task.
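
To make the positive/negative test case distinction concrete, below is a minimal, hypothetical Java sketch in the style of such benchmarks. It is not taken from either benchmark the paper evaluates, and the class name, method names, and `users` table are invented for illustration; the paper's actual SecExploitLang syntax is not reproduced here. The positive case contains a SQL injection flaw a scanner should flag; the negative case avoids it with a parameterized query. The comments also note the kind of unstated threat-model assumption the paper criticizes: the positive case is only exploitable if an attacker actually controls the input.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical benchmark-style test case pair (illustrative only; not taken
// from the benchmarks evaluated in the paper).
public class SqlInjectionTestCase {

    // Positive test case: a scanner should report this as vulnerable.
    // Implicit threat-model assumption: the attacker controls `userInput`
    // (e.g., it originates from an HTTP request parameter). If the benchmark
    // never states this, a scanner assuming trusted input would "miss" a
    // finding that was never exploitable under its model.
    public ResultSet findUserVulnerable(Connection conn, String userInput)
            throws SQLException {
        Statement stmt = conn.createStatement();
        // Tainted data flows into the query string unchanged: SQL injection.
        return stmt.executeQuery(
                "SELECT * FROM users WHERE name = '" + userInput + "'");
    }

    // Negative test case: a scanner reporting this produces a false positive.
    public ResultSet findUserSafe(Connection conn, String userInput)
            throws SQLException {
        // The parameter is bound rather than concatenated, so the input
        // cannot alter the structure of the SQL statement.
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM users WHERE name = ?");
        stmt.setString(1, userInput);
        return stmt.executeQuery();
    }
}
```

A tool in the spirit of Exploiter would then try to demonstrate, under the stated assumptions, that the positive case actually leaks data (e.g., via a payload such as `' OR '1'='1`) and that the negative case does not; the paper's finding is that for a significant number of cases in existing benchmarks, this demonstration fails.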