Research Report: Building a Wide Reach Corpus for Secure Parser Development

2020 IEEE Security and Privacy Workshops (SPW) | Pub Date: 2020-05-01 | DOI: 10.1109/SPW50608.2020.00066

Timothy B. Allison, Wayne Burke, V. Constantinou, Edwin Goh, C. Mattmann, Anastasija Mensikova, Philip Southam, R. Stonebraker, Virisha Timmaraju

Citations: 2

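Abstract

Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features, and make these features amenable to search and use in analytics. Additionally, we document the opportunities and limitations of some popular open-source datasets and annotation tools, which will benefit researchers who need to efficiently gather a large file corpus for LangSec parser development.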

Source journal

2020 IEEE Security and Privacy Workshops (SPW)

Self-citation rate

0.00%

Number of publications