研究报告:为安全解析器开发构建一个广泛的语料库

Timothy B. Allison, Wayne Burke, V. Constantinou, Edwin Goh, C. Mattmann, Anastasija Mensikova, Philip Southam, R. Stonebraker, Virisha Timmaraju
{"title":"研究报告:为安全解析器开发构建一个广泛的语料库","authors":"Timothy B. Allison, Wayne Burke, V. Constantinou, Edwin Goh, C. Mattmann, Anastasija Mensikova, Philip Southam, R. Stonebraker, Virisha Timmaraju","doi":"10.1109/SPW50608.2020.00066","DOIUrl":null,"url":null,"abstract":"Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the Language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features and make these features amenable to search and use in analytics. Additionally, we provide documentation on opportunities and limitations of some popular open-source datasets and annotation tools that will benefit researchers which need to efficiently gather a large file corpus for the purposes of LangSec parser development.","PeriodicalId":413600,"journal":{"name":"2020 IEEE Security and Privacy Workshops (SPW)","volume":"38 10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Research Report: Building a Wide Reach Corpus for Secure Parser Development\",\"authors\":\"Timothy B. Allison, Wayne Burke, V. Constantinou, Edwin Goh, C. Mattmann, Anastasija Mensikova, Philip Southam, R. Stonebraker, Virisha Timmaraju\",\"doi\":\"10.1109/SPW50608.2020.00066\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the Language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features and make these features amenable to search and use in analytics. Additionally, we provide documentation on opportunities and limitations of some popular open-source datasets and annotation tools that will benefit researchers which need to efficiently gather a large file corpus for the purposes of LangSec parser development.\",\"PeriodicalId\":413600,\"journal\":{\"name\":\"2020 IEEE Security and Privacy Workshops (SPW)\",\"volume\":\"38 10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE Security and Privacy Workshops (SPW)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPW50608.2020.00066\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE Security and Privacy Workshops (SPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPW50608.2020.00066","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

解析电子文件的计算机软件常常容易受到恶意输入数据的攻击。语言理论安全(LangSec)哲学不是依靠开发人员实现针对此类数据的特别防御,而是在整个软件开发生命周期中提供正式正确且可验证的输入处理。无论是从规范开发还是从示例派生解析器,LangSec解析器开发人员都需要目标文件格式的广泛语料库,以便识别关键边缘情况或与格式规范的常见偏差。在这份研究报告中,我们提供了几种方法的细节,我们用来收集大约3000万个文件,提取特征,并使这些特征适合搜索和分析使用。此外,我们还提供了一些流行的开源数据集和注释工具的机会和局限性的文档,这些文档将有助于研究人员有效地收集大型文件语料库,以用于LangSec解析器的开发。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Research Report: Building a Wide Reach Corpus for Secure Parser Development
Computer software that parses electronic files is often vulnerable to maliciously crafted input data. Rather than relying on developers to implement ad hoc defenses against such data, the Language-theoretic security (LangSec) philosophy offers formally correct and verifiable input handling throughout the software development lifecycle. Whether developing from a specification or deriving parsers from samples, LangSec parser developers require wide-reach corpora of their target file format in order to identify key edge cases or common deviations from the format's specification. In this research report, we provide the details of several methods we have used to gather approximately 30 million files, extract features and make these features amenable to search and use in analytics. Additionally, we provide documentation on opportunities and limitations of some popular open-source datasets and annotation tools that will benefit researchers which need to efficiently gather a large file corpus for the purposes of LangSec parser development.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信