Prashant Anantharaman, R. Lathrop, Rebecca Shapiro, M. Locasto
2023 IEEE Security and Privacy Workshops (SPW), May 2023. DOI: 10.1109/SPW59333.2023.00017 (https://doi.org/10.1109/SPW59333.2023.00017)
PolyDoc: Surveying PDF Files from the PolySwarm network
Complex data formats implicitly demand complex logic to parse and understand them. The Portable Document Format (PDF) is among the most demanding formats because it serves as both a data exchange format and a presentation format, and it has a particularly stringent tradition of supporting interoperability and consistent presentation. These requirements create complexity that gives adversaries an opportunity to encode a variety of exploits and attacks. To investigate whether structural malformations are associated with malice (using PDF files as the example challenge format), we built PolyDoc, a tool that conducts format-aware tracing of files pulled from the PolySwarm network. The PolySwarm network crowdsources threat intelligence by running files through several industry-scale threat-detection engines. It provides a PolyScore, which indicates whether a file is safe or malicious, as judged by those engines. We ran PolyDoc in a live hunt mode to gather PDF files submitted to PolySwarm and then traced the execution of these files through popular PDF tools such as Mutool, Poppler, and Caradoc. We collected and analyzed 58,906 files from PolySwarm. Further, we used the PDF Error Ontology to assign error categories based on tracer output and compared them to the PolyScore. Our work demonstrates three core insights. First, PDF files classified as malicious contain syntactic malformations. Second, “uncategorized” error ontology classes were common across our different PDF tools, which suggests that the PDF Error Ontology may be underspecified for the files that real-world threat engines receive. Finally, attackers leverage specific syntactic malformations in attacks: malformations that current PDF tools can detect.
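The comparison step described above (assigning an error category to tool output and setting it against the PolyScore) can be pictured with a minimal sketch. It is not PolyDoc's implementation. The regular-expression rules, the category names, the 0.5 cutoff, and the sample records are all hypothetical, and the sketch assumes a PolyScore in the range 0 to 1 where higher values mean more likely malicious.

```python
from collections import Counter
import re

# Hypothetical pattern-to-category rules. The real PDF Error Ontology
# defines its own classes, which are not reproduced here.
RULES = [
    (re.compile(r"xref", re.IGNORECASE), "xref_error"),
    (re.compile(r"stream", re.IGNORECASE), "stream_error"),
]

# Assumed cutoff for calling a file malicious; not taken from the paper.
MALICIOUS_THRESHOLD = 0.5


def categorize(message: str) -> str:
    """Return the first matching category, or 'uncategorized'."""
    for pattern, category in RULES:
        if pattern.search(message):
            return category
    return "uncategorized"


def cross_tabulate(records):
    """Count (category, verdict) pairs over all files.

    Each record is a (polyscore, messages) pair, where messages is the
    list of diagnostic lines one tool emitted for that file. A category
    is counted once per file, even if several messages map to it.
    """
    table = Counter()
    for polyscore, messages in records:
        verdict = "malicious" if polyscore >= MALICIOUS_THRESHOLD else "benign"
        for category in {categorize(m) for m in messages}:
            table[(category, verdict)] += 1
    return table


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        (0.9, ["cannot find xref table", "broken xref section"]),
        (0.1, ["stream length mismatch"]),
        (0.8, ["unexpected token"]),
    ]
    for key, count in sorted(cross_tabulate(sample).items()):
        print(key, count)
```

Messages that match no rule fall into "uncategorized", which mirrors the second insight: a large share of that bucket signals that the category rules, not the files, are underspecified.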