Corpus-wide Analysis of Parser Behaviors via a Format Analysis Workbench

2023 IEEE Security and Privacy Workshops (SPW). Pub Date: 2023-05-01. DOI: 10.1109/SPW59333.2023.00024

Pottayil Harisanker Menon, Walt Woods


Citations: 0

Abstract

As the number of parsers written for a data format grows, so does the number of interpretations of that format's specification. These interpretations often differ in subtle, hard-to-detect ways that can result in parser differentials, where one input passed to two parsing programs produces two semantically different behaviors. For example, two widely used HTTP parsers have been shown to process packet headers differently, allowing for the exfiltration of private files. To help find, diagnose, and mitigate the risks of parser differentials, we present the Format Analysis Workbench (FAW), a suite of tools for collecting information on large numbers of parser/input interactions and analyzing those interactions to detect and explain differentials. The tool suite supports any number of file formats through a flexible configuration, allows processing to be scaled horizontally, and can be run offline. Its uses to date include analyzing more than 1 million PDF files and unifying parser behaviors across those files to identify a gold standard of validity across multiple parsers. The included statistical tools have been used to identify root causes of parser rendering differentials, including mislabeled non-embedded fonts. Tools for instrumenting existing parsers, such as PolyTracker, are also included, allowing for the analysis of blind spots that might be used to craft differentials for other parsers or to exfiltrate large quantities of data. By allowing users to characterize parser behaviors at scale against large corpora of inputs, the FAW helps mitigate security risks arising from parser behaviors by making it tractable to trace examples of differentials back to their behavioral causes.
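To make the notion of a parser differential concrete, the following is a minimal sketch and not the FAW's implementation or API. It assumes a toy "Key: value" header format and two hypothetical parsers (`parse_first_wins`, `parse_last_wins`) that differ in how they treat duplicate keys and malformed lines. Each parser/input interaction is reduced to a comparable outcome, and any input on which the outcomes disagree is flagged.

```python
def parse_first_wins(text):
    """Hypothetical toy parser: keeps the first value of a duplicate key
    and rejects lines that lack a ':' separator."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {line!r}")
        result.setdefault(key.strip().lower(), value.strip())
    return result


def parse_last_wins(text):
    """Hypothetical toy parser: keeps the last value of a duplicate key
    and silently skips lines that lack a ':' separator."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        result[key.strip().lower()] = value.strip()
    return result


PARSERS = {"first_wins": parse_first_wins, "last_wins": parse_last_wins}


def observe(parser, text):
    """Reduce one parser/input interaction to a hashable outcome."""
    try:
        return ("ok", tuple(sorted(parser(text).items())))
    except ValueError as exc:
        return ("error", type(exc).__name__)


def find_differentials(corpus):
    """Return the inputs on which the parsers' outcomes disagree."""
    differentials = {}
    for name, text in corpus.items():
        outcomes = {p: observe(fn, text) for p, fn in PARSERS.items()}
        if len(set(outcomes.values())) > 1:
            differentials[name] = outcomes
    return differentials


# Illustrative three-input corpus; the names and contents are made up.
corpus = {
    "clean": "Host: example.test\nLength: 4",
    "duplicate": "Length: 4\nLength: 40",
    "malformed": "Host example.test",
}
diffs = find_differentials(corpus)
print(sorted(diffs))  # ['duplicate', 'malformed']
```

The "duplicate" input is the kind of case the abstract describes: both parsers accept it, yet they report different values for the same field. Storing the full per-parser outcome, rather than a pass/fail flag, is what allows a flagged input to be traced back to the behavior that caused the disagreement.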



