Finding the Dwarf: Recovering Precise Types from WebAssembly Binaries

Daniel Lehmann, Michael Pradel
{"title":"Finding the Dwarf: Recovering Precise Types from WebAssembly Binaries","authors":"Daniel Lehmann, Michael Pradel","doi":"10.1145/3519939.3523449","DOIUrl":null,"url":null,"abstract":"The increasing popularity of WebAssembly creates a demand for understanding and reverse engineering WebAssembly binaries. Recovering high-level function types is an important part of this process. One method to recover types is data-flow analysis, but it is complex to implement and may require manual heuristics when logical constraints fall short. In contrast, this paper presents SnowWhite, a learning-based approach for recovering precise, high-level parameter and return types for WebAssembly functions. It improves over prior work on learning-based type recovery by representing the types-to-predict in an expressive type language, which can describe a large number of complex types, instead of the fixed, and usually small type vocabulary used previously. Thus, recovery of a single type is no longer a classification task but sequence prediction, for which we build on the success of neural sequence-to-sequence models. We evaluate SnowWhite on a new, large-scale dataset of 6.3 million type samples extracted from 300,905 WebAssembly object files. The results show the type language is expressive, precisely describing 1,225 types instead the 7 to 35 types considered in previous learning-based approaches. Despite this expressiveness, our type recovery has high accuracy, exactly matching 44.5% (75.2%) of all parameter types and 57.7% (80.5%) of all return types within the top-1 (top-5) predictions.","PeriodicalId":140942,"journal":{"name":"Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation","volume":"2012 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3519939.3523449","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

The increasing popularity of WebAssembly creates a demand for understanding and reverse engineering WebAssembly binaries. Recovering high-level function types is an important part of this process. One method to recover types is data-flow analysis, but it is complex to implement and may require manual heuristics when logical constraints fall short. In contrast, this paper presents SnowWhite, a learning-based approach for recovering precise, high-level parameter and return types for WebAssembly functions. It improves over prior work on learning-based type recovery by representing the types-to-predict in an expressive type language, which can describe a large number of complex types, instead of the fixed, and usually small type vocabulary used previously. Thus, recovery of a single type is no longer a classification task but sequence prediction, for which we build on the success of neural sequence-to-sequence models. We evaluate SnowWhite on a new, large-scale dataset of 6.3 million type samples extracted from 300,905 WebAssembly object files. The results show the type language is expressive, precisely describing 1,225 types instead the 7 to 35 types considered in previous learning-based approaches. Despite this expressiveness, our type recovery has high accuracy, exactly matching 44.5% (75.2%) of all parameter types and 57.7% (80.5%) of all return types within the top-1 (top-5) predictions.
寻找矮人:从WebAssembly二进制文件中恢复精确类型
WebAssembly的日益普及产生了理解和逆向工程WebAssembly二进制文件的需求。恢复高级函数类型是这个过程的重要组成部分。恢复类型的一种方法是数据流分析,但是实现起来很复杂,并且当逻辑约束不足时可能需要手动启发式。相比之下,本文提出了SnowWhite,一种基于学习的方法,用于恢复WebAssembly函数的精确、高级参数和返回类型。它改进了先前基于学习的类型恢复工作,用表达型语言表示要预测的类型,这种语言可以描述大量复杂类型,而不是以前使用的固定且通常较小的类型词汇表。因此,单一类型的恢复不再是分类任务,而是序列预测,为此我们建立在神经序列到序列模型的成功基础上。我们在从300,905个WebAssembly对象文件中提取的630万个类型样本的新的大规模数据集上评估SnowWhite。结果表明,类型语言具有表现力,精确地描述了1225种类型,而不是以前基于学习的方法中考虑的7到35种类型。尽管具有这种表达性,我们的类型恢复具有很高的准确性,在top-1 (top-5)预测中,所有参数类型的精确匹配率为44.5%(75.2%),所有返回类型的精确匹配率为57.7%(80.5%)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信