Polymorphic type inference for machine code

Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
Pub Date: 2016-03-17
DOI: 10.1145/2908080.2908119

M. Noonan, Alexey Loginov, D. Cok


Citations: 29

Abstract

For many compiled languages, source-level types are erased very early in the compilation process. As a result, further compiler passes may convert type-safe source into type-unsafe machine code. Type-unsafe idioms in the original source and type-unsafe optimizations mean that type information in a stripped binary is essentially nonexistent. The problem of recovering high-level types by performing type inference over stripped machine code is called type reconstruction, and offers a useful capability in support of reverse engineering and decompilation. In this paper, we motivate and develop a novel type system and algorithm for machine-code type inference. The features of this type system were developed by surveying a wide collection of common source- and machine-code idioms, building a catalog of challenging cases for type reconstruction. We found that these idioms place a sophisticated set of requirements on the type system, inducing features such as recursively-constrained polymorphic types. Many of the features we identify are often seen only in expressive and powerful type systems used by high-level functional languages. Using these type-system features as a guideline, we have developed Retypd: a novel static type-inference algorithm for machine code that supports recursive types, polymorphism, and subtyping. Retypd yields more accurate inferred types than existing algorithms, while also enabling new capabilities such as reconstruction of pointer const annotations with 98% recall. Retypd can operate on weaker program representations than the current state of the art, removing the need for high-quality points-to information that may be impractical to compute.
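To make the idea of reconstructing pointer const annotations concrete, here is a minimal sketch in Python. It is not Retypd's algorithm: the toy three-operation IR, the `Instr` record, and the functions `pointer_capabilities` and `const_candidates` are all hypothetical names introduced for illustration. The sketch assumes straight-line code and a simplified rule: a pointer parameter is a const candidate if the function loads through it (or through a copy of it) but never stores through it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction of a hypothetical toy IR (not a real machine-code IR)."""
    op: str   # "copy": dst := src, "load": dst := *src, "store": *dst := src
    dst: str
    src: str


def pointer_capabilities(params, body):
    """Record whether each parameter is loaded from and/or stored through.

    Copies are tracked so that a use of a copied pointer counts as a use of
    the parameter it came from. Only straight-line code is handled.
    """
    alias = {p: p for p in params}      # variable -> parameter it aliases
    caps = {p: set() for p in params}   # parameter -> subset of {"load", "store"}
    for ins in body:
        if ins.op == "copy":
            if ins.src in alias:
                alias[ins.dst] = alias[ins.src]
            else:
                alias.pop(ins.dst, None)  # dst no longer aliases a parameter
        elif ins.op == "load":
            if ins.src in alias:
                caps[alias[ins.src]].add("load")
            alias.pop(ins.dst, None)      # dst now holds the loaded value
        elif ins.op == "store":
            if ins.dst in alias:
                caps[alias[ins.dst]].add("store")
    return caps


def const_candidates(params, body):
    """Parameters that are read through but never written through."""
    caps = pointer_capabilities(params, body)
    return {p for p, c in caps.items() if "load" in c and "store" not in c}


# A copy-one-word routine: t1 = *q_src; *q_dst = t1
body = [
    Instr("copy", "t0", "q_src"),
    Instr("load", "t1", "t0"),
    Instr("store", "q_dst", "t1"),
]
candidates = const_candidates(["q_dst", "q_src"], body)
```

In this example only `q_src` is reported as a const candidate, because `q_dst` is written through. A real analysis would also have to handle control flow, calls, and pointers that escape, which this sketch deliberately ignores.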


