Using recurrent neural networks for decompilation

2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER) Pub Date : 2018-03-01 DOI:10.1109/SANER.2018.8330222

Deborah S. Katz, Jason Ruchti, Eric Schulte

{"title":"Using recurrent neural networks for decompilation","authors":"Deborah S. Katz, Jason Ruchti, Eric Schulte","doi":"10.1109/SANER.2018.8330222","DOIUrl":null,"url":null,"abstract":"Decompilation, recovering source code from binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. Source code is much easier for humans to read than binary code, and there are many tools available to analyze source code. Existing decompilation techniques often generate source code that is difficult for humans to understand because the generated code often does not use the coding idioms that programmers use. Differences from human-written code also reduce the effectiveness of analysis tools on the decompiled source code. To address the problem of differences between decompiled code and human-written code, we present a novel technique for decompiling binary code snippets using a model based on Recurrent Neural Networks. The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language and its properties or how a compiler operates, making the approach easily extensible to new languages and constructs. Furthermore, the technique can be extended and applied in situations to which traditional decompilers are not targeted, such as for decompilation of isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing for readability of decompilation; and recovering control flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close and can provide a useful picture of the snippet's behavior.","PeriodicalId":6602,"journal":{"name":"2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)","volume":"22 1","pages":"346-356"},"PeriodicalIF":0.0000,"publicationDate":"2018-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"51","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SANER.2018.8330222","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 51

Abstract

Decompilation, recovering source code from binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. Source code is much easier for humans to read than binary code, and there are many tools available to analyze source code. Existing decompilation techniques often generate source code that is difficult for humans to understand because the generated code often does not use the coding idioms that programmers use. Differences from human-written code also reduce the effectiveness of analysis tools on the decompiled source code. To address the problem of differences between decompiled code and human-written code, we present a novel technique for decompiling binary code snippets using a model based on Recurrent Neural Networks. The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language and its properties or how a compiler operates, making the approach easily extensible to new languages and constructs. Furthermore, the technique can be extended and applied in situations to which traditional decompilers are not targeted, such as for decompilation of isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing for readability of decompilation; and recovering control flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close and can provide a useful picture of the snippet's behavior.

查看原文本刊更多论文

使用递归神经网络进行反编译

反编译，即从二进制文件中恢复源代码，在许多需要分析或理解源代码不可用的软件的情况下非常有用。对于人类来说，源代码比二进制代码更容易阅读，并且有许多工具可用于分析源代码。现有的反编译技术经常生成人类难以理解的源代码，因为生成的代码通常不使用程序员使用的编码习惯。与人类编写的代码的差异也降低了分析工具对反编译源代码的有效性。为了解决反编译代码和人工编写代码之间的差异问题，我们提出了一种基于递归神经网络模型的二进制代码片段反编译新技术。模型学习源代码中出现的属性和模式，并使用它们生成反编译输出。我们在从C源代码编译的二进制机器码片段上训练和评估我们的技术。我们在本文中概述的一般方法不是特定于语言的，并且需要很少或不需要语言及其属性或编译器如何操作的领域知识，这使得该方法很容易扩展到新的语言和结构。此外，该技术可以扩展和应用于传统反编译器不针对的情况，例如反编译孤立的二进制片段;快速，按需反编译;特定领域学习反编译;反编译可读性的优化;以及恢复控制流结构、注释和变量或函数名。我们展示了由这种技术产生的翻译通常是准确的或接近的，并且可以提供一个有用的片段行为的图片。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)

自引率

0.00%

发文量