A Transformer-based Function Symbol Name Inference Model from an Assembly Language for Binary Reversing
Hyunjin Kim, Jinyeong Bak, Kyunghyun Cho, Hyungjoon Koo
Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security, July 2023. DOI: 10.1145/3579856.3582823
Abstract
Reverse engineering a stripped binary has a wide range of applications, yet it is challenging mainly due to the lack of contextually useful information within. Once debugging symbols (e.g., variable names, types, function names) are discarded, recovering such information is not technically viable with traditional approaches such as static or dynamic binary analysis. We focus on function symbol name recovery, which allows a reverse engineer to gain a quick overview of an unseen binary. The key insight is that a well-developed program labels each function with a meaningful name that describes its underlying semantics. In this paper, we present AsmDepictor, a Transformer-based framework that generates a function symbol name from a set of assembly code (i.e., machine instructions) and consists of three major components: binary code refinement, model training, and inference. To this end, we conduct systematic experiments on the effectiveness of code refinement, which can enhance overall performance. We introduce per-layer positional embedding and Unique-softmax for AsmDepictor so that both can help capture better relationships between tokens. Lastly, we devise a novel evaluation metric tailored to short description lengths, the Jaccard* score. Our empirical evaluation shows that AsmDepictor surpasses state-of-the-art models by up to around 400%. The best AsmDepictor model achieves an F1 of 71.5 and a Jaccard* of 75.4.
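The abstract introduces a Jaccard*-based metric for evaluating short predicted function names but does not define it here. As a rough illustration only, the sketch below computes a plain token-level Jaccard similarity between a predicted and a reference symbol name; the tokenization scheme, function names, and example inputs are assumptions for illustration, and the paper's Jaccard* variant may differ from this baseline.

```python
# Minimal sketch: standard token-level Jaccard similarity between a predicted
# and a reference function symbol name. This is NOT the paper's Jaccard* score,
# whose exact formula is not given in the abstract; it is a plain baseline
# shown only to illustrate the idea of set overlap between name tokens.

def tokenize(name: str) -> set[str]:
    """Split a symbol name into lower-cased word tokens,
    e.g. 'parse_http_header' -> {'parse', 'http', 'header'}.
    (Assumed tokenization; the paper may split names differently.)"""
    return {tok for tok in name.lower().replace("_", " ").split() if tok}


def jaccard(predicted: str, reference: str) -> float:
    """|intersection| / |union| of the two token sets (1.0 if both are empty)."""
    p, r = tokenize(predicted), tokenize(reference)
    if not p and not r:
        return 1.0
    return len(p & r) / len(p | r)


if __name__ == "__main__":
    # {'parse','http','header'} vs {'http','header','parser'}:
    # intersection = 2 tokens, union = 4 tokens -> 0.5
    print(jaccard("parse_http_header", "http_header_parser"))
```

A metric of this family rewards partial matches (e.g., recovering two of three words in a name), which is why the authors argue it suits very short descriptions better than exact-match accuracy.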