In-memory fuzzing for binary code similarity analysis

2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). Pub Date: 2017-10-30. DOI: 10.1109/ASE.2017.8115645

Shuai Wang, Dinghao Wu


Citations: 53

Abstract

Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. To date, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture syntax-level or semantics-level program features for comparison. However, previous work has multiple design limitations that result in relatively high cost, low accuracy, and poor scalability, and thus severely impede its practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to analyze every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model on labeled feature vectors; later, given a feature vector obtained by comparing two functions, the trained model produces a final score representing the similarity of the two functions. We evaluate IMF-SIM against binaries produced with different compilers, optimizations, and commonly used obfuscation methods, over one thousand binary executables in total. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.
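
The trace comparison step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not IMF-SIM's implementation: the abstract only states that the score is computed from the longest common subsequence (LCS), so the normalization used here (LCS length divided by the longer trace's length), the trace kind names (`mem_reads`, `lib_calls`), and the function names are all assumptions made for the example. The machine learning model that turns the feature vector into a final score is omitted.

```python
from typing import Dict, Hashable, List, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence, using two-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trace_similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Similarity in [0, 1]. ASSUMPTION: normalize by the longer trace's length."""
    if not a and not b:
        return 1.0  # two empty traces are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))


def feature_vector(
    f_traces: Dict[str, Sequence[Hashable]],
    g_traces: Dict[str, Sequence[Hashable]],
    kinds: Sequence[str],
) -> List[float]:
    """One similarity score per behavior kind; a missing kind counts as an empty trace."""
    return [
        trace_similarity(f_traces.get(k, []), g_traces.get(k, []))
        for k in kinds
    ]


# Hypothetical behavior traces for two functions (made-up values).
KINDS = ["mem_reads", "lib_calls"]
f = {"mem_reads": [1, 2, 3, 4], "lib_calls": ["malloc", "memcpy", "free"]}
g = {"mem_reads": [1, 3, 4, 5], "lib_calls": ["malloc", "free"]}

vec = feature_vector(f, g, KINDS)
print(vec)  # one score per kind: 3/4 for mem_reads, 2/3 for lib_calls
```

In a pipeline like the one the abstract outlines, vectors of this shape, labeled as similar or dissimilar function pairs, would serve as training input for the classifier.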

