Strtune: Data Dependence-Based Code Slicing for Binary Similarity Detection With Fine-Tuned Representation

Impact factor 6.3 · Zone 1: Computer Science · JCR Q1: Computer Science, Theory & Methods
Kaiyan He;Yikun Hu;Xuehui Li;Yunhao Song;Yubo Zhao;Dawu Gu
Journal: IEEE Transactions on Information Forensics and Security, vol. 19, pp. 10233-10245
Published: 2024-11-12
DOI: 10.1109/TIFS.2024.3484944
URL: https://ieeexplore.ieee.org/document/10750885/

Abstract

Binary Code Similarity Detection (BCSD) is important for software security because, by comparing code patterns, it supports binary analysis tasks such as identifying malicious code snippets and analyzing binary patches. Recently, artificial intelligence-based approaches to BCSD have drawn growing attention for their scalability and generalization. Because binaries are compiled under different compilation configurations, however, existing approaches still face notable limitations when comparing binaries. First, BCSD requires analysis of code behavior; existing work claims to extract semantics but in practice still analyzes syntax. Second, because it extracts features directly from assembly sequences, existing work cannot handle the instruction reordering and differing syntactic expressions caused by various compilation configurations. In this paper, we propose STRTUNE, which slices binary code based on data dependence and performs slice-level fine-tuning. To address the first limitation, STRTUNE performs backward slicing based on data dependence to capture how a value is computed along the execution. Each slice reflects the collecting semantics of the code, which is stable across different compilation configurations. STRTUNE introduces flow types to emphasize the independence of computations between slices, forming a graph representation. To overcome the second limitation, STRTUNE takes pairs of slices that correspond to the same value computation but differ in syntactic representation and uses a Siamese network to fine-tune on them, bringing their representations closer in the feature space. This allows the cross-graph attention to focus more on matching similar slices based on slice contents and the flow types involved. Our evaluation results demonstrate the effectiveness and practicality of STRTUNE. We show that STRTUNE outperforms state-of-the-art methods for BCSD, achieving a Recall@1 that is 25.3% and 22.2% higher than jTrans and GMN, respectively, in the task of cross-optimization function retrieval on x64.
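To make the idea of backward slicing on data dependence concrete, here is a minimal sketch in Python. It is not STRTUNE's implementation: the tuple-based instruction format, the function name `backward_slice`, and the sample instructions are all hypothetical, and the sketch assumes straight-line code with no branches or memory aliasing. It only illustrates why a slice for a value can stay the same when a compiler reorders independent instructions.

```python
from typing import List, Set, Tuple

# Hypothetical toy instruction: (destination, operation, source operands).
# This stands in for lifted assembly; it is not a real disassembler format.
Instr = Tuple[str, str, Tuple[str, ...]]


def backward_slice(code: List[Instr], target: str) -> List[Instr]:
    """Return the instructions the final value of `target` depends on.

    Assumes straight-line code (no control flow), walked from last to first.
    """
    needed: Set[str] = {target}
    kept: List[Instr] = []
    for dest, op, srcs in reversed(code):
        if dest in needed:
            kept.append((dest, op, srcs))
            needed.discard(dest)  # this definition satisfies the demand
            needed.update(srcs)   # its operands are now demanded instead
    kept.reverse()                # restore original program order
    return kept


# Two orderings of the same computation, as different compilation
# configurations might emit them. The chains for `r` and `d` are independent.
order_a: List[Instr] = [
    ("a", "load", ("p",)),
    ("b", "load", ("q",)),
    ("c", "add", ("a", "k")),
    ("d", "mul", ("b", "b")),
    ("r", "shl", ("c", "k")),
]
order_b: List[Instr] = [
    ("b", "load", ("q",)),
    ("a", "load", ("p",)),
    ("d", "mul", ("b", "b")),
    ("c", "add", ("a", "k")),
    ("r", "shl", ("c", "k")),
]

slice_a = backward_slice(order_a, "r")
slice_b = backward_slice(order_b, "r")
print(slice_a == slice_b)  # the slice for `r` ignores the reordering
```

In this toy setting the two instruction sequences differ, but the slices for `r` are identical because only the data-dependence chain is kept. Differences in syntax within a chain (for example, `shl` versus an equivalent `mul`) are not handled by slicing alone; the abstract assigns that to the fine-tuning step.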
Source journal
IEEE Transactions on Information Forensics and Security (Engineering: Electrical & Electronic)
CiteScore: 14.40
Self-citation rate: 7.40%
Articles published: 234
Review time: 6.5 months
Journal scope: The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.