Binary2vec: Cross-architecture binary embeddings with global attention-enhanced graph neural networks

Impact factor: 4.5 (JCR Q2, Computer Science, Theory & Methods)
Journal: Array. Publication date: 2025-08-18. DOI: 10.1016/j.array.2025.100491
Zhenyu Gao, Lei Xiao, Wei Weng, Qizhen Xu, Baishun Zhou, Longquan Luo
Array, volume 27, Article 100491 (journal article). Full text: https://www.sciencedirect.com/science/article/pii/S2590005625001183
Citation count: 0

Abstract

Binary analysis is crucial in the software security domain, supporting tasks such as software plagiarism detection and reverse engineering. However, existing methods either struggle to generalize across hardware architectures or fail to fully capture high-level program semantics. Moreover, in binary similarity analysis, some approaches yield both high precision and high mean squared error, indicating that they assign high similarity scores to both similar and dissimilar binaries. To address these challenges, we propose Binary2vec, a novel framework for constructing cross-architecture binary embeddings. First, Binary2vec leverages the LLVM intermediate representation to achieve cross-architecture compatibility. Then, Binary2vec captures program semantics through a novel graph representation, A-PROGRAML. Finally, the A-PROGRAML graph is fed into GPS, a graph neural network with a global attention mechanism, to obtain the binary embeddings. To demonstrate its effectiveness, we evaluate Binary2vec on three binary analysis tasks: heterogeneous compute device mapping, optimal thread coarsening factor prediction, and similarity analysis. In heterogeneous compute device mapping and optimal thread coarsening factor prediction, Binary2vec performs better on average than NCC and IR2VEC. In similarity analysis, Binary2vec outperforms BinaryAI, the state-of-the-art method proposed by Tencent Security Keen Lab, in cross-architecture scenarios, and it works well even on binaries with a small number of functions, where BinaryAI fails.
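
The abstract's remark that a method can show both high precision and high mean squared error can be illustrated with a small numeric sketch. Everything in it is an assumption made for illustration: the scores, the labels, the 0.95 decision threshold, and the helper names `precision` and `mean_squared_error` are hypothetical and are not taken from the paper, whose exact metric definitions the abstract does not give. The sketch compares a scorer that compresses all similarity scores into a narrow high band against one that separates similar from dissimilar pairs.

```python
# Illustrative only: scores, labels, and the threshold are made-up values,
# not results or definitions from the Binary2vec paper.

def precision(scores, labels, threshold):
    """Fraction of pairs scored at or above the threshold that are truly similar."""
    predicted_positive = [s >= threshold for s in scores]
    true_pos = sum(1 for p, y in zip(predicted_positive, labels) if p and y == 1)
    false_pos = sum(1 for p, y in zip(predicted_positive, labels) if p and y == 0)
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def mean_squared_error(scores, labels):
    """Average squared gap between a similarity score and its 0/1 label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


# 1 = similar pair, 0 = dissimilar pair (hypothetical ground truth).
labels = [1, 1, 1, 0, 0, 0]

# Scorer A ranks pairs correctly but gives every pair a high score.
compressed = [0.99, 0.98, 0.97, 0.93, 0.92, 0.91]

# Scorer B ranks pairs correctly and keeps dissimilar pairs near zero.
separated = [0.99, 0.98, 0.97, 0.07, 0.05, 0.03]

THRESHOLD = 0.95  # assumed decision threshold

for name, scores in (("compressed", compressed), ("separated", separated)):
    p = precision(scores, labels, THRESHOLD)
    mse = mean_squared_error(scores, labels)
    print(f"{name}: precision={p:.2f}, mse={mse:.4f}")
```

Under these assumed values both scorers reach the same precision at the chosen threshold, yet the compressed scorer's mean squared error is far larger because its scores for dissimilar pairs sit close to 1. This is the failure pattern the abstract describes: precision alone does not reveal that dissimilar binaries are also being assigned high similarity scores.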
Source journal: Array (Computer Science, General Computer Science). CiteScore: 4.40. Self-citation rate: 0.00%. Articles published: 93. Review time: 45 days.