Binary2vec: Cross-architecture binary embeddings with global attention-enhanced graph neural networks

Impact factor: 4.5 · JCR Q2 · Computer Science, Theory & Methods

Array · Pub Date: 2025-08-18 · DOI: 10.1016/j.array.2025.100491

Zhenyu Gao, Lei Xiao, Wei Weng, Qizhen Xu, Baishun Zhou, Longquan Luo

Array, Volume 27, Article 100491. https://www.sciencedirect.com/science/article/pii/S2590005625001183

Citations: 0

Abstract

Binary analysis is crucial in software security, where it supports tasks such as software plagiarism detection and reverse engineering. However, existing methods either struggle to generalize across hardware architectures or fail to fully capture high-level program semantics. Moreover, in binary similarity analysis, some approaches yield both high precision and high mean squared error, which indicates that they assign high similarity scores to similar and dissimilar binaries alike. To address these challenges, we propose Binary2vec, a novel framework for constructing cross-architecture binary embeddings. First, Binary2vec leverages the LLVM intermediate representation to achieve cross-architecture compatibility. Then, it captures program semantics through a novel graph representation, A-PROGRAML. Finally, the A-PROGRAML graph is fed into GPS, a graph neural network with a global attention mechanism, to obtain the binary embeddings. To demonstrate its effectiveness, we evaluate Binary2vec on three binary analysis tasks: heterogeneous compute device mapping, optimal thread coarsening factor prediction, and similarity analysis. On heterogeneous compute device mapping and optimal thread coarsening factor prediction, Binary2vec outperforms NCC and IR2VEC on average. On similarity analysis, Binary2vec outperforms BinaryAI, the state-of-the-art method proposed by Tencent Security Keen Lab, in cross-architecture scenarios, and it works well even on binaries that contain only a small number of functions, where BinaryAI fails.
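
The remark about high precision combined with high mean squared error can be illustrated with a small numeric sketch. It is not the paper's evaluation code: the scores are invented, the helper names `precision_at_1` and `mean_squared_error` are hypothetical, and reading "precision" as top-1 retrieval precision is an assumption made only for this illustration. Under that reading, a model that ranks the true match first but scores every pair close to 1 keeps perfect precision while its error against the 0/1 labels stays large.

```python
# Illustrative only: the scores are made up, and treating "precision" as
# top-1 retrieval precision is an assumption, not the paper's definition.

def precision_at_1(scores, labels):
    """Fraction of queries whose highest-scoring candidate is a true match."""
    hits = 0
    for row_scores, row_labels in zip(scores, labels):
        best = max(range(len(row_scores)), key=row_scores.__getitem__)
        hits += row_labels[best]
    return hits / len(scores)


def mean_squared_error(scores, labels):
    """MSE between similarity scores and 0/1 ground-truth labels."""
    pairs = [
        (score, label)
        for row_scores, row_labels in zip(scores, labels)
        for score, label in zip(row_scores, row_labels)
    ]
    return sum((score - label) ** 2 for score, label in pairs) / len(pairs)


# Each row is one query binary compared against three candidates;
# exactly one candidate per row is a true match (label 1).
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Poorly separated scores: every pair scores high, the match only slightly higher.
compressed = [[0.95, 0.90, 0.90], [0.90, 0.95, 0.90], [0.90, 0.90, 0.95]]

# Well separated scores: matches score high, non-matches score low.
separated = [[0.95, 0.10, 0.10], [0.10, 0.95, 0.10], [0.10, 0.10, 0.95]]

for name, scores in [("compressed", compressed), ("separated", separated)]:
    p1 = precision_at_1(scores, labels)
    mse = mean_squared_error(scores, labels)
    print(f"{name}: precision@1={p1:.2f}, mse={mse:.4f}")
```

Both score matrices rank the correct candidate first, so the ranking metric cannot tell them apart; only the error against the labels exposes that the first model also rates dissimilar binaries as highly similar.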


Source journal