Binary2vec: Cross-architecture binary embeddings with global attention-enhanced graph neural networks
Zhenyu Gao, Lei Xiao, Wei Weng, Qizhen Xu, Baishun Zhou, Longquan Luo
Array, Volume 27, Article 100491. Published 2025-08-18. DOI: 10.1016/j.array.2025.100491
https://www.sciencedirect.com/science/article/pii/S2590005625001183
Citations: 0
Abstract
Binary analysis is crucial in the software security domain, supporting tasks such as software plagiarism detection and reverse engineering. However, existing methods either struggle to generalize across hardware architectures or fail to fully capture high-level program semantics. Moreover, in binary similarity analysis, some approaches yield both high precision and high mean squared error, indicating that they assign high similarity scores to similar and dissimilar binaries alike. To address these challenges, we propose Binary2vec, a novel framework for constructing cross-architecture binary embeddings. First, Binary2vec leverages the LLVM intermediate representation to achieve cross-architecture compatibility. Then, it captures program semantics through a novel graph representation, A-PROGRAML. Finally, the A-PROGRAML graph is fed into GPS, a graph neural network with a global attention mechanism, to obtain the binary embeddings. To demonstrate its effectiveness, we evaluate Binary2vec on three binary analysis tasks: heterogeneous compute device mapping, optimal thread coarsening factor prediction, and similarity analysis. On heterogeneous compute device mapping and optimal thread coarsening factor prediction, Binary2vec outperforms NCC and IR2VEC on average. In similarity analysis, Binary2vec outperforms BinaryAI, the state-of-the-art method proposed by Tencent Security Keen Lab, in cross-architecture scenarios, and it works well even on binaries with a small number of functions, where BinaryAI fails.
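The observation that a method can show both high precision and high mean squared error can be illustrated with a small sketch. It rests on two assumptions that the abstract does not state: precision is measured as a ranking metric (here precision@1), and mean squared error is computed between the similarity score and a 0/1 ground-truth label. The scores are invented for illustration and are not results from the paper.

```python
# Hypothetical similarity scores, NOT data from the paper.
# Each query has one truly similar candidate (label 1) and two
# dissimilar candidates (label 0). All scores are close to 1.
queries = [
    [(0.99, 1), (0.95, 0), (0.94, 0)],
    [(0.98, 1), (0.96, 0), (0.93, 0)],
]


def precision_at_1(queries):
    """Fraction of queries whose top-scored candidate is truly similar."""
    hits = 0
    for candidates in queries:
        _, top_label = max(candidates, key=lambda c: c[0])
        hits += top_label
    return hits / len(queries)


def mean_squared_error(queries):
    """Mean squared gap between each score and its 0/1 label."""
    pairs = [c for candidates in queries for c in candidates]
    return sum((score - label) ** 2 for score, label in pairs) / len(pairs)


p1 = precision_at_1(queries)
mse = mean_squared_error(queries)
print(f"precision@1 = {p1:.2f}, MSE = {mse:.3f}")
```

Under these assumptions the ranking is perfect, because the similar candidate always scores highest, yet the error is large, because dissimilar candidates also receive scores near 1. A ranking metric alone therefore cannot reveal whether the absolute scores separate similar from dissimilar binaries.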