Zhenyu Gao, Lei Xiao, Wei Weng, Qizhen Xu, Baishun Zhou, Longquan Luo
Array, vol. 27, Article 100491. Published 2025-08-18. DOI: 10.1016/j.array.2025.100491. URL: https://www.sciencedirect.com/science/article/pii/S2590005625001183
Binary2vec: Cross-architecture binary embeddings with global attention-enhanced graph neural networks
Binary analysis is crucial in software security, supporting tasks such as software plagiarism detection and reverse engineering. However, existing methods either struggle to generalize across hardware architectures or fail to fully capture high-level program semantics. Moreover, in binary similarity analysis, some approaches yield both high precision and high mean squared error, indicating that they assign high similarity scores to both similar and dissimilar binaries. To address these challenges, we propose Binary2vec, a novel framework for constructing cross-architecture binary embeddings. First, Binary2vec leverages the LLVM intermediate representation to achieve cross-architecture compatibility. Then, it captures program semantics through a novel graph representation, A-PROGRAML. Finally, the A-PROGRAML graph is fed into GPS, a graph neural network with a global attention mechanism, to obtain the binary embeddings. To demonstrate its effectiveness, we evaluate Binary2vec on three binary analysis tasks: heterogeneous compute device mapping, optimal thread coarsening factor prediction, and similarity analysis. On average, Binary2vec outperforms NCC and IR2VEC on heterogeneous compute device mapping and optimal thread coarsening factor prediction. In similarity analysis, Binary2vec outperforms BinaryAI, the state-of-the-art method proposed by Tencent Security Keen Lab, in cross-architecture scenarios, and it works well even on binaries with a small number of functions, where BinaryAI fails.
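The observation that a method can show high precision together with high mean squared error can be illustrated with a small worked example. The sketch below is not taken from the paper: the scores and labels are invented, the helper names are hypothetical, and "precision" is assumed here to mean precision@k over a ranked list of candidate pairs. It shows that a scorer that ranks pairs correctly but rates every pair as highly similar keeps perfect precision@k while its error against the 0/1 labels is large.

```python
# Illustrative only: all scores and labels are made up, and "precision"
# is interpreted as precision@k (an assumption, not the paper's definition).

def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored pairs that are truly similar."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k


def mean_squared_error(scores, labels):
    """Mean squared gap between predicted similarity and the 0/1 label."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


# 1 = similar pair, 0 = dissimilar pair (hypothetical ground truth).
labels = [1, 1, 1, 0, 0, 0]

# A scorer that ranks correctly but rates everything as highly similar.
compressed_scores = [0.99, 0.98, 0.97, 0.95, 0.94, 0.93]

# A scorer with the same ranking but well-separated scores.
separated_scores = [0.99, 0.98, 0.97, 0.05, 0.04, 0.03]

for name, scores in [("compressed", compressed_scores),
                     ("separated", separated_scores)]:
    p = precision_at_k(scores, labels, k=3)
    mse = mean_squared_error(scores, labels)
    print(f"{name}: precision@3={p:.2f}  MSE={mse:.4f}")
```

Both scorers retrieve the same top three pairs, so a ranking metric alone cannot tell them apart. Only the error against the labels exposes that the first scorer barely separates similar from dissimilar binaries.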