Xingyu Gong, Yang Xu, Sicong Zhang, Chenhang He
Neural Networks, Volume 189, Article 107506. Published 2025-05-01. DOI: 10.1016/j.neunet.2025.107506
Impact factor: 6.0 (JCR Q1, Computer Science, Artificial Intelligence)
https://www.sciencedirect.com/science/article/pii/S0893608025003855
Ex2Vec: Enhancing assembly code semantics with end-to-end execution-aware embeddings
Binary code similarity detection (BCSD), whose goal is to identify and analyze similar or identical functions in compiled binaries, is an essential task in computer security. Recent methods that use deep neural networks (DNNs) to represent code as numerical vectors have achieved significant success. However, these methods primarily adapt techniques from masked language modeling (MLM): they encode instructions by predicting missing tokens from the surrounding instruction context, which limits their ability to fully capture execution semantics. In this paper, we propose Ex2Vec, an end-to-end encoding method that generates embeddings rich in execution semantics for BCSD. Ex2Vec employs a novel pre-training strategy that enables the model to learn the impact of assembly instructions on register states, reducing its reliance on the frequency and co-occurrence of instructions in the assembly context. By simulating the execution of assembly instructions, Ex2Vec captures the semantic features of assembly code; a Principal Component Analysis (PCA) of the embeddings further shows that functionally similar instructions cluster closely in the embedding space. Extensive experiments on large datasets show that Ex2Vec surpasses existing state-of-the-art methods in binary code similarity detection. In real-world vulnerability detection experiments, Ex2Vec achieves the highest accuracy among the compared methods.
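As a rough illustration of what "learning the impact of assembly instructions on register states" could look like as training data, the sketch below executes a toy instruction sequence and records the register state before and after each instruction. It is not the authors' implementation: the four-opcode instruction set, the 32-bit register width, the `(op, dst, src)` tuple encoding, and the helper names `step` and `make_triples` are all assumptions made for this example, since the abstract does not specify how Ex2Vec builds its pre-training targets.

```python
from typing import Dict, List, Tuple

Registers = Dict[str, int]
Instruction = Tuple[str, str, str]  # (opcode, destination, source), a hypothetical encoding
MASK = 0xFFFFFFFF  # assumption: 32-bit registers


def step(regs: Registers, instr: Instruction) -> Registers:
    """Apply one toy instruction and return the resulting register state."""
    op, dst, src = instr
    # The source operand is either a known register name or an integer literal.
    value = regs[src] if src in regs else int(src, 0)
    new = dict(regs)  # copy so the "before" state is left untouched
    if op == "mov":
        new[dst] = value & MASK
    elif op == "add":
        new[dst] = (regs[dst] + value) & MASK
    elif op == "sub":
        new[dst] = (regs[dst] - value) & MASK
    elif op == "xor":
        new[dst] = (regs[dst] ^ value) & MASK
    else:
        raise ValueError(f"unsupported opcode: {op}")
    return new


def make_triples(
    regs: Registers, program: List[Instruction]
) -> List[Tuple[Registers, Instruction, Registers]]:
    """Return (state_before, instruction, state_after) triples for a program."""
    triples = []
    for instr in program:
        after = step(regs, instr)
        triples.append((regs, instr, after))
        regs = after
    return triples


program = [("mov", "eax", "5"), ("add", "eax", "ebx"), ("xor", "ebx", "ebx")]
triples = make_triples({"eax": 0, "ebx": 7}, program)
final_state = triples[-1][2]
```

In a setup like this, a model would be asked to predict `state_after` from `state_before` and the instruction, so the supervision signal comes from what the instruction does rather than from which instructions tend to appear near it.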
About the journal:
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.