Ex2Vec: Enhancing assembly code semantics with end-to-end execution-aware embeddings

Impact factor: 6.0 | Region 1: Computer Science | JCR Q1: Computer Science, Artificial Intelligence
Xingyu Gong, Yang Xu, Sicong Zhang, Chenhang He
Journal: Neural Networks, volume 189, Article 107506
Publication date: 2025-05-01
Publication type: Journal Article
DOI: 10.1016/j.neunet.2025.107506
URL: https://www.sciencedirect.com/science/article/pii/S0893608025003855
Citations: 0

Abstract

Binary code similarity detection (BCSD), which aims to identify and analyze similar or identical functions in compiled binaries, is an essential task in computer security. Recent methods that use deep neural networks (DNNs) to represent code as numerical vectors have achieved significant success. However, these methods primarily adapt techniques from masked language modeling (MLM), encoding instructions by predicting missing values from the instruction context, which limits their ability to fully capture execution semantics. In this paper, we propose Ex2Vec, a novel end-to-end encoding method that generates high-quality embeddings rich in execution semantics for BCSD. Ex2Vec employs a pre-training strategy in which the model learns the impact of assembly instructions on register states, reducing its reliance on the frequency and co-occurrence of instructions in the assembly context. By simulating the execution of assembly instructions, Ex2Vec captures the semantic features of assembly code; a Principal Component Analysis (PCA) further shows that functionally similar instructions cluster closely in the embedding space. Extensive experiments on large datasets show that Ex2Vec surpasses all existing state-of-the-art methods in binary code similarity detection. In real-world vulnerability detection experiments, Ex2Vec achieves the highest accuracy.
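The pre-training signal described in the abstract, namely how an instruction changes register states, can be illustrated with a toy sketch. This is not the authors' implementation: the four-opcode instruction set, the 32-bit register width, and the names `step` and `trace` are assumptions made for illustration, and flags and memory are ignored.

```python
from typing import Dict, List, Tuple

Registers = Dict[str, int]
MASK = 0xFFFFFFFF  # assumption: 32-bit registers in this toy model


def step(regs: Registers, instr: str) -> Registers:
    """Return the register state after one toy two-operand instruction."""
    op, _, rest = instr.partition(" ")
    dst, src = [part.strip() for part in rest.split(",")]
    # The source operand is either a known register or an integer literal.
    value = regs[src] if src in regs else int(src, 0)
    out = dict(regs)  # copy so the input state is left unchanged
    if op == "mov":
        out[dst] = value & MASK
    elif op == "add":
        out[dst] = (regs[dst] + value) & MASK
    elif op == "sub":
        out[dst] = (regs[dst] - value) & MASK
    elif op == "xor":
        out[dst] = (regs[dst] ^ value) & MASK
    else:
        raise ValueError(f"unsupported opcode: {op}")
    return out


def trace(regs: Registers, program: List[str]) -> List[Tuple[str, Registers, Registers]]:
    """Collect (instruction, state_before, state_after) triples for a program."""
    samples = []
    for instr in program:
        after = step(regs, instr)
        samples.append((instr, regs, after))
        regs = after
    return samples


if __name__ == "__main__":
    start = {"eax": 5, "ebx": 7}
    # Two textually different instructions with the same register effect.
    print(step(start, "xor eax, eax") == step(start, "mov eax, 0"))
    for instr, before, after in trace(start, ["add eax, ebx", "xor ebx, ebx"]):
        print(instr, before, after)
```

In this toy model, `xor eax, eax` and `mov eax, 0` leave the registers in the same state even though they share no opcode. That is the kind of equivalence a state-based objective can expose and a purely co-occurrence-based objective may miss. Triples like those returned by `trace` are one plausible form such supervision could take.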
Source journal
Neural Networks
Category: Engineering and Technology, Computer Science: Artificial Intelligence
CiteScore: 13.90
Self-citation rate: 7.70%
Articles published: 425
Review time: 67 days
About the journal: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.