DiffBCE: Difference contrastive learning for binary code embeddings

IF 4.3 | Zone 2, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information and Software Technology Pub Date : 2025-06-26 DOI:10.1016/j.infsof.2025.107822

Yun Zhang , Ge Cheng

Information and Software Technology, Volume 187, Article 107822. https://www.sciencedirect.com/science/article/pii/S0950584925001612

Citations: 0

Abstract

Context:

Binary code embedding plays a crucial role in binary similarity detection and software security analysis. However, conventional methods often suffer from scalability issues and depend heavily on large amounts of labeled data, limiting their practical deployment in real-world scenarios.

Objectives:

This research introduces DiffBCE, a novel binary code embedding method based on differential contrastive learning. The primary goal is to overcome the limitations of existing approaches by reducing the reliance on labeled data while enhancing the robustness and semantic sensitivity of binary code representations.

Methods:

DiffBCE integrates two complementary data augmentation strategies – insensitive transformations (implemented via dropout) and sensitive transformations (using instruction replacement with a Masked Language Model) – within a contrastive learning framework. In addition, a conditional difference prediction module is introduced to capture subtle semantic changes by identifying differences between original and transformed binary code. The model is jointly trained with a combined loss function balancing contrastive loss and conditional difference prediction loss. Experimental validation is performed on multiple binary datasets across various scenarios, including cross-version analysis, cross-optimization-level evaluation, and code obfuscation difference analysis.
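The abstract does not give the loss formulas, so the following is only a minimal sketch of how such a joint objective could be assembled, assuming an in-batch InfoNCE contrastive term and a per-instruction binary cross-entropy term for difference prediction. The function names, the temperature, the weight `lam`, and the `propose` callback (a stand-in for the Masked Language Model) are all hypothetical and are not taken from the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: anchors[i] should match positives[i].

    In a dropout-based (insensitive) setup, anchors and positives would be
    two encoder passes over the same code with different dropout masks.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def sensitive_view(instructions, propose, mask_rate, rng):
    """Replace a random subset of instructions (sensitive transformation).

    `propose` is a hypothetical stand-in for MLM sampling. Returns the
    edited sequence plus 0/1 labels marking which positions changed.
    """
    edited, labels = [], []
    for ins in instructions:
        new = propose(ins) if rng.random() < mask_rate else ins
        edited.append(new)
        labels.append(1 if new != ins else 0)
    return edited, labels


def difference_loss(probs, replaced):
    """Binary cross-entropy over 'was this instruction replaced?' predictions."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, replaced):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(probs)


def combined_loss(anchors, positives, probs, replaced, lam=0.5):
    """Weighted sum of the two terms; `lam` is an assumed balancing weight."""
    return info_nce(anchors, positives) + lam * difference_loss(probs, replaced)


if __name__ == "__main__":
    rng = random.Random(0)
    code = ["mov eax, ebx", "add eax, 1", "ret"]
    edited, labels = sensitive_view(code, lambda ins: "nop", 0.5, rng)
    print(edited, labels)
    # Toy embeddings and predicted replacement probabilities.
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    positives = [[0.9, 0.1], [0.1, 0.9]]
    print(combined_loss(anchors, positives, [0.9, 0.1], [1, 0]))
```

In this sketch the contrastive term pulls the two dropout views of the same code together, while the difference term rewards a predictor for locating the replaced instructions, which is one plausible reading of how the model could be made sensitive to small semantic edits.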

Results:

Experimental evaluations demonstrate that DiffBCE significantly outperforms state-of-the-art methods (e.g., Asm2Vec, DeepBinDiff, PalmTree). Across three similarity detection scenarios, the method improves F1 scores by approximately 3.8%, 5.6%, and 11.1%, respectively, underscoring its robustness and effectiveness in handling complex binary code differences.

Conclusions:

DiffBCE offers a scalable and efficient solution for binary code embedding by effectively capturing rich semantic features without requiring extensive labeled data. Its superior performance in various testing scenarios suggests promising applications in vulnerability detection, code reuse analysis, reverse engineering, and automated patch generation, paving the way for enhanced software security assessments.




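Source journal

Information and Software Technology (Engineering and Technology - Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks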

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.