BinCoFer: Three-stage purification for effective C/C++ binary third-party library detection

Impact factor: 3.7 · Zone 2 (Computer Science) · JCR Q1: COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software Pub Date : 2025-05-10 DOI:10.1016/j.jss.2025.112480

Yayi Zou , Yixiang Zhang , Guanghao Zhao , Yueming Wu , Shuhao Shen , Cai Fu

Journal of Systems and Software, volume 229, Article 112480. https://www.sciencedirect.com/science/article/pii/S0164121225001487

Citations: 0

Abstract

Third-party libraries (TPLs) are increasingly popular because they make software development more efficient and concise. However, unregulated use of TPLs introduces legal and security issues. Consequently, several studies have attempted to detect the reuse of TPLs in target programs by constructing a feature repository. Most of these works require access to the source code of the TPLs, while the others suffer from redundancy in the repository, low detection efficiency, and difficulty detecting partially referenced third-party libraries.

Therefore, we introduce BinCoFer, a tool for detecting TPLs reused in binary programs. We build on work in binary code similarity detection (BCSD) to extract TPL features in binary form, which makes BinCoFer suitable for scenarios where the source code of TPLs is inaccessible. BinCoFer employs a novel three-stage purification strategy that mitigates feature repository redundancy by highlighting core functions and extracting function-level features, which makes it applicable to partial reuse of TPLs. We observe that directly applying a similarity threshold to decide whether one binary function reuses another is inaccurate, a problem that previous work has not addressed. We therefore design a method that uses weights to aggregate the similarities between functions in the target binary and core functions, and then uses the aggregate to judge the reuse situation with high frequency. To evaluate BinCoFer, we compiled a dataset on ArchLinux and conducted comparative experiments on it against the four most closely related works (i.e., ModX, B2SFinder, LibAM, and BinaryAI). The experimental results show that BinCoFer outperforms them by over 20.0% in precision and 7.0% in F1. As the data volume increases, the precision of BinCoFer remains stable and high. Moreover, BinCoFer greatly improves TPL detection efficiency, reducing time cost by up to 99.7% compared with ModX.
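The weighted aggregation described above can be illustrated with a minimal sketch. This is not the paper's formula: the abstract does not specify how weights are assigned or how the final decision is made, so the weighted mean of best-match similarities, the function names `aggregate_reuse_score` and `is_reused`, the sample weights, and the `decision_threshold` value are all assumptions chosen for illustration. The similarity scores are assumed to come from some BCSD model and are simply given as inputs here.

```python
from typing import Dict, List


def aggregate_reuse_score(
    similarities: Dict[str, List[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted mean of each core function's best match in the target binary.

    similarities: core function name -> similarity scores in [0, 1] against
                  the functions of the target binary (assumed to be produced
                  by a BCSD model; not computed here).
    weights:      core function name -> non-negative importance weight
                  (hypothetical; the abstract does not say how weights are set).
    """
    total_weight = sum(weights[name] for name in similarities)
    if total_weight == 0:
        return 0.0
    weighted_sum = 0.0
    for name, scores in similarities.items():
        best = max(scores, default=0.0)  # best match for this core function
        weighted_sum += weights[name] * best
    return weighted_sum / total_weight


def is_reused(
    similarities: Dict[str, List[float]],
    weights: Dict[str, float],
    decision_threshold: float = 0.6,  # assumed value, for illustration only
) -> bool:
    """Judge reuse on the aggregated score, not on any single function pair."""
    return aggregate_reuse_score(similarities, weights) >= decision_threshold


# Made-up scores for three hypothetical core functions of one TPL.
example_similarities = {
    "core_a": [0.95, 0.20],
    "core_b": [0.90],
    "core_c": [0.30, 0.10],
}
example_weights = {"core_a": 3.0, "core_b": 2.0, "core_c": 1.0}

score = aggregate_reuse_score(example_similarities, example_weights)
reused = is_reused(example_similarities, example_weights)
```

In this sketch a single weak match (`core_c`) does not flip the verdict on its own, because the heavily weighted core functions dominate the aggregate. That is the general motivation for aggregating instead of thresholding each function pair independently.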


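Source journal

Journal of Systems and Software (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 8.60

Self-citation rate: 5.70%

Publication volume: 193

Review time: 16 weeks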

Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:

• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes

The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.