Yayi Zou , Yixiang Zhang , Guanghao Zhao , Yueming Wu , Shuhao Shen , Cai Fu
{"title":"BinCoFer: Three-stage purification for effective C/C++ binary third-party library detection","authors":"Yayi Zou , Yixiang Zhang , Guanghao Zhao , Yueming Wu , Shuhao Shen , Cai Fu","doi":"10.1016/j.jss.2025.112480","DOIUrl":null,"url":null,"abstract":"<div><div>Third-party libraries (TPL) are becoming increasingly popular to achieve efficient and concise software development. However, unregulated use of TPL will introduce legal and security issues in software development. Consequently, some studies have attempted to detect the reuse of TPLs in target programs by constructing a feature repository. Most of the works require access to the source code of TPLs, while the others suffer from redundancy in the repository, low detection efficiency, and difficulties in detecting partially referenced third-party libraries.</div><div>Therefore, we introduce BinCoFer, a tool designed for detecting TPLs reused in binary programs. We leverage the work of binary code similarity detection(BCSD) to extract binary-format TPL features, making it suitable for scenarios where the source code of TPLs is inaccessible. BinCoFer employs a novel three-stage purification strategy to mitigate feature repository redundancy by highlighting core functions and extracting function-level features, making it applicable to scenarios of partial reuse of TPLs. We have observed that directly using similarity threshold to determine the reuse between two binary functions is inaccurate, a problem that previous work has not addressed. Thus we design a method that uses weight to aggregate the similarity between functions in the target binary and core functions to ultimately judge the reuse situation with high frequency. To examine the ability of <em>BinCoFer</em>, we compiled a dataset on ArchLinux and conduct comparative experiments on it with other four most related works (<em>i.e., ModX</em>, <em>B2SFinder</em>, <em>LibAM</em> and <em>BinaryAI</em>). Through the experimental results, we find that <em>BinCoFer</em> outperforms them by over 20.0% in precision and 7.0% in F1. As the data volume increases, we observe the precision of BinCoFer tends to be stable and high. Moreover, <em>BinCoFer</em> greatly accelerates TPL detection efficiency which reduces the time cost of <em>ModX</em> by up to 99.7%.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"229 ","pages":"Article 112480"},"PeriodicalIF":3.7000,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121225001487","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0
Abstract
Third-party libraries (TPL) are becoming increasingly popular to achieve efficient and concise software development. However, unregulated use of TPL will introduce legal and security issues in software development. Consequently, some studies have attempted to detect the reuse of TPLs in target programs by constructing a feature repository. Most of the works require access to the source code of TPLs, while the others suffer from redundancy in the repository, low detection efficiency, and difficulties in detecting partially referenced third-party libraries.
Therefore, we introduce BinCoFer, a tool designed for detecting TPLs reused in binary programs. We leverage the work of binary code similarity detection(BCSD) to extract binary-format TPL features, making it suitable for scenarios where the source code of TPLs is inaccessible. BinCoFer employs a novel three-stage purification strategy to mitigate feature repository redundancy by highlighting core functions and extracting function-level features, making it applicable to scenarios of partial reuse of TPLs. We have observed that directly using similarity threshold to determine the reuse between two binary functions is inaccurate, a problem that previous work has not addressed. Thus we design a method that uses weight to aggregate the similarity between functions in the target binary and core functions to ultimately judge the reuse situation with high frequency. To examine the ability of BinCoFer, we compiled a dataset on ArchLinux and conduct comparative experiments on it with other four most related works (i.e., ModX, B2SFinder, LibAM and BinaryAI). Through the experimental results, we find that BinCoFer outperforms them by over 20.0% in precision and 7.0% in F1. As the data volume increases, we observe the precision of BinCoFer tends to be stable and high. Moreover, BinCoFer greatly accelerates TPL detection efficiency which reduces the time cost of ModX by up to 99.7%.
期刊介绍:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
•Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
•Agile, model-driven, service-oriented, open source and global software development
•Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
•Human factors and management concerns of software development
•Data management and big data issues of software systems
•Metrics and evaluation, data mining of software development resources
•Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.