The hidden complexities of Android TPL detection: An empirical analysis of techniques, challenges, and effectiveness
Lige Zhan, Jiang Ming, Jianming Fu, Guojun Peng, Letian Sha, Lili Lan
Computers & Security, volume 159, Article 104672, published 2025-10-01
DOI: 10.1016/j.cose.2025.104672
https://www.sciencedirect.com/science/article/pii/S016740482500361X
Abstract
Third-party libraries (TPLs) play a crucial role in Android application (app) development and have become an indispensable part of the Android ecosystem. However, TPLs also introduce potential security risks, as they may propagate 1-day vulnerabilities or even malicious code into apps. Moreover, certain downstream tasks, such as app clone detection, license violation identification, and patch presence testing, require accurate TPL detection as a prerequisite. Consequently, TPL detection has gained increasing importance over the past decade in improving maintainability and enhancing security within the software supply chain. To remain robust against external factors and to identify vulnerabilities precisely, modern library detection tools must not only recognize the variety of TPLs but also be resilient to code obfuscation and optimization and be capable of accurately identifying library versions. Although recent studies have reported progress in addressing these issues, none has conducted a comprehensive evaluation to determine whether the proposed methods effectively overcome these challenges. Furthermore, critical aspects such as tool performance on real-world apps and the generalizability of existing approaches are frequently overlooked in current research.
To gain deeper insights into TPL detection research, we conduct a comprehensive empirical analysis of state-of-the-art approaches in this domain. The study begins by summarizing the common technologies used at each stage of the TPL detection process, followed by an analysis of the prevalence of code obfuscation and optimization in real-world apps to identify key external factors that hinder effective library detection. Next, we evaluate the performance of cutting-edge tools on multiple ground-truth datasets to validate our findings. Specifically, we systematically analyze the methodologies employed by these tools, assessing their capabilities in TPL variety detection, version identification, and resilience to common obfuscation and optimization techniques, as well as the underlying causes of their failures. Finally, we assess the generalizability of these tools by comparing their performance across diverse datasets and validating them with real-world data. Our findings confirm that obfuscation and optimization are indeed prevalent in real-world scenarios. However, the code transformations introduced by these techniques often exceed the scope of scenarios considered in prior TPL detection studies. We also observe that even the most advanced detection features struggle to accurately differentiate between library versions. In addition to errors caused by obfuscation and optimization, overly simplistic library features can further contribute to false positives. Moreover, while most tools perform well on their own curated datasets and show reduced performance on external datasets, their effectiveness in real-world scenarios does not differ substantially. Overall, this paper presents a comprehensive analysis and evaluation of current TPL detection techniques, providing a solid foundation for future research in this area.
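The distinction the abstract draws between TPL variety detection and version identification can be illustrated with a small scoring sketch. This is not the paper's evaluation code: the `Detection` record, the `prf` and `score` helpers, and the sample app, library, and version values are all hypothetical, chosen only to show how a tool can score perfectly at the library level while losing precision and recall once versions must also match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One reported or ground-truth library in an app (hypothetical record type)."""
    app: str
    library: str
    version: str


def prf(predicted: set, truth: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for two sets of hashable items."""
    tp = len(predicted & truth)  # items the tool reported that are also in the ground truth
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score(predicted: list[Detection], truth: list[Detection]) -> dict[str, tuple[float, float, float]]:
    """Score one tool at library level (name only) and version level (name plus version)."""
    lib_pred = {(d.app, d.library) for d in predicted}
    lib_true = {(d.app, d.library) for d in truth}
    ver_pred = {(d.app, d.library, d.version) for d in predicted}
    ver_true = {(d.app, d.library, d.version) for d in truth}
    return {"library": prf(lib_pred, lib_true), "version": prf(ver_pred, ver_true)}


# Illustrative data only: the tool finds both libraries but reports the wrong okhttp version.
truth = [Detection("app1", "okhttp", "3.12.0"), Detection("app1", "gson", "2.8.5")]
predicted = [Detection("app1", "okhttp", "3.11.0"), Detection("app1", "gson", "2.8.5")]
result = score(predicted, truth)
print(result)
```

In this made-up case the library-level scores are all 1.0, while the version-level precision, recall, and F1 each drop to 0.5, which mirrors the kind of gap the abstract describes between recognizing a library and pinning down its version.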