{"title":"PypiGuard: A novel meta-learning approach for enhanced malicious package detection in PyPI through static-dynamic feature fusion","authors":"Tahir Iqbal , Guowei Wu , Zahid Iqbal , Muhammad Bilal Mahmood , Amreen Shafique , Wenbo Guo","doi":"10.1016/j.jisa.2025.104032","DOIUrl":null,"url":null,"abstract":"<div><div>The increasing reliance on open-source software repositories, especially the Python Package Index (PyPi), has introduced serious security vulnerabilities as malicious actors embed malware into widely adopted packages, threatening the integrity of the software supply chain. Traditional detection methods, often based on static analysis, struggle to capture the complex and obfuscated behaviors characteristic of modern malware. Addressing these limitations, we present <strong>PypiGuard</strong>, an advanced hybrid ensemble meta-model for malicious package detection that integrates both static metadata and dynamic Application Programming Interface (API) call behaviors, enhancing detection accuracy and reducing error rates. Leveraging the <strong>MalwareBench</strong> dataset, our approach utilizes an innovative preprocessing pipeline that fuses metadata features with categorized API behaviors. The <strong>PypiGuard</strong> model employs a hybrid ensemble structure composed of Random Forest (RF), Gradient Boosting (GB), Decision Tree (DT), K-Nearest Neighbors (KNN), LightGBM, and an Artificial Neural Network (ANN), assembled through dynamically optimized stacking-based meta-learning framework that adapts to model-specific prediction strengths. Compared to Deep Learning (DL) baselines like Long-Short Term Memory (LSTM) and Convolutional Neural Network (CNN), <strong>PypiGuard</strong> achieves significant improvements in accuracy and False Positive Rate (FPR), with a detection accuracy of 98.43% and a markedly low FPR, confirming its enhanced effectiveness in accurately identifying malicious packages.</div></div>","PeriodicalId":48638,"journal":{"name":"Journal of Information Security and Applications","volume":"90 ","pages":"Article 104032"},"PeriodicalIF":3.8000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Security and Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214212625000705","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
The increasing reliance on open-source software repositories, especially the Python Package Index (PyPi), has introduced serious security vulnerabilities as malicious actors embed malware into widely adopted packages, threatening the integrity of the software supply chain. Traditional detection methods, often based on static analysis, struggle to capture the complex and obfuscated behaviors characteristic of modern malware. Addressing these limitations, we present PypiGuard, an advanced hybrid ensemble meta-model for malicious package detection that integrates both static metadata and dynamic Application Programming Interface (API) call behaviors, enhancing detection accuracy and reducing error rates. Leveraging the MalwareBench dataset, our approach utilizes an innovative preprocessing pipeline that fuses metadata features with categorized API behaviors. The PypiGuard model employs a hybrid ensemble structure composed of Random Forest (RF), Gradient Boosting (GB), Decision Tree (DT), K-Nearest Neighbors (KNN), LightGBM, and an Artificial Neural Network (ANN), assembled through dynamically optimized stacking-based meta-learning framework that adapts to model-specific prediction strengths. Compared to Deep Learning (DL) baselines like Long-Short Term Memory (LSTM) and Convolutional Neural Network (CNN), PypiGuard achieves significant improvements in accuracy and False Positive Rate (FPR), with a detection accuracy of 98.43% and a markedly low FPR, confirming its enhanced effectiveness in accurately identifying malicious packages.
期刊介绍:
Journal of Information Security and Applications (JISA) focuses on the original research and practice-driven applications with relevance to information security and applications. JISA provides a common linkage between a vibrant scientific and research community and industry professionals by offering a clear view on modern problems and challenges in information security, as well as identifying promising scientific and "best-practice" solutions. JISA issues offer a balance between original research work and innovative industrial approaches by internationally renowned information security experts and researchers.