PypiGuard: A novel meta-learning approach for enhanced malicious package detection in PyPI through static-dynamic feature fusion

Impact Factor 3.8; Zone 2 (Computer Science); JCR Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Journal of Information Security and Applications. Pub Date: 2025-03-19. DOI: 10.1016/j.jisa.2025.104032

Tahir Iqbal, Guowei Wu, Zahid Iqbal, Muhammad Bilal Mahmood, Amreen Shafique, Wenbo Guo

Volume 90, Article 104032. Full text: https://www.sciencedirect.com/science/article/pii/S2214212625000705


Abstract

The increasing reliance on open-source software repositories, especially the Python Package Index (PyPI), has introduced serious security vulnerabilities as malicious actors embed malware into widely adopted packages, threatening the integrity of the software supply chain. Traditional detection methods, often based on static analysis, struggle to capture the complex and obfuscated behaviors characteristic of modern malware. Addressing these limitations, we present PypiGuard, an advanced hybrid ensemble meta-model for malicious package detection that integrates both static metadata and dynamic Application Programming Interface (API) call behaviors, enhancing detection accuracy and reducing error rates. Leveraging the MalwareBench dataset, our approach utilizes an innovative preprocessing pipeline that fuses metadata features with categorized API behaviors. The PypiGuard model employs a hybrid ensemble structure composed of Random Forest (RF), Gradient Boosting (GB), Decision Tree (DT), K-Nearest Neighbors (KNN), LightGBM, and an Artificial Neural Network (ANN), assembled through a dynamically optimized stacking-based meta-learning framework that adapts to model-specific prediction strengths. Compared to Deep Learning (DL) baselines such as Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), PypiGuard achieves significant improvements in accuracy and False Positive Rate (FPR), with a detection accuracy of 98.43% and a markedly low FPR, confirming its enhanced effectiveness in accurately identifying malicious packages.
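To make the abstract's two most concrete ideas tangible, the following is a minimal sketch, not the authors' pipeline. It shows one plausible way to fuse static metadata with counts of categorized API calls into a single feature vector, and how accuracy and FPR are conventionally computed as (TP + TN) / (TP + TN + FP + FN) and FP / (FP + TN). The field names, API categories, and call-to-category mapping are all hypothetical; the abstract does not specify them.

```python
from collections import Counter

# Hypothetical API-call categories; the paper's actual taxonomy is not
# given in the abstract.
API_CATEGORIES = ("network", "filesystem", "process", "crypto")

# Hypothetical mapping from observed call names to categories.
CALL_TO_CATEGORY = {
    "socket.connect": "network",
    "urllib.request.urlopen": "network",
    "open": "filesystem",
    "os.remove": "filesystem",
    "subprocess.Popen": "process",
    "os.system": "process",
    "base64.b64decode": "crypto",
}

# Hypothetical static metadata fields.
METADATA_FIELDS = ("num_files", "has_install_script", "description_length")


def fuse_features(metadata, api_calls):
    """Concatenate static metadata values with per-category API-call counts."""
    static_part = [float(metadata.get(field, 0)) for field in METADATA_FIELDS]
    # Calls that are not in the mapping are ignored.
    counts = Counter(
        CALL_TO_CATEGORY[call] for call in api_calls if call in CALL_TO_CATEGORY
    )
    dynamic_part = [float(counts[category]) for category in API_CATEGORIES]
    return static_part + dynamic_part


def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate); label 1 = malicious, 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# One made-up package: three metadata values followed by four category counts.
example_vector = fuse_features(
    {"num_files": 12, "has_install_script": True, "description_length": 40},
    ["os.system", "socket.connect", "socket.connect", "unknown.call"],
)
```

In a stacking setup such as the one the abstract describes, vectors like `example_vector` would be the input to each base learner, and the base learners' predictions would in turn be the input to the meta-learner. FPR is reported alongside accuracy because a detector that flags many benign packages is costly even when its overall accuracy is high.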




Source journal

Journal of Information Security and Applications (Computer Science: Computer Networks and Communications)

CiteScore: 10.90
Self-citation rate: 5.40%
Articles published: 206
Review time: 56 days

About the journal: Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security. JISA links the scientific and research community with industry professionals by offering a clear view of modern problems and challenges in information security and by identifying promising scientific and "best-practice" solutions. JISA issues balance original research work with innovative industrial approaches from internationally renowned information security experts and researchers.