On the Feasibility of Supervised Machine Learning for the Detection of Malicious Software Packages

Proceedings of the 17th International Conference on Availability, Reliability and Security. Pub Date: 2022-08-23. DOI: 10.1145/3538969.3544415

Marc Ohm, Felix Boes, Christian Bungartz, M. Meier


Citations: 11

Abstract

Modern software development relies heavily on a multitude of externally developed, often open-source, components that together constitute a so-called software supply chain. Over the last few years, a rise in trojanized (i.e., maliciously manipulated) software packages has been observed and addressed in multiple academic publications. A central issue is the timely detection of such malicious packages, for which single heuristic- or machine-learning-based approaches have typically been chosen. In particular, the general suitability of supervised machine learning has not yet been fully covered. To gain insight, we analyze a diverse set of commonly employed supervised machine learning techniques, both quantitatively and qualitatively. More precisely, we leverage a labeled dataset of known malicious software packages on which we measure the performance of each technique. This is followed by an in-depth analysis of the three best-performing classifiers on unlabeled data, i.e., the whole npm package repository. Our combination of multiple classifiers indicates good viability of supervised machine learning for the detection of malicious packages by pre-selecting a feasible number of suspicious packages for further manual analysis. This research effort includes the evaluation of over 25,210 different models, which led to True Positive Rates of over 70% and the detection and reporting of 13 previously unknown malicious packages.
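To make the two quantities in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It computes a True Positive Rate from labeled predictions and pre-selects packages that several classifiers agree on. The abstract does not state how the classifiers are combined, so the majority-vote rule, the `min_votes` threshold, the function names, and the package names are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return TP / (TP + FN), where 1 = malicious and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no malicious samples")
    return tp / (tp + fn)


def flag_suspicious(votes: Dict[str, Sequence[int]], min_votes: int = 2) -> List[str]:
    """Return the packages flagged as malicious by at least `min_votes` classifiers.

    Assumption: a simple vote count stands in for the paper's (unspecified)
    combination of its three best-performing classifiers.
    """
    return sorted(name for name, v in votes.items() if sum(v) >= min_votes)


# Hypothetical labeled data: three malicious and two benign packages.
labels = [1, 1, 1, 0, 0]
predictions = [1, 1, 0, 0, 1]
tpr = true_positive_rate(labels, predictions)

# Hypothetical per-package votes from three classifiers on unlabeled data.
candidates = flag_suspicious({
    "pkg-a": [1, 1, 0],
    "pkg-b": [0, 0, 1],
    "pkg-c": [1, 1, 1],
})
```

Raising `min_votes` shrinks the list handed to manual analysis at the cost of possibly missing packages that only one classifier catches, which is the trade-off behind pre-selecting a feasible number of suspicious packages.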


