Malware detection using Explainable ML models based on Feature Extraction using API calls

IF 2.6 · Zone 4 (Computer Science) · Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Big Data Pub Date : 2023-08-03 DOI:10.1109/icABCD59051.2023.10220515

Bhanu Prakash Reddy Banda, Bianca Govan, K. Roy, Kelvin S. Bryant


Citations: 0

Abstract

Malware attacks have become a critical problem in modern life. From 2015 to 2021, about 56.1 billion malware attacks took place worldwide. A malware attack typically costs a business over 2.5 million dollars to remediate. According to Cybersecurity Ventures, the cost of cybercrime will increase by 15% yearly over the next five years, reaching 10.5 trillion USD annually by 2025, up from 3 trillion USD in 2015. There is a global epidemic of malware, and studies suggest that its effects are worsening. Malware detectors are the main defense against malware. It is therefore crucial to research malware detection methods to better understand their advantages and disadvantages. This research focuses on an Application Programming Interface (API) call-based malware detection strategy that uses Machine Learning to further improve malware detection. The limitations that motivated this project were the lack of datasets containing recent malware samples and the lack of methods that detect malware with good accuracy. The main goals of this research are to understand malware behavior on the Windows platform, to use dynamic analysis to identify features that exhibit dangerous code patterns in malware samples, and to use a range of malware and benign samples to construct and validate machine learning-based malware detection models. The data was gathered from publicly accessible sites and sampled using a sandbox approach. Machine Learning models were built using the new dataset. Supervised learning models and deep learning models were applied to the dataset, and the results were compared and cross-checked to find the best-fitting model. This investigation demonstrated that it is possible to establish a high-precision malware detection capability by combining API calls with Machine Learning models. The strategy yielded malware detection accuracies of 88.26% (XGBoost) and 90.70% (MLP classifier) on Windows-based platforms. We used Explainable Machine Learning, namely the SHapley Additive exPlanations (SHAP) value method, to show which features are most responsible for the model's predictions.
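The abstract does not give implementation details, so the following is only an illustrative sketch of the two ideas it names: turning a sandbox API-call trace into a feature vector, and attributing a model's score to individual features with Shapley values. Everything in it is an assumption made for illustration: the `VOCAB` list, the hand-picked `WEIGHTS`, and the toy linear `score` function stand in for the authors' dataset and their XGBoost and MLP models. The SHAP library estimates these attributions efficiently for real models; here they are computed exactly by enumerating feature coalitions, which is only practical for a handful of features.

```python
from collections import Counter
from itertools import combinations
from math import factorial

# Hypothetical API-call vocabulary; the paper's actual feature set is not stated.
VOCAB = ["CreateFileW", "WriteProcessMemory", "RegSetValueExW", "CreateRemoteThread"]

# Hand-picked weights for a toy linear scorer (assumption, not a trained model).
WEIGHTS = [0.1, 1.5, 0.4, 2.0]


def extract_features(trace):
    """Count how often each vocabulary API appears in one sandbox trace."""
    counts = Counter(trace)
    return [counts[name] for name in VOCAB]


def score(x):
    """Toy stand-in for a classifier: higher means more malware-like."""
    return sum(w * v for w, v in zip(WEIGHTS, x))


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take the baseline value."""
    n = len(x)

    def value(coalition):
        masked = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(masked)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


# One made-up trace, as a sandbox might record it.
trace = ["CreateFileW", "WriteProcessMemory", "CreateRemoteThread", "WriteProcessMemory"]
x = extract_features(trace)
baseline = [0] * len(VOCAB)  # reference point: a sample that makes no listed calls
phi = shapley_values(score, x, baseline)

for name, contribution in zip(VOCAB, phi):
    print(f"{name}: {contribution:+.2f}")
```

For this toy trace, the attributions sum to the difference between the sample's score and the baseline score, which is the additivity property that SHAP explanations rely on. With a trained XGBoost or MLP model in place of `score`, the same question (which API calls drove this prediction?) would normally be answered with the SHAP library's approximations instead of full enumeration.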



Source journal

Big Data (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 9.10

Self-citation rate: 2.20%

Publication count

Journal description: Big Data is the leading peer-reviewed journal covering the challenges and opportunities in collecting, analyzing, and disseminating vast amounts of data. The Journal addresses questions surrounding this powerful and growing field of data science and facilitates the efforts of researchers, business managers, analysts, developers, data scientists, physicists, statisticians, infrastructure developers, academics, and policymakers to improve operations, profitability, and communications within their businesses and institutions. Spanning a broad array of disciplines focusing on novel big data technologies, policies, and innovations, the Journal brings together the community to address current challenges and enforce effective efforts to organize, store, disseminate, protect, manipulate, and, most importantly, find the most effective strategies to make this incredible amount of information work to benefit society, industry, academia, and government. Big Data coverage includes: big data industry standards; new technologies being developed specifically for big data; data acquisition, cleaning, distribution, and best practices; data protection, privacy, and policy; business interests from research to product; the changing role of business intelligence; visualization and design principles of big data infrastructures; physical interfaces and robotics; social networking advantages for Facebook, Twitter, Amazon, Google, etc.; and opportunities around big data and how companies can harness it to their advantage.