Interpretable Machine Learning for Mitigating Feature-Driven Attacks

Corey M. Hartman; Bhaskar P. Rimal
IEEE Transactions on Technology and Society, vol. 6, no. 2, pp. 220-230. Published 2025-02-03. DOI: 10.1109/TTS.2025.3531780
https://ieeexplore.ieee.org/document/10869832/

Abstract

Recent studies have found that 43% of malware infections begin as malicious Microsoft Office documents in the form of Word or Excel files. Many proposed techniques detect malicious documents effectively with machine learning (ML) algorithms, but two problems remain: bias in the datasets, and a lack of insight into why a document was flagged as malicious. As a result, a model may rely solely on one key feature for the prediction it makes. This work uses the SHAP (SHapley Additive exPlanations) algorithm together with an ensemble of ML models whose feature sets are grouped by SHAP magnitude: features that take over a model's decision-making process are split into their own feature set and used to train a separate ML model. A voting classifier built over these models reduces the bias and the reliance on a single feature or a select few. The result is a more robust model for predicting malicious Office documents, one that gives more insight into why a prediction was made and can tell the user when not enough data is present to predict with confidence. Using this technique, an ensemble soft voting classifier was created that obtained 90.1% accuracy on a balanced dataset of 250 malicious and 250 benign randomly selected Office documents. It presents the user with a simple natural-language statement giving the classification of a document and the reason it received that label.
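The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes SHAP values have already been computed elsewhere, and the feature names (`has_macro`, `ole_stream_count`, `file_size`), the 50% dominance rule, and the 0.6 confidence threshold are all hypothetical choices made for illustration. It shows one plausible way to split features by mean absolute SHAP magnitude, average the per-model probabilities (soft voting), abstain when confidence is low, and produce a plain-language statement.

```python
from statistics import fmean


def split_by_shap_magnitude(shap_rows, feature_names, dominance=0.5):
    """Split features into (dominant, remaining) groups.

    A feature is 'dominant' when its mean |SHAP| value accounts for at
    least `dominance` of the total. This rule is an assumption; the
    paper's exact grouping criterion is not given in the abstract.
    """
    mean_abs = [
        fmean(abs(row[j]) for row in shap_rows)
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs) or 1.0  # avoid division by zero
    dominant = [n for n, m in zip(feature_names, mean_abs) if m / total >= dominance]
    remaining = [n for n in feature_names if n not in dominant]
    return dominant, remaining


def soft_vote(malicious_probs, min_confidence=0.6):
    """Average each model's P(malicious); abstain if not confident enough."""
    p = fmean(malicious_probs)
    if max(p, 1.0 - p) < min_confidence:
        return "uncertain", p
    return ("malicious" if p >= 0.5 else "benign"), p


def explain(label, p, top_features):
    """Build a simple natural-language statement for the user."""
    if label == "uncertain":
        return (
            "Not enough evidence to classify this document with confidence "
            f"(P(malicious) = {p:.2f})."
        )
    return (
        f"This document was classified as {label} "
        f"(P(malicious) = {p:.2f}), mainly because of: {', '.join(top_features)}."
    )


# Hypothetical precomputed SHAP values: one row per document, one column per feature.
names = ["has_macro", "ole_stream_count", "file_size"]
shap_rows = [
    [0.90, 0.05, 0.05],
    [-0.80, 0.10, -0.10],
    [0.70, -0.05, 0.05],
]
dominant, remaining = split_by_shap_magnitude(shap_rows, names)

# Pretend one model was trained on each feature group and each returned
# a probability that a given document is malicious.
label, p = soft_vote([0.9, 0.7])
print(explain(label, p, dominant))
```

In a real system, the two probabilities passed to `soft_vote` would come from separate classifiers trained on the `dominant` and `remaining` feature sets, so that no single feature can decide the outcome alone.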