Temporal Analysis of Distribution Shifts in Malware Classification for Digital Forensics

Francesco Zola, J. L. Bruse, M. Galar
{"title":"Temporal Analysis of Distribution Shifts in Malware Classification for Digital Forensics","authors":"Francesco Zola, J. L. Bruse, M. Galar","doi":"10.1109/EuroSPW59978.2023.00054","DOIUrl":null,"url":null,"abstract":"In recent years, malware diversity and complexity have increased substantially, so the detection and classification of malware families have become one of the key objectives of information security. Machine learning (ML)-based approaches have been proposed to tackle this problem. However, most of these approaches focus on achieving high classification performance scores in static scenarios, without taking into account a key feature of malware: it is constantly evolving. This leads to ML models being outdated and performing poorly after only a few months, leaving stakeholders exposed to potential security risks. With this work, our aim is to highlight the issues that may arise when applying ML-based classification to malware data. We propose a three-step approach to carry out a forensics exploration of model failures. In particular, in the first step, we evaluate and compare the concept drift generated by models trained using a rolling windows approach for selecting the training dataset. In the second step, we evaluate model drift based on the amount of temporal information used in the training dataset. Finally, we perform an in-depth misclassification and feature analysis to emphasize the interpretation of the results and to highlight drift causes. We conclude that caution is warranted when training ML models for malware analysis, as concept drift and clear performance drops were observed even for models trained on larger datasets. Based on our results, it may be more beneficial to train models on fewer but recent data and re-train them after a few months in order to maintain performance.","PeriodicalId":220415,"journal":{"name":"2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EuroSPW59978.2023.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In recent years, malware diversity and complexity have increased substantially, making the detection and classification of malware families a key objective of information security. Machine learning (ML)-based approaches have been proposed to tackle this problem. However, most of these approaches focus on achieving high classification scores in static scenarios, without taking into account a key feature of malware: it is constantly evolving. This leaves ML models outdated and performing poorly after only a few months, exposing stakeholders to potential security risks. With this work, we aim to highlight the issues that may arise when applying ML-based classification to malware data. We propose a three-step approach for carrying out a forensic exploration of model failures. In the first step, we evaluate and compare the concept drift exhibited by models whose training datasets are selected with a rolling-window approach. In the second step, we evaluate model drift as a function of the amount of temporal information used in the training dataset. Finally, we perform an in-depth misclassification and feature analysis to support the interpretation of the results and to highlight the causes of drift. We conclude that caution is warranted when training ML models for malware analysis, as concept drift and clear performance drops were observed even for models trained on larger datasets. Based on our results, it may be more beneficial to train models on less but more recent data and to retrain them after a few months in order to maintain performance.
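To make the first step of the approach concrete, below is a minimal sketch of a rolling-window drift evaluation: a classifier is trained on a fixed-size window of months and then scored on each subsequent month, so that performance decay with a growing train/test time gap surfaces concept drift. The column names (`month`, `family`), the `RandomForestClassifier`, and the macro-F1 metric are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a rolling-window drift evaluation; column names,
# model choice, and metric are assumptions made for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def rolling_window_drift(df, feature_cols, window_months=3, horizon_months=12):
    """Train on a sliding window of past months, then score each later
    month to observe how performance decays as the test data moves
    further away in time from the training window (concept drift)."""
    months = sorted(df["month"].unique())
    results = []
    for start in range(len(months) - window_months):
        train_months = months[start : start + window_months]
        train = df[df["month"].isin(train_months)]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(train[feature_cols], train["family"])
        # Evaluate on each subsequent month up to the chosen horizon;
        # gap_months measures the distance from the training window.
        test_months = months[start + window_months : start + window_months + horizon_months]
        for gap, test_month in enumerate(test_months, start=1):
            test = df[df["month"] == test_month]
            score = f1_score(
                test["family"], model.predict(test[feature_cols]), average="macro"
            )
            results.append(
                {"train_end": train_months[-1], "gap_months": gap, "macro_f1": score}
            )
    return pd.DataFrame(results)
```

The second step of the approach can be probed with the same loop by varying `window_months`, i.e., by growing or shrinking the amount of temporal information included in the training set and comparing the resulting drift curves.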