Temporal Analysis of Distribution Shifts in Malware Classification for Digital Forensics

Francesco Zola, J. L. Bruse, M. Galar
{"title":"Temporal Analysis of Distribution Shifts in Malware Classification for Digital Forensics","authors":"Francesco Zola, J. L. Bruse, M. Galar","doi":"10.1109/EuroSPW59978.2023.00054","DOIUrl":null,"url":null,"abstract":"In recent years, malware diversity and complexity have increased substantially, so the detection and classification of malware families have become one of the key objectives of information security. Machine learning (ML)-based approaches have been proposed to tackle this problem. However, most of these approaches focus on achieving high classification performance scores in static scenarios, without taking into account a key feature of malware: it is constantly evolving. This leads to ML models being outdated and performing poorly after only a few months, leaving stakeholders exposed to potential security risks. With this work, our aim is to highlight the issues that may arise when applying ML-based classification to malware data. We propose a three-step approach to carry out a forensics exploration of model failures. In particular, in the first step, we evaluate and compare the concept drift generated by models trained using a rolling windows approach for selecting the training dataset. In the second step, we evaluate model drift based on the amount of temporal information used in the training dataset. Finally, we perform an in-depth misclassification and feature analysis to emphasize the interpretation of the results and to highlight drift causes. We conclude that caution is warranted when training ML models for malware analysis, as concept drift and clear performance drops were observed even for models trained on larger datasets. Based on our results, it may be more beneficial to train models on fewer but recent data and re-train them after a few months in order to maintain performance.","PeriodicalId":220415,"journal":{"name":"2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EuroSPW59978.2023.00054","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In recent years, malware diversity and complexity have increased substantially, making the detection and classification of malware families a key objective of information security. Machine learning (ML)-based approaches have been proposed to tackle this problem. However, most of these approaches focus on achieving high classification scores in static scenarios, without taking into account a key feature of malware: it is constantly evolving. This leaves ML models outdated and performing poorly after only a few months, exposing stakeholders to potential security risks. With this work, we aim to highlight the issues that may arise when applying ML-based classification to malware data. We propose a three-step approach for carrying out a forensic exploration of model failures. In the first step, we evaluate and compare the concept drift exhibited by models whose training datasets are selected with a rolling-window approach. In the second step, we evaluate model drift as a function of the amount of temporal information used in the training dataset. Finally, we perform an in-depth misclassification and feature analysis to support the interpretation of the results and to highlight the causes of drift. We conclude that caution is warranted when training ML models for malware analysis, as concept drift and clear performance drops were observed even for models trained on larger datasets. Based on our results, it may be more beneficial to train models on less but more recent data and to retrain them after a few months in order to maintain performance.
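To make the first step of the approach concrete, below is a minimal sketch of a rolling-window drift evaluation: a classifier is trained on a fixed-size window of months and then scored on each subsequent month, so that performance decay with a growing train/test time gap surfaces concept drift. The column names (`month`, `family`), the `RandomForestClassifier`, and the macro-F1 metric are illustrative assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of a rolling-window drift evaluation; column names,
# model choice, and metric are assumptions made for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def rolling_window_drift(df, feature_cols, window_months=3, horizon_months=12):
    """Train on a sliding window of past months, then score each later
    month to observe how performance decays as the test data moves
    further away in time from the training window (concept drift)."""
    months = sorted(df["month"].unique())
    results = []
    for start in range(len(months) - window_months):
        train_months = months[start : start + window_months]
        train = df[df["month"].isin(train_months)]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(train[feature_cols], train["family"])
        # Evaluate on each subsequent month up to the chosen horizon;
        # gap_months measures the distance from the training window.
        test_months = months[start + window_months : start + window_months + horizon_months]
        for gap, test_month in enumerate(test_months, start=1):
            test = df[df["month"] == test_month]
            score = f1_score(
                test["family"], model.predict(test[feature_cols]), average="macro"
            )
            results.append(
                {"train_end": train_months[-1], "gap_months": gap, "macro_f1": score}
            )
    return pd.DataFrame(results)
```

The second step of the approach can be probed with the same loop by varying `window_months`, i.e., by growing or shrinking the amount of temporal information included in the training set and comparing the resulting drift curves.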