Patrick Himler, Max Landauer, Florian Skopik, Markus Wurzenberger
{"title":"日志事件序列中的异常检测:联合深度学习方法与开放挑战","authors":"Patrick Himler, Max Landauer, Florian Skopik, Markus Wurzenberger","doi":"10.1016/j.mlwa.2024.100554","DOIUrl":null,"url":null,"abstract":"<div><p>Anomaly Detection (AD) is an important area to reliably detect malicious behavior and attacks on computer systems. Log data is a rich source of information about systems and thus provides a suitable input for AD. With the sheer amount of log data available today, for years Machine Learning (ML) and more recently Deep Learning (DL) have been applied to create models for AD. Especially when processing complex log data, DL has shown some promising results in recent research to spot anomalies. It is necessary to group these log lines into log-event sequences, to detect anomalous patterns that span over multiple log lines. This work uses a centralized approach using a Long Short-Term Memory (LSTM) model for AD as its basis which is one of the most important approaches to represent long-range temporal dependencies in log-event sequences of arbitrary length. Therefore, we use past information to predict whether future events are normal or anomalous. For the LSTM model we adapt a state of the art open source implementation called LogDeep. For the evaluation, we use a Hadoop Distributed File System (HDFS) data set, which is well studied in current research. In this paper we show that without padding, which is a commonly used preprocessing step that strongly influences the AD process and artificially improves detection results and thus accuracy in lab testing, it is not possible to achieve the same high quality of results shown in literature. With the large quantity of log data, issues arise with the transfer of log data to a central entity where model computation can be done. Federated Learning (FL) tries to overcome this problem, by learning local models simultaneously on edge devices and overcome biases due to a lack of heterogeneity in training data through exchange of model parameters and finally arrive at a converging global model. Processing log data locally takes privacy and legal concerns into account, which could improve coordination and collaboration between researchers, cyber security companies, etc., in the future. Currently, there are only few scientific publications on log-based AD which use FL. Implementing FL gives the advantage of converging models even if the log data are heterogeneously distributed among participants as our results show. Furthermore, by varying individual LSTM model parameters, the results can be greatly improved. Further scientific research will be necessary to optimize FL approaches.</p></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"16 ","pages":"Article 100554"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666827024000306/pdfft?md5=fc8d0afe652c7146979d5889ecbf2afa&pid=1-s2.0-S2666827024000306-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Anomaly detection in log-event sequences: A federated deep learning approach and open challenges\",\"authors\":\"Patrick Himler, Max Landauer, Florian Skopik, Markus Wurzenberger\",\"doi\":\"10.1016/j.mlwa.2024.100554\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Anomaly Detection (AD) is an important area to reliably detect malicious behavior and attacks on computer systems. Log data is a rich source of information about systems and thus provides a suitable input for AD. With the sheer amount of log data available today, for years Machine Learning (ML) and more recently Deep Learning (DL) have been applied to create models for AD. Especially when processing complex log data, DL has shown some promising results in recent research to spot anomalies. It is necessary to group these log lines into log-event sequences, to detect anomalous patterns that span over multiple log lines. This work uses a centralized approach using a Long Short-Term Memory (LSTM) model for AD as its basis which is one of the most important approaches to represent long-range temporal dependencies in log-event sequences of arbitrary length. Therefore, we use past information to predict whether future events are normal or anomalous. For the LSTM model we adapt a state of the art open source implementation called LogDeep. For the evaluation, we use a Hadoop Distributed File System (HDFS) data set, which is well studied in current research. In this paper we show that without padding, which is a commonly used preprocessing step that strongly influences the AD process and artificially improves detection results and thus accuracy in lab testing, it is not possible to achieve the same high quality of results shown in literature. With the large quantity of log data, issues arise with the transfer of log data to a central entity where model computation can be done. Federated Learning (FL) tries to overcome this problem, by learning local models simultaneously on edge devices and overcome biases due to a lack of heterogeneity in training data through exchange of model parameters and finally arrive at a converging global model. Processing log data locally takes privacy and legal concerns into account, which could improve coordination and collaboration between researchers, cyber security companies, etc., in the future. Currently, there are only few scientific publications on log-based AD which use FL. Implementing FL gives the advantage of converging models even if the log data are heterogeneously distributed among participants as our results show. Furthermore, by varying individual LSTM model parameters, the results can be greatly improved. Further scientific research will be necessary to optimize FL approaches.</p></div>\",\"PeriodicalId\":74093,\"journal\":{\"name\":\"Machine learning with applications\",\"volume\":\"16 \",\"pages\":\"Article 100554\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666827024000306/pdfft?md5=fc8d0afe652c7146979d5889ecbf2afa&pid=1-s2.0-S2666827024000306-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning with applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666827024000306\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827024000306","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
摘要
异常检测(AD)是可靠检测计算机系统恶意行为和攻击的一个重要领域。日志数据是系统信息的丰富来源,因此为异常检测提供了合适的输入。由于日志数据量巨大,多年来,机器学习(ML)和最近的深度学习(DL)已被用于创建 AD 模型。特别是在处理复杂的日志数据时,深度学习在最近的研究中显示出了发现异常的良好效果。有必要将这些日志行分组为日志事件序列,以检测跨越多个日志行的异常模式。这项工作采用了一种集中式方法,以 AD 的长短时记忆 (LSTM) 模型为基础,该模型是在任意长度的日志事件序列中表示长程时间依赖性的最重要方法之一。因此,我们利用过去的信息来预测未来事件是正常还是异常。对于 LSTM 模型,我们采用了名为 LogDeep 的最新开源实现。为了进行评估,我们使用了 Hadoop 分布式文件系统(HDFS)数据集,该数据集在当前的研究中得到了充分的研究。在本文中,我们展示了在实验室测试中,如果没有填充这一常用的预处理步骤,就不可能获得与文献中显示的同样高质量的结果。由于日志数据量巨大,将日志数据传输到中央实体进行模型计算就成了问题。联邦学习(FL)试图克服这一问题,它在边缘设备上同时学习本地模型,并通过交换模型参数克服因训练数据缺乏异质性而产生的偏差,最终形成一个趋同的全局模型。本地处理日志数据考虑到了隐私和法律问题,这在未来可以改善研究人员、网络安全公司等之间的协调与合作。目前,只有少数基于日志的 AD 科学出版物使用了 FL。正如我们的研究结果所显示的那样,即使日志数据在参与者之间分布不均,实施 FL 也能带来收敛模型的优势。此外,通过改变单个 LSTM 模型参数,还能大大改善结果。要优化 FL 方法,还需要进一步的科学研究。
Anomaly detection in log-event sequences: A federated deep learning approach and open challenges
Anomaly Detection (AD) is an important area to reliably detect malicious behavior and attacks on computer systems. Log data is a rich source of information about systems and thus provides a suitable input for AD. With the sheer amount of log data available today, for years Machine Learning (ML) and more recently Deep Learning (DL) have been applied to create models for AD. Especially when processing complex log data, DL has shown some promising results in recent research to spot anomalies. It is necessary to group these log lines into log-event sequences, to detect anomalous patterns that span over multiple log lines. This work uses a centralized approach using a Long Short-Term Memory (LSTM) model for AD as its basis which is one of the most important approaches to represent long-range temporal dependencies in log-event sequences of arbitrary length. Therefore, we use past information to predict whether future events are normal or anomalous. For the LSTM model we adapt a state of the art open source implementation called LogDeep. For the evaluation, we use a Hadoop Distributed File System (HDFS) data set, which is well studied in current research. In this paper we show that without padding, which is a commonly used preprocessing step that strongly influences the AD process and artificially improves detection results and thus accuracy in lab testing, it is not possible to achieve the same high quality of results shown in literature. With the large quantity of log data, issues arise with the transfer of log data to a central entity where model computation can be done. Federated Learning (FL) tries to overcome this problem, by learning local models simultaneously on edge devices and overcome biases due to a lack of heterogeneity in training data through exchange of model parameters and finally arrive at a converging global model. Processing log data locally takes privacy and legal concerns into account, which could improve coordination and collaboration between researchers, cyber security companies, etc., in the future. Currently, there are only few scientific publications on log-based AD which use FL. Implementing FL gives the advantage of converging models even if the log data are heterogeneously distributed among participants as our results show. Furthermore, by varying individual LSTM model parameters, the results can be greatly improved. Further scientific research will be necessary to optimize FL approaches.