Try with Simpler – An Evaluation of Improved Principal Component Analysis in Log-based Anomaly Detection

IF 6.6 | Zone 2 (Computer Science) | JCR Q1 (Computer Science, Software Engineering)
Lin Yang, Junjie Chen, Shutao Gao, Zhihao Gong, Hongyu Zhang, Yue Kang, Huaan Li
Journal: ACM Transactions on Software Engineering and Methodology
Published: 2024-02-07
DOI: 10.1145/3644386 (https://doi.org/10.1145/3644386)
Citations: 0

Abstract

With the rapid development of deep learning (DL), the recent trend in log-based anomaly detection focuses on extracting semantic information from log events (i.e., templates of log messages) and designing more advanced DL models for anomaly detection. The effectiveness of log-based anomaly detection can indeed be improved this way, but these DL-based techniques suffer from a heavier dependency on training data (such as data quality or data labels) and higher costs in time and resources due to the complexity and scale of DL models, which hinder their practical use. In contrast, techniques based on traditional machine learning or data mining algorithms are less dependent on training data and more efficient, but they are less effective than DL-based techniques. Our motivating study confirms that this gap is mainly caused by the problem of unseen log events (some log events in incoming log messages do not appear in the training data). Intuitively, if we can improve the effectiveness of traditional techniques to be comparable with advanced DL-based techniques, log-based anomaly detection can be more practical. An existing study in another area (i.e., linking questions posted on Stack Overflow) has pointed out that traditional techniques with some optimizations can achieve effectiveness comparable to the state-of-the-art DL-based technique, indicating the feasibility of enhancing traditional log-based anomaly detection techniques to some degree.

Inspired by the idea of “try-with-simpler”, we conducted the first empirical study to explore the potential of improving traditional techniques for more practical log-based anomaly detection. In this work, we optimized the traditional unsupervised PCA (Principal Component Analysis) technique by incorporating a lightweight semantic-based log representation into it; we call the resulting technique SemPCA. We then conducted an extensive study to investigate the potential of SemPCA for more practical log-based anomaly detection. By comparing seven log-based anomaly detection techniques (four DL-based techniques, two traditional techniques, and SemPCA) on both public and industrial datasets, we show that SemPCA achieves effectiveness comparable to advanced supervised/semi-supervised DL-based techniques while being much more stable under insufficient training data and more efficient, demonstrating that a traditional technique can still excel after a small but useful adaptation.
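
The abstract does not say how the semantic representation is built or how the anomaly threshold is chosen, so the following is only a minimal sketch of the general idea (a semantic vector per log sequence, followed by unsupervised PCA), not the authors' SemPCA implementation. It makes several assumptions: hashed pseudo-embeddings stand in for pretrained word vectors, an event is the mean of its word vectors, a sequence is the sum of its event vectors, and a sequence is flagged when its residual outside the principal subspace exceeds a percentile of the training residuals. All names (`word_vector`, `event_vector`, `sequence_vector`, `PcaDetector`) and the sample templates are hypothetical.

```python
import hashlib
from typing import Sequence

import numpy as np

DIM = 16  # embedding size; an arbitrary choice for this sketch


def word_vector(word: str) -> np.ndarray:
    """Deterministic pseudo-embedding (assumption: stands in for pretrained vectors)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    return np.random.default_rng(seed).standard_normal(DIM)


def event_vector(template: str) -> np.ndarray:
    """Represent a log event (template) as the mean of its word vectors."""
    words = [w for w in template.split() if w.isalpha()]  # drops "<*>" placeholders
    if not words:
        return np.zeros(DIM)
    return np.mean([word_vector(w) for w in words], axis=0)


def sequence_vector(events: Sequence[str]) -> np.ndarray:
    """Represent a log sequence as the sum of its event vectors."""
    if not events:
        return np.zeros(DIM)
    return np.sum([event_vector(e) for e in events], axis=0)


class PcaDetector:
    """Flags samples whose residual outside the top-k principal subspace is large."""

    def __init__(self, n_components: int = 2, percentile: float = 99.0):
        self.n_components = n_components
        self.percentile = percentile  # assumption: threshold taken from training scores

    def fit(self, X: np.ndarray) -> "PcaDetector":
        self.mean_ = X.mean(axis=0)
        # Rows of vt are the principal directions of the centered training data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        self.threshold_ = np.percentile(self.score(X), self.percentile)
        return self

    def score(self, X: np.ndarray) -> np.ndarray:
        """Squared norm of the part of each sample not explained by the components."""
        centered = X - self.mean_
        projected = centered @ self.components_.T @ self.components_
        return np.sum((centered - projected) ** 2, axis=1)

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.score(X) > self.threshold_


if __name__ == "__main__":
    # Illustrative templates only; not taken from the paper's datasets.
    normal = [
        ["Receiving block <*>", "Received block <*> of size <*>", "PacketResponder <*> terminating"],
        ["Receiving block <*>", "PacketResponder <*> terminating", "Received block <*> of size <*>"],
        ["Receiving block <*>", "Receiving block <*>", "PacketResponder <*> terminating"],
        ["Received block <*> of size <*>", "PacketResponder <*> terminating"],
    ]
    detector = PcaDetector(n_components=2).fit(np.stack([sequence_vector(s) for s in normal]))
    # The second template never occurs in training but still gets a vector.
    incoming = ["Receiving block <*>", "Exception while serving block <*>"]
    print(detector.score(sequence_vector(incoming)[None, :]))
```

Because each event is embedded from its words rather than looked up by event ID, a template that never appeared in training still maps to a vector. This is how a semantic representation sidesteps the unseen-log-event problem that the abstract identifies for count-based traditional techniques.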

Source journal
ACM Transactions on Software Engineering and Methodology (Engineering & Technology - Computer Science: Software Engineering)
CiteScore: 6.30
Self-citation rate: 4.50%
Articles published: 164
Review time: >12 weeks
About the journal: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.