What information contributes to log-based anomaly detection? Insights from a configurable transformer-based approach

IF 3.1 · CAS Zone 2 (Computer Science) · JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING

Automated Software Engineering Pub Date : 2025-06-03 DOI:10.1007/s10515-025-00527-3

Xingfang Wu, Heng Li, Foutse Khomh

Automated Software Engineering, volume 32, issue 2. Journal Article. https://link.springer.com/article/10.1007/s10515-025-00527-3

Citations: 0

Abstract

Log data are generated from logging statements in the source code, providing insights into the execution processes of software applications and systems. State-of-the-art log-based anomaly detection approaches typically leverage deep learning models to capture the semantic or sequential information in the log data and detect anomalous runtime behaviors. However, the impacts of these different types of information are not clear. In addition, most existing approaches ignore the timestamps in log data, which can potentially provide fine-grained sequential and temporal information. In this work, we propose a configurable Transformer-based anomaly detection model that can capture the semantic, sequential, and temporal information in the log data and allows us to configure the different types of information as the model’s features. Additionally, we train and evaluate the proposed model using log sequences of different lengths, thus overcoming the constraint of existing methods that rely on fixed-length or time-windowed log sequences as inputs. With the proposed model, we conduct a series of experiments with different combinations of input features to evaluate the roles of different types of information (i.e., sequential, temporal, semantic information) in anomaly detection. The model can attain competitive and consistently stable performance compared to the baselines when presented with log sequences of varying lengths. The results indicate that the event occurrence information plays a key role in identifying anomalies, while the impact of the sequential and temporal information is not significant for anomaly detection on the studied public datasets. On the other hand, the findings also reveal the simplicity of the studied public datasets and highlight the importance of constructing new datasets that contain different types of anomalies to better evaluate the performance of anomaly detection models.
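To make the distinction between the information types in the abstract concrete, here is a minimal sketch of switching them on and off as input features. It is not the authors' Transformer model: `FeatureConfig`, `build_features`, and the `(timestamp, event_id)` input format are hypothetical names and assumptions chosen for illustration, and semantic features (embeddings of the log message text) are left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureConfig:
    """Hypothetical switches for the optional information types."""
    use_sequential: bool = False  # keep the order of events
    use_temporal: bool = False    # keep the time elapsed between events


def build_features(log_sequence, config):
    """Build a feature dict from a list of (timestamp_seconds, event_id) pairs.

    Event occurrence counts are always included; order and elapsed time
    are added only when the configuration asks for them.
    """
    event_ids = [event_id for _, event_id in log_sequence]
    features = {"occurrence": dict(Counter(event_ids))}
    if config.use_sequential:
        features["order"] = event_ids
    if config.use_temporal:
        timestamps = [ts for ts, _ in log_sequence]
        # Pair each timestamp with its predecessor; the first event gets 0.0.
        previous = timestamps[:1] + timestamps
        features["elapsed"] = [ts - prev for prev, ts in zip(previous, timestamps)]
    return features


# Two sequences with the same events but different order and timing.
seq_a = [(0.0, "E1"), (1.0, "E2"), (3.0, "E1")]
seq_b = [(0.0, "E2"), (0.5, "E1"), (4.0, "E1")]

occurrence_only = FeatureConfig()
everything = FeatureConfig(use_sequential=True, use_temporal=True)

# Identical under occurrence-only features, distinguishable with all features.
same_without_order = build_features(seq_a, occurrence_only) == build_features(seq_b, occurrence_only)
same_with_order = build_features(seq_a, everything) == build_features(seq_b, everything)
```

With only occurrence counts, `seq_a` and `seq_b` are indistinguishable; enabling the sequential and temporal switches separates them. The abstract's finding is that, on the studied public datasets, this extra information did not significantly change detection performance.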


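Source journal

Automated Software Engineering (Engineering & Technology: Computer Science, Software Engineering)

CiteScore

4.80

Self-citation rate

11.80%

Publication volume

Review time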

>12 weeks

Journal description: This journal details research, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes. Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.