The impact of concept drift and data leakage on log level prediction models

IF 3.5 · CAS Zone 2 (Computer Science) · JCR Q1 (Computer Science, Software Engineering)
Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan
{"title":"The impact of concept drift and data leakage on log level prediction models","authors":"Youssef Esseddiq Ouatiti, Mohammed Sayagh, Noureddine Kerzazi, Bram Adams, Ahmed E. Hassan","doi":"10.1007/s10664-024-10518-9","DOIUrl":null,"url":null,"abstract":"<p>Developers insert logging statements to collect information about the execution of their systems. Along with a logging framework (e.g., Log4j), practitioners can decide which log statement to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, machine learning models for log level prediction (LLP) were proposed by prior studies. While these models show good performances, they are still subject to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of the development history of their projects (e.g., debugging vs. testing). For example, Openstack developers interchangeably increased/decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs after fixing a new bug). Thus, the manifestation of these changing log verbosity choices across time can lead to concept drift and data leakage issues, which we wish to quantify in this paper on LLP models. In this paper, we empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches to tackle time-related issues. We observe that both shallow and deep-learning-based models suffer from both time-related issues. We also observe that training a model on just a window of the historical data (i.e., contextual model) outperforms models that are trained on the whole historical data (i.e., all-knowing model) in the case of our shallow LLP model. Finally, we observe that contextual models exhibit a different (even contradictory) model interpretability, with a (very) weak correlation between the ranking of important features of the pairs of contextual models we compared. Our findings suggest that data leakage and concept drift should be taken into consideration for LLP models. We also invite practitioners to include the size of the historical window as an additional hyperparameter to tune a suitable contextual model instead of leveraging all-knowing models.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"16 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-024-10518-9","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Developers insert logging statements to collect information about the execution of their systems. Along with a logging framework (e.g., Log4j), practitioners can decide which log statements to print or suppress by tagging each log line with a log level. Since picking the right log level for a new logging statement is not straightforward, prior studies proposed machine learning models for log level prediction (LLP). While these models show good performance, they are still subject to the context in which they are applied, specifically to the way practitioners decide on log levels in different phases of the development history of their projects (e.g., debugging vs. testing). For example, OpenStack developers alternately increased and decreased the verbosity of their logs across the history of the project in response to code changes (e.g., before vs. after fixing a new bug). These changing log verbosity choices over time can thus lead to concept drift and data leakage issues, whose impact on LLP models we quantify in this paper. We empirically quantify the impact of data leakage and concept drift on the performance and interpretability of LLP models in three large open-source systems. Additionally, we compare the performance and interpretability of several time-aware approaches that tackle time-related issues. We observe that both shallow and deep-learning-based models suffer from both time-related issues. We also observe that, for our shallow LLP model, training on just a window of the historical data (i.e., a contextual model) outperforms training on the whole historical data (i.e., an all-knowing model). Finally, we observe that contextual models exhibit different (even contradictory) interpretability, with a (very) weak correlation between the rankings of important features across the pairs of contextual models we compared. Our findings suggest that data leakage and concept drift should be taken into consideration for LLP models. We also invite practitioners to treat the size of the historical window as an additional hyperparameter and tune a suitable contextual model instead of leveraging all-knowing models.
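To make the contextual (time-aware) setup concrete, here is a minimal sketch, not taken from the paper, of how one might split log-level samples so that training uses only a trailing window of history and testing uses strictly later data, which avoids leaking future information into the training set. All names (`LogSample`, `contextual_split`, `window_size`) are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LogSample:
    timestamp: float        # time the logging statement was added or changed
    features: List[float]   # e.g., features extracted from the surrounding code
    level: str              # target log level: "debug", "info", "warn", "error", ...

def contextual_split(samples: List[LogSample],
                     window_size: int,
                     test_size: int) -> Tuple[List[LogSample], List[LogSample]]:
    """Time-aware split: train only on the `window_size` samples that
    immediately precede the test period, and test on the `test_size`
    most recent samples. Sorting by timestamp first ensures no sample
    from the future leaks into the training set."""
    assert test_size > 0, "need a non-empty test period"
    ordered = sorted(samples, key=lambda s: s.timestamp)
    train = ordered[-(window_size + test_size):-test_size]
    test = ordered[-test_size:]
    return train, test
```

An all-knowing model would instead train on `ordered[:-test_size]`, i.e., the entire history before the test period; the paper's observation for its shallow model is that tuning `window_size` in the contextual setup can outperform that alternative.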


Source Journal

Empirical Software Engineering (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 8.50
Self-citation rate: 12.20%
Annual articles: 169
Review time: >12 weeks
Journal description: Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.