Feature Bagging with Nested Rotations (FBNR) for anomaly detection in multivariate time series

IF 6.2 | Zone 2, Computer Science | Q1, COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience Pub Date : 2024-10-09 DOI:10.1016/j.future.2024.107545

Anastasios Iliopoulos, John Violos, Christos Diou, Iraklis Varlamis

Volume 163, Article 107545. Full text: https://www.sciencedirect.com/science/article/pii/S0167739X24005090

Citations: 0

Abstract

Detecting anomalies in multivariate time series poses a significant challenge across various domains. The infrequent occurrence of anomalies in real-world data, together with the lack of large numbers of annotated samples, makes this a complex task for classification algorithms. Deep Neural Network approaches based on Long Short-Term Memory (LSTM) networks, Autoencoders, and Variational Autoencoders (VAEs), among others, prove effective at handling imbalanced data. However, the same does not hold when such algorithms are applied to multivariate time series, where their performance degrades significantly. Our main hypothesis is that this degradation occurs because anomalies stem from only a small subset of the feature set. To mitigate this issue in the multivariate setting, we propose forming an ensemble of base models by combining different feature selection and transformation techniques. The proposed processing pipeline applies a Feature Bagging technique to multiple individual models, so that each model considers its own feature subset. These subsets are then partitioned and transformed using multiple nested rotations derived from Principal Component Analysis (PCA). This approach aims to identify anomalies that arise from only a small portion of the feature set, while also introducing diversity by transforming the subspaces. Each model provides an anomaly score, and the scores are then aggregated via an unsupervised decision fusion model. A semi-supervised fusion model was also explored, in which a Logistic Regressor was applied to the individual model outputs.

The proposed methodology is evaluated on the Skoltech Anomaly Benchmark (SKAB), which contains multivariate time series related to water flow in a closed circuit, and on the Server Machine Dataset (SMD), which was collected from a large Internet company. The experimental results reveal that the proposed ensemble technique surpasses state-of-the-art algorithms. The unsupervised approach demonstrated a performance improvement of 2% for SKAB and 3% for SMD compared to the baseline models. In the semi-supervised approach, the proposed method achieved an improvement of at least 10% in anomaly detection accuracy.
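The pipeline described in the abstract (feature bagging, PCA-derived rotations, score fusion) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: the function names (`fit_rotations`, `apply_rotations`, `fbnr_scores`), all parameter values, the way rotations are nested (here, successive block-wise PCA rotations with a fresh random partition at each level), the stand-in base detector (squared distance in the standardized rotated space, instead of the deep models named in the abstract), and the fusion rule (mean of median-scaled scores). The semi-supervised Logistic Regressor fusion is not covered.

```python
import numpy as np


def fit_rotations(data, rng, n_blocks, depth):
    """Fit `depth` successive block-wise PCA rotations (an assumed form of nesting)."""
    steps = []
    for _ in range(depth):
        # Re-partition the columns at every level so each rotation mixes different features.
        blocks = np.array_split(rng.permutation(data.shape[1]), n_blocks)
        rotated = np.empty_like(data)
        step = []
        for idx in blocks:
            mean = data[:, idx].mean(axis=0)
            # Rows of vt are the principal axes (assumes more samples than block columns).
            _, _, vt = np.linalg.svd(data[:, idx] - mean, full_matrices=False)
            rotated[:, idx] = (data[:, idx] - mean) @ vt.T
            step.append((idx, mean, vt.T))
        steps.append(step)
        data = rotated
    return steps, data


def apply_rotations(steps, data):
    """Apply previously fitted rotations to new data, level by level."""
    for step in steps:
        rotated = np.empty_like(data)
        for idx, mean, rot in step:
            rotated[:, idx] = (data[:, idx] - mean) @ rot
        data = rotated
    return data


def fbnr_scores(train, test, n_models=10, subset_size=4, n_blocks=2, depth=2, seed=0):
    """Fuse anomaly scores from base models, each built on its own random feature subset."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    rng = np.random.default_rng(seed)
    fused = np.zeros(test.shape[0])
    for _ in range(n_models):
        # Feature bagging: each base model sees only a random subset of the features.
        subset = rng.choice(train.shape[1], size=subset_size, replace=False)
        steps, train_rot = fit_rotations(train[:, subset], rng, n_blocks, depth)
        std = train_rot.std(axis=0) + 1e-9
        # Stand-in detector (assumption): squared distance in the standardized rotated space.
        train_score = ((train_rot / std) ** 2).sum(axis=1)
        test_score = ((apply_rotations(steps, test[:, subset]) / std) ** 2).sum(axis=1)
        # Scale by the training median so scores from different models are comparable.
        fused += test_score / (np.median(train_score) + 1e-9)
    # Unsupervised fusion (assumption): plain average of the scaled scores.
    return fused / n_models


# Synthetic usage example: the last five test rows are shifted in only 3 of 8 features.
demo_rng = np.random.default_rng(1)
demo_train = demo_rng.normal(size=(500, 8))
demo_test = demo_rng.normal(size=(100, 8))
demo_test[-5:, :3] += 12.0
demo_scores = fbnr_scores(demo_train, demo_test)
print(np.argsort(demo_scores)[-5:])  # indices of the five highest-scoring rows
```

In this sketch an anomaly confined to a few features is only visible to the base models whose subsets include those features, so averaging over many subsets is what lets the ensemble pick it up; this mirrors the hypothesis stated in the abstract but says nothing about the reported results.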




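Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 19.90

Self-citation rate: 2.70%

Articles published: 376

Review time: 10.6 months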

Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.