Feature Bagging with Nested Rotations (FBNR) for anomaly detection in multivariate time series
Anastasios Iliopoulos, John Violos, Christos Diou, Iraklis Varlamis
Future Generation Computer Systems-The International Journal of Escience, vol. 163, Article 107545. Published 2024-10-09. Impact factor: 6.2; JCR: Q1 (Computer Science, Theory & Methods).
DOI: 10.1016/j.future.2024.107545
https://www.sciencedirect.com/science/article/pii/S0167739X24005090
Citations: 0
Abstract
Detecting anomalies in multivariate time series poses a significant challenge across various domains. The infrequent occurrence of anomalies in real-world data, together with the scarcity of annotated samples, makes the task difficult for classification algorithms. Deep neural network approaches based on Long Short-Term Memory (LSTM) networks, Autoencoders, and Variational Autoencoders (VAEs), among others, prove effective at handling imbalanced data. However, their performance degrades significantly when they are applied to multivariate time series. Our main hypothesis is that this degradation occurs because anomalies stem from a small subset of the feature set. To mitigate this issue in the multivariate setting, we propose forming an ensemble of base models by combining different feature selection and transformation techniques. The proposed processing pipeline applies a Feature Bagging technique to multiple individual models, so that each model considers a separate feature subset. These subsets are then partitioned and transformed using multiple nested rotations derived from Principal Component Analysis (PCA). This approach aims to identify anomalies that arise from only a small portion of the feature set while also introducing diversity by transforming the subspaces. Each model provides an anomaly score, and these scores are then aggregated via an unsupervised decision fusion model. A semi-supervised fusion model was also explored, in which a Logistic Regressor was applied to the individual model outputs.
The proposed methodology is evaluated on the Skoltech Anomaly Benchmark (SKAB), which contains multivariate time series related to water flow in a closed circuit, and on the Server Machine Dataset (SMD), which was collected from a large Internet company. The experimental results reveal that the proposed ensemble technique surpasses state-of-the-art algorithms. The unsupervised approach demonstrated a performance improvement of 2% for SKAB and 3% for SMD compared to the baseline models. In the semi-supervised approach, the proposed method achieved an improvement of at least 10% in anomaly detection accuracy.
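The pipeline described in the abstract (feature bagging, PCA-derived rotations of partitioned subsets, per-model anomaly scores, unsupervised fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and several parts are assumptions: "nested" is read here as composing block-wise PCA rotations, each fitted on the output of the previous one; the base detector is a simple maximum absolute z-score standing in for the deep models named in the abstract; and fusion is a plain mean of the per-model scores. All function names and parameters (`fit_ensemble`, `depth`, `n_groups`, `subset_size`) are hypothetical.

```python
import numpy as np

def block_pca_rotation(X, groups):
    """Block-diagonal rotation with one PCA basis per feature group."""
    d = X.shape[1]
    R = np.zeros((d, d))
    for g in groups:
        cov = np.atleast_2d(np.cov(X[:, g], rowvar=False))
        _, vecs = np.linalg.eigh(cov)      # orthonormal eigenvectors as columns
        R[np.ix_(g, g)] = vecs
    return R

def fit_nested_rotation(X, rng, depth=2, n_groups=2):
    """Assumed reading of 'nested': compose `depth` block rotations,
    each fitted on the data already rotated by the previous ones."""
    d = X.shape[1]
    Z = X - X.mean(axis=0)
    R_total = np.eye(d)
    for _ in range(depth):
        groups = [g for g in np.array_split(rng.permutation(d), n_groups)
                  if g.size > 0]           # random partition of the subset
        R = block_pca_rotation(Z, groups)
        Z = Z @ R
        R_total = R_total @ R
    return R_total

def fit_base_model(X_train, features, rng):
    """Fit one ensemble member on its own feature subset."""
    Xs = X_train[:, features]
    mean = Xs.mean(axis=0)
    R = fit_nested_rotation(Xs, rng)
    std = ((Xs - mean) @ R).std(axis=0) + 1e-12
    return {"features": features, "mean": mean, "R": R, "std": std}

def score(model, X):
    """Stand-in detector: largest absolute z-score in the rotated subspace."""
    Z = (X[:, model["features"]] - model["mean"]) @ model["R"]
    return np.abs(Z / model["std"]).max(axis=1)

def fit_ensemble(X_train, n_models=10, subset_size=3, seed=0):
    """Feature bagging: each model gets a random feature subset."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    return [fit_base_model(X_train,
                           rng.choice(d, size=subset_size, replace=False),
                           rng)
            for _ in range(n_models)]

def fused_score(models, X):
    """Unsupervised fusion (assumed): mean of the per-model scores."""
    return np.mean([score(m, X) for m in models], axis=0)

# Synthetic usage: one test point is anomalous in a single feature only.
data_rng = np.random.default_rng(42)
X_train = data_rng.normal(size=(500, 8))
X_test = data_rng.normal(size=(100, 8))
X_test[10, 2] += 12.0
models = fit_ensemble(X_train, n_models=20, subset_size=3, seed=0)
scores = fused_score(models, X_test)
print("highest fused score at index", int(np.argmax(scores)))
```

A maximum z-score detector is used because, unlike a Mahalanobis distance, it is not invariant to rotation, so the rotations actually change what each model sees. A semi-supervised variant in the spirit of the abstract would instead stack the per-model scores as features and fit a logistic regression on the labeled samples.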
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.