{"title":"Scalable and accurate online multivariate anomaly detection","authors":"Rebecca Salles , Benoit Lange , Reza Akbarinia , Florent Masseglia , Eduardo Ogasawara , Esther Pacitti","doi":"10.1016/j.is.2025.102524","DOIUrl":"10.1016/j.is.2025.102524","url":null,"abstract":"<div><div>The continuous monitoring of dynamic processes generates vast amounts of streaming multivariate time series data. Detecting anomalies within them is crucial for real-time identification of significant events, such as environmental phenomena, security breaches, or system failures, which can critically impact sensitive applications. Despite significant advances in univariate time series anomaly detection, scalable and efficient solutions for online detection in multivariate streams remain underexplored. This challenge becomes increasingly prominent with the growing volume and complexity of multivariate time series data in streaming scenarios. In this paper, we provide the first structured survey primarily focused on scalable and online anomaly detection techniques for multivariate time series, offering a comprehensive taxonomy. Additionally, we introduce the Online Distributed Outlier Detection (2OD) methodology, a novel well-defined and repeatable process designed to benchmark the online and distributed execution of anomaly detection methods. Experimental results with both synthetic and real-world datasets, covering up to hundreds of millions of observations, demonstrate that a distributed approach can enable centralized algorithms to achieve significant computational efficiency gains, averaging tens and reaching up to hundreds in speedup, without compromising detection accuracy.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"131 ","pages":"Article 102524"},"PeriodicalIF":3.0,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143196973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the use of trajectory data for tackling data scarcity","authors":"Gerard Pons , Besim Bilalli , Alberto Abelló , Santiago Blanco Sánchez","doi":"10.1016/j.is.2025.102523","DOIUrl":"10.1016/j.is.2025.102523","url":null,"abstract":"<div><div>In recent years, the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies have enabled the ubiquitous capturing of the location of moving objects. As a result, trajectory data are abundantly available and there is an increasing trend in analyzing them in the context of mobility data science. However, the abundant availability of trajectory data makes them compelling for other tasks too. In this paper, we propose the use of these data to tackle the data scarcity problem in data analysis by appropriately transforming them to extract relevant knowledge. The challenge lies not just in leveraging these abundant trajectory data, but in accurately deriving information from them that closely approximates the target variable of interest. Such knowledge can be used to generate or supplement the scarcely available datasets in a data analytics problem, thereby enhancing model learning. We showcase the feasibility of our approach in the domain of fishing where there is an abundance of trajectory data but a scarcity of detailed catch information. By using environmental data as explanatory variables, we build and compare models to predict fishing productivity using the actual catches from fishing reports and/or the inferred knowledge from the vessel’s trajectories. The results show that, mainly due to trajectory data being larger in volume than fishing data, models trained with the former obtain a precision 7.9% higher, despite the simplicity of the applied transformations.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"130 ","pages":"Article 102523"},"PeriodicalIF":3.0,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143311830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting explicit recommendation with DC-GCN: Divide-and-Conquer Graph Convolution Network","authors":"Furong Peng , Fujin Liao , Xuan Lu , Jianxing Zheng , Ru Li","doi":"10.1016/j.is.2024.102513","DOIUrl":"10.1016/j.is.2024.102513","url":null,"abstract":"<div><div>In recent years, Graph Convolutional Networks (GCNs) have primarily been applied to implicit feedback recommendation, with limited exploration in explicit scenarios. Although explicit recommendations can yield promising results, the conflict between the sparsity of data and the data starvation of deep learning hinders its development. Unlike implicit scenarios, explicit recommendation provides less evidence for predictions and requires distinguishing weights of edges (ratings) in the user-item graph.</div><div>To exploit high-order relations by GCN in explicit scenarios, we propose dividing the explicit rating graph into sub-graphs, each containing only one type of rating. We then employ GCN to capture user and item representations within each sub-graph, allowing the model to focus on rating-related user-item relations, and aggregate the representations of all subgraphs by MLP for the final recommendation. This approach, named Divide-and-Conquer Graph Convolution Network (DC-GCN), simplifies each model’s mission and highlights the strengths of individual modules. Considering that creating GCNs for each sub-graph may result in over-fitting and faces more serious data sparsity, we propose to share node embeddings for all GCNs to reduce the number of parameters, and create rating-aware embedding for each sub-graph to model rating-related relations. Moreover, to alleviate over-smoothing, we utilize random column mask to randomly select columns of node features to update in GCN layers. This technique can prevent node representations from becoming homogeneous in deep GCN networks. DC-GCN is evaluated on four public datasets and achieves the SOTA experimentally. Furthermore, DC-GCN is analyzed in cold-start and popularity bias scenarios, exhibiting competitive performance in various scenarios.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"130 ","pages":"Article 102513"},"PeriodicalIF":3.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143311864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inductive link prediction via global relational semantic learning","authors":"Chong Mu , Lizong Zhang , Junsong Li , Zhiguo Wang , Ling Tian , Ming Jia","doi":"10.1016/j.is.2024.102514","DOIUrl":"10.1016/j.is.2024.102514","url":null,"abstract":"<div><div>Knowledge graphs (KGs) play a crucial role in storing and utilizing real-world facts, but they often suffer from sparse and missing relations. To overcome these challenges, researchers have proposed relation prediction models, including embedding-based methods. However, these methods are restricted to the transductive setting and require retraining when new entities emerge. Thus, recent research has focused on the inductive setting, allowing for different entities in the test set. Subgraph-based models utilizing graph neural networks (GNNs) for local structural information aggregation have shown promising performance. However, existing approaches focus only on local structural information, ignoring the semantic correlation among relations in the global perspective, resulting in sub-optimal performance. Thus, we propose an inductive relation prediction model GRelGT that incorporates the <strong>g</strong>lobal <strong>rel</strong>ation <strong>g</strong>raph with <strong>t</strong>opological information and the enclosing subgraph. GRelGT consists of two core components: a global relation graph module and a subgraph module. The global relation graph module converts the original knowledge graph into a relation graph, with nodes representing edges (triples) in KGs. Furthermore, we introduce four topological structural features as edge types in the global relation graph to facilitating the learning of the semantic correlations between relations. By leveraging the topological features of the relations, the model’s ability to capture the hidden patterns in the KG is enhanced. Meanwhile, the subgraph module is dedicated to exploring the local structural and semantic information within the enclosing subgraph around the target triple. For a more precise understanding of semantic correlations, we further introduce global relation-aware attention and local query-aware attention mechanisms in the subgraph GNN. This allows GRelGT to dynamically weigh the importance of different relations, effectively leveraging both global and local information for inference. Experimental results on three KG datasets demonstrate the superiority of our model compared to state-of-the-art approaches.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"130 ","pages":"Article 102514"},"PeriodicalIF":3.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143311863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantifying and relating the completeness and diversity of process representations using species estimation","authors":"Martin Kabierski, Markus Richter, Matthias Weidlich","doi":"10.1016/j.is.2024.102512","DOIUrl":"10.1016/j.is.2024.102512","url":null,"abstract":"<div><div>The analysis of process representations, such as event logs or process models, has become a staple in the context of business process management. Insights gained from such an analysis serve to monitor and improve the business processes that is captured. Yet, any process representation is merely a sample of the past and possible behaviour of a business process, which raises the question of its representativeness: To which extent does the process representation capture the process characteristics that are relevant for the analysis? In this paper, we propose to answer this question using estimators from biodiversity research. Specifically, we propose to infer a completeness profile based on the estimated number of distinct relevant characteristics of the process representation and a diversity profile, that captures the heterogeneity of relevant distinct characteristics using asymptotic Hill numbers. We validate the applicability of the proposed estimators for process analysis in a series of controlled experiments. Applying the estimators to real-world event logs, we highlight potential issues in terms of trustworthiness of analysis that is based on them, and show how the profiles can be leveraged to compare different process representations concerning their similarity and completeness.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"130 ","pages":"Article 102512"},"PeriodicalIF":3.0,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143311862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive sliding window normalization","authors":"George Papageorgiou, Christos Tjortjis","doi":"10.1016/j.is.2024.102515","DOIUrl":"10.1016/j.is.2024.102515","url":null,"abstract":"<div><div>Time series data, frequent in various domains such as finance, healthcare, environmental monitoring, and energy management, often exhibit nonstationary behaviors and anomalies that challenge traditional normalization techniques. This research proposes an innovative methodology termed Adaptive Sliding Window Normalization (ASWN) to address these limitations. ASWN dynamically adjusts normalization window sizes based on detected anomalies with multiple methods, applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for the finalization of those, and utilizes the Akaike Information Criterion (AIC) with AutoRegressive Integrated Moving Average (ARIMA) models to determine optimal window sizes in the absence of anomalies. This approach integrates multiple anomaly detection methods to ensure responsiveness to changes in data patterns and effective management of outliers. ASWN is applied to diverse time series datasets, including energy consumption, and financial data, demonstrating significant improvements in predictive accuracy. Extensive experiments show that ASWN outperforms traditional normalization methods, providing empirical evidence of its benefits in handling nonstationary and anomalous data. This research enhances the robustness and reliability of time series forecasting and contributes to the broader field by thoroughly documenting the methodology, experimental setup, and results. The findings are intended to foster further advancements in time series normalization and forecasting.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"129 ","pages":"Article 102515"},"PeriodicalIF":3.0,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143165465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive event matching with predictive analysis in content-based publish/subscribe systems","authors":"Yongpeng Dong, Shiyou Qian, Tianchen Ding, Jian Cao, Guangtao Xue, Minglu Li","doi":"10.1016/j.is.2024.102508","DOIUrl":"10.1016/j.is.2024.102508","url":null,"abstract":"<div><div>The real-time efficacy of content-based publish/subscribe systems is largely dependent on the efficiency of matching algorithms. Current methodologies mainly focus on overall matching performance, often ignoring the dynamic nature and evolving trends of hot events. This paper introduces a novel, learning-driven approach – the proactive adjustment framework (PAF) – tailored to dynamically adapt to hot event trends. By strategically prioritizing subscriptions in alignment with the changing dynamics of hot events, PAF enhances the efficiency of matching algorithms and optimize the system real-time performance. One challenge of PAF is the trade-off that needs to be made between the gains of improving real-time performance by identifying matching subscriptions earlier and the cost of increasing matching time due to subscription classification and adjustment. We design a concise scheme to classify subscriptions, establish a lightweight adjustment mechanism to handle dynamics, and propose an efficient greedy algorithm to compute adjustment plans. This approach helps to mitigate the impact of PAF on matching performance. The experiment results show that the 95th percentile of the determining time of matching subscriptions is improved by about 50.5% and the throughput is also increased by 13%, compared to the baseline SCSL. Furthermore, we integrate PAF into Apache Kafka to augment it as a content-based publish/subscribe system. We test the effectiveness of PAF using two real-world datasets. Compared with two baselines, SCSL and REIN, PAF achieves an improvement of 22.5% and 51.8% respectively in average event transfer latency.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"129 ","pages":"Article 102508"},"PeriodicalIF":3.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143165463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Capturing end-to-end provenance for machine learning pipelines","authors":"Marius Schlegel, Kai-Uwe Sattler","doi":"10.1016/j.is.2024.102495","DOIUrl":"10.1016/j.is.2024.102495","url":null,"abstract":"<div><div>Modern workflows for developing ML pipelines utilize ML artifact management systems (ML AMSs) such as MLflow in addition to traditional version control systems such as Git. ML AMSs collect data, model, metadata and software artifacts used and produced in pipeline development workflows. While ensuring repeatability and reproducibility, the provenance capabilities are still rudimentary, mainly due to incomplete traces, coarse granularity, and limited query capabilities. In this paper, we introduce a comprehensive PROV-compliant provenance model that captures end-to-end provenance traces of ML pipelines, their artifacts, and their relationships based on MLflow and Git activities. Moreover, we present the tool MLflow2PROV for continuously extracting provenance graphs according to our model, enabling querying, analyzing, and processing of the collected provenance information.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102495"},"PeriodicalIF":3.0,"publicationDate":"2024-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143686876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the discovery of seasonal gradual patterns through periodic patterns mining","authors":"Jerry Lonlac , Arnaud Doniec , Marin Lujak , Stéphane Lecoeuche","doi":"10.1016/j.is.2024.102511","DOIUrl":"10.1016/j.is.2024.102511","url":null,"abstract":"<div><div>Gradual patterns, capturing intricate attribute co-variations expressed as “when X increases/decreases, Y increases/decreases” in numerical data, play a vital role in managing vast volumes of complex numerical data in real-world applications. Recently, the data science community has focused on efficient extraction methods for gradual patterns from temporal data. However, there is a notable gap in approaches addressing the extraction of gradual patterns that capture seasonality from the graduality point of view in the temporal data sequences, despite their potential to yield valuable insights in applications such as e-commerce. This paper proposes a new method for extracting co-variations of periodically repeating attributes termed as seasonal gradual patterns. To achieve this, we formulate the task of mining seasonal gradual patterns as the problem of mining periodic patterns in multiple sequences and then, leverage periodic pattern mining algorithms to extract seasonal gradual patterns. Additionally, we propose a new antimonotonic support definition associated with these seasonal gradual patterns. Illustrative results from real-world datasets demonstrate the efficiency of the proposed approach and its ability to sift through numerous non-seasonal patterns to identify the seasonal ones.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"129 ","pages":"Article 102511"},"PeriodicalIF":3.0,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143165464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate conformance checking: Fast computation of multi-perspective, probabilistic alignments","authors":"Alessandro Gianola , Jonghyeon Ko , Fabrizio Maria Maggi , Marco Montali , Sarah Winkler","doi":"10.1016/j.is.2024.102510","DOIUrl":"10.1016/j.is.2024.102510","url":null,"abstract":"<div><div>In the context of process mining, alignments are increasingly being adopted for conformance checking, due to their ability in providing sophisticated diagnostics on the nature and extent of deviations between observed traces and a reference process model. On the downside, deriving alignments is challenging from the computational point of view, even more so when dealing with multiple perspectives in the process, such as, in particular, data. In fact, every observed trace must in principle be compared with infinitely many model traces. In this work, we tackle this computational bottleneck by borrowing the classical idea of <em>encoding</em> from machine learning. Instead of computing alignments directly and exactly, we do so in an approximate way after applying a lossy trace encoding that maps each trace into a corresponding compact, vectorial representation that retains only certain information of the original trace. We study trace encoding-based approximate alignments for processes equipped with event data attributes, from three different angles. First, we indeed show that computing approximate alignments in this way is much more efficient than in the exact setting. Second, we evaluate how accurate such approximate alignments are, considering different encoding strategies that focus on different features of the trace. Our findings suggest that sufficiently rich encodings actually yield good accuracy. Third, we consider the impact of frequency and density of model variants, comparing the effectiveness of using standard approximate multi-perspective alignments as opposed to a variant that incorporates probabilities. As a by-product of this analysis, we also obtain insights on how these two approaches perform in the presence of noise.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"129 ","pages":"Article 102510"},"PeriodicalIF":3.0,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143165462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}