Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro
{"title":"Training-free sparse representations of dense vectors for scalable information retrieval","authors":"Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro","doi":"10.1016/j.is.2025.102567","DOIUrl":"10.1016/j.is.2025.102567","url":null,"abstract":"<div><div>In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102567"},"PeriodicalIF":3.0,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Alternating Optimization Scheme for Binary Sketches","authors":"Erik Thordsen, Erich Schubert","doi":"10.1016/j.is.2025.102563","DOIUrl":"10.1016/j.is.2025.102563","url":null,"abstract":"<div><div>Searching for similar objects in intrinsically high-dimensional data sets is a challenging task. The use of compact sketches has been proposed for faster similarity search using linear scans. Binary sketches are one such approach to find a good mapping from the original data space to bit strings of a fixed length. These bit strings can be compared efficiently using only few XOR and bit count operations, replacing costly similarity computations with an inexpensive approximation. We propose a new scheme to initialize and improve binary sketches for similarity search in Euclidean spaces. Our optimization iteratively improves the quality of the sketches with a form of orthogonalization. We provide empirical evidence that the quality of the sketches has a peak beyond which it is not correlated to neither bit independence nor bit balance, which contradicts a previous hypothesis in the literature. Regularization in the form of noise added to the training data can turn the peak into a plateau and applying the optimization in a stochastic fashion, i.e., training on smaller subsets of the data, allows for rapid initialization. We provide a loss function that allows to approximate the same objective using neural network frameworks such as PyTorch, elevating the approach to GPU-based training.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102563"},"PeriodicalIF":3.0,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144070924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Class Representatives Selection in non-metric spaces for nearest prototype classification","authors":"Jaroslav Hlaváč , Martin Kopp , Tomáš Skopal","doi":"10.1016/j.is.2025.102564","DOIUrl":"10.1016/j.is.2025.102564","url":null,"abstract":"<div><div>The nearest prototype classification is a less computationally intensive replacement for the <span><math><mi>k</mi></math></span>-NN method, especially when large datasets are considered. Centroids are often used as prototypes to represent whole classes in metric spaces. Selection of class prototypes in non-metric spaces is more challenging as the idea of computing centroids is not directly applicable. Instead, a set of representative objects can be used as the class prototype.</div><div>This paper presents the Class Representatives Selection (CRS) method, a novel memory and computationally efficient method that finds a small yet representative set of objects from each class to be used as a prototype. CRS leverages the similarity graph representation of each class created by the NN-Descent algorithm to pick a low number of representatives that ensure sufficient class coverage. Thanks to the graph-based approach, CRS can be applied to any space where at least a pairwise similarity can be defined. In the experimental evaluation, we demonstrate that our method outperforms the state-of-the-art techniques on multiple datasets from different domains.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102564"},"PeriodicalIF":3.0,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143948792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Back to the Order: Partial orders in streaming conformance checking","authors":"Kristo Raun , Riccardo Tommasini , Ahmed Awad","doi":"10.1016/j.is.2025.102566","DOIUrl":"10.1016/j.is.2025.102566","url":null,"abstract":"<div><div>Most organizations are built around their business processes. Commonly, these processes follow a predefined path. Deviations from the expected path can lead to lower quality products and services, reduced efficiencies, and compliance liabilities. Rapid identification of deviations helps mitigate such risks. For identifying deviations, the conformance checker would need to know the sequence in which events occurred. In this paper, we tackle two challenges associated with knowing the right sequence of events. First, we look at out-of-order event arrival, a common occurrence in modern information systems. Second, we extend the previous work by incorporating partial order handling. Partially ordered events are a well-studied problem in process mining, but to the best of our knowledge it has not been researched in terms of fast-paced streaming conformance checking. Real-life and semi-synthetic datasets are used for validating the proposed methods.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102566"},"PeriodicalIF":3.0,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143948791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Timeline-based process discovery","authors":"Christoffer Rubensson , Harleen Kaur , Timotheus Kampik , Jan Mendling","doi":"10.1016/j.is.2025.102568","DOIUrl":"10.1016/j.is.2025.102568","url":null,"abstract":"<div><div>A key concern of automatic process discovery is providing insights into business process performance. Process analysts are specifically interested in waiting times and delays for identifying opportunities to speed up processes. Against this backdrop, it is surprising that current techniques for automatic process discovery generate directly-follows graphs and comparable process models without representing the time axis explicitly. This paper presents four layout strategies for automatically constructing process models that explicitly align with a time axis. We exemplify our approaches for directly-follows graphs. We evaluate their effectiveness by applying them to real-world event logs with varying complexities. Our specific focus is on their ability to handle the trade-off between high control-flow abstraction and high consistency of temporal activity order. Our results show that timeline-based layouts provide benefits in terms of an explicit representation of temporal distances. They face challenges for logs with many repeating and concurrent activities.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102568"},"PeriodicalIF":3.0,"publicationDate":"2025-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144070923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stochastic conformance checking based on variable-length Markov chains","authors":"Emilio Incerto , Andrea Vandin , Sima Sarv Ahrabi","doi":"10.1016/j.is.2025.102561","DOIUrl":"10.1016/j.is.2025.102561","url":null,"abstract":"<div><div>Conformance checking is central in process mining (PM). It studies deviations of logs from reference processes. Originally, the proposed approaches did not focus on stochastic aspects of the underlying process, and gave qualitative models as output. Recently, these have been extended in approaches for <em>stochastic conformance checking</em> (SCC), giving quantitative models as output. A different community, namely the <em>software performance engineering</em> (PE) one, interested in the synthesis of stochastic processes since decades, has developed independently techniques to synthesize Markov Chains (MC) that describe the stochastic process underlying program runs. However, these were never applied to SCC problems. We propose a novel approach to SCC based on PE results for the synthesis of stochastic processes. Thanks to a rich experimental evaluation, we show that it outperforms the state-of-the-art. In doing so, we further bridge PE and PM, fostering cross-fertilization. We use techniques for the synthesis of Variable-length MC (VLMC), higher-order MC able to compactly encode complex path dependencies in the control-flow. VLMCs are equipped with a notion of likelihood that a trace belongs to a model. We use it to perform SCC of a log against a model. We establish the degree of conformance by equipping VLMCs with uEMSC, a standard conformance measure in the SCC literature. We compare with 18 SCC techniques from the PM literature, using 11 benchmark datasets from the PM community. We outperform all approaches in 10 out of 11 datasets, i.e., we get uEMSC values closer to 1 for logs conforming to a model. Furthermore, we show that VLMC are efficient, as they handled all considered datasets in a few seconds.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102561"},"PeriodicalIF":3.0,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144068634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stef van den Elzen , Mieke Jans , Niels Martin , Femke Pieters , Christian Tominski , Maria-Cruz Villa-Uriol , Sebastiaan J. van Zelst
{"title":"Towards Multi-Faceted Visual Process Analytics","authors":"Stef van den Elzen , Mieke Jans , Niels Martin , Femke Pieters , Christian Tominski , Maria-Cruz Villa-Uriol , Sebastiaan J. van Zelst","doi":"10.1016/j.is.2025.102560","DOIUrl":"10.1016/j.is.2025.102560","url":null,"abstract":"<div><div>Both the fields of Process Mining (PM) and Visual Analytics (VA) aim to make complex phenomena understandable. In PM, the goal is to gain insights into the execution of complex processes by analyzing the event data that is captured in event logs. This data is inherently multi-faceted, meaning that it covers various data facets, including spatial and temporal dependencies, relations between data entities (such as cases/events), and multivariate data attributes per entity. However, the multi-faceted nature of the data has not received much attention in PM. Conversely, VA research has investigated interactive visual methods for making multi-faceted data understandable for about two decades. In this study, we bring together PM and VA with the goal of advancing towards Visual Process Analytics (VPA) of multi-faceted processes. To this end, we present a systematic view of relevant (VA) data facets in the context of PM and assess to what extent existing PM visualizations address the data facets’ characteristics, making use of VA guidelines. In addition to visualizations, we look at how PM can benefit from analytical abstraction and interaction techniques known in the VA realm. Based on this, we discuss open challenges and opportunities for future research towards multi-faceted VPA.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102560"},"PeriodicalIF":3.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143934651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Support estimation in frequent itemsets mining on Enriched Two Level Tree","authors":"Clémentin Tayou Djamegni , William Kery Branston Ndemaze , Edith Belise Kenmogne , Hervé Maradona Nana Kouassi , Arnauld Nzegha Fountsop , Idriss Tetakouchom , Laurent Cabrel Tabueu Fotso","doi":"10.1016/j.is.2025.102559","DOIUrl":"10.1016/j.is.2025.102559","url":null,"abstract":"<div><div>Efficiently counting the support of candidate itemsets is a crucial aspect of extracting frequent itemsets because it directly impacts the overall performance of the mining process. Researchers have developed various techniques and data structures to overcome this challenge, but the problem is still open. In this paper, we investigate the two-level tree enrichment technique as a potential solution without adding significant computational overhead. In addition, we introduce ETL_Miner, a novel algorithm that provides an estimated bound for the support value of all candidate itemsets within the search space. The method presented in this article is flexible and can be used with various algorithms. To demonstrate this point, we introduce a modified version of Apriori that integrates ETL_Miner as an extra pruning phase. Preliminary empirical experimental results on both real and synthetic datasets confirm the accuracy of the proposed method and reduce the total extraction time.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102559"},"PeriodicalIF":3.0,"publicationDate":"2025-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143934650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Substring compression variations and LZ78-Derivates","authors":"Dominik Köppl","doi":"10.1016/j.is.2025.102553","DOIUrl":"10.1016/j.is.2025.102553","url":null,"abstract":"<div><div>We propose algorithms computing the semi-greedy Lempel–Ziv 78 (LZ78), the Lempel–Ziv Double (LZD), and the Lempel–Ziv–Miller–Wegman (LZMW) factorizations in linear time for integer alphabets. For LZD and LZMW, we additionally propose data structures that can be constructed in linear time, which can solve the substring compression problems for these factorizations in time linear in the output size. For substring compression, we give the first results for lexparse and closed factorizations.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102553"},"PeriodicalIF":3.0,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143906791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning to resolve inconsistencies in qualitative constraint networks","authors":"Anastasia Paparrizou, Michael Sioutis","doi":"10.1016/j.is.2025.102557","DOIUrl":"10.1016/j.is.2025.102557","url":null,"abstract":"<div><div>In this paper, we present a reinforcement learning approach for resolving inconsistencies in qualitative constraint networks (<span><math><mi>QCN</mi></math></span>s). <span><math><mi>QCN</mi></math></span>s are typically used in constraint programming to represent and reason about intuitive spatial or temporal relations like <em>x</em> {<em>is inside of</em> <span><math><mo>∨</mo></math></span> <em>overlaps</em>} <em>y</em>. Naturally, <span><math><mi>QCN</mi></math></span>s are not immune to uncertainty, noise, or imperfect data that may be present in information, and thus, more often than not, they are hampered by inconsistencies. We propose a multi-armed bandit approach that defines a well-suited ordering of constraints for finding a maximal satisfiable subset of them. Specifically, our learning approach interacts with a solver, and after each trial a reward is returned to measure the performance of the selected action (constraint addition). The reward function is based on the reduction of the solution space of a consistent reconstruction of the input <span><math><mi>QCN</mi></math></span>. Experimental results with different bandit policies and various rewards that are obtained by our algorithm suggest that we can do better than the state of the art in terms of both effectiveness, viz., lower number of repairs obtained for an inconsistent <span><math><mi>QCN</mi></math></span>, and efficiency, viz., faster runtime.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102557"},"PeriodicalIF":3.0,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}