Junwei Hu , Michael Bewong , Selasi Kwashie , Yidi Zhang , Vincent Nofong , John Wondoh , Zaiwen Feng
{"title":"When GDD meets GNN: A knowledge-driven neural connection for effective entity resolution in property graphs","authors":"Junwei Hu , Michael Bewong , Selasi Kwashie , Yidi Zhang , Vincent Nofong , John Wondoh , Zaiwen Feng","doi":"10.1016/j.is.2025.102551","DOIUrl":"10.1016/j.is.2025.102551","url":null,"abstract":"<div><div>This paper studies the entity resolution (ER) problem in property graphs. ER is the task of identifying and linking different records that refer to the same real-world entity. It is commonly used in data integration, data cleansing, and other applications where it is important to have accurate and consistent data. In general, two predominant approaches exist in the literature: rule-based and learning-based methods. On the one hand, rule-based techniques are often desired due to their explainability and ability to encode domain knowledge. Learning-based methods, on the other hand, are preferred due to their effectiveness in spite of their black-box nature. In this work, we devise a hybrid ER solution, <span>GraphER</span>, that leverages the strengths of both systems for property graphs. In particular, we adopt <em>graph differential dependency</em> (GDD) for encoding the so-called <em>record-matching rules</em>, and employ them to guide a graph neural network (GNN) based representation learning for the task. We conduct extensive empirical evaluation of our proposal on benchmark ER datasets including 17 graph datasets and 7 relational datasets in comparison with 10 state-of-the-art (SOTA) techniques. 
The results show that our approach provides a significantly better solution to addressing ER in graph data, both quantitatively and qualitatively, while attaining highly competitive results on the benchmark relational datasets <em>w.r.t.</em> the SOTA solutions.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102551"},"PeriodicalIF":3.0,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tomas Llano-Rios , Mohamed Khalefa , Antonio Badia
{"title":"A JSON document algebra for query optimization","authors":"Tomas Llano-Rios , Mohamed Khalefa , Antonio Badia","doi":"10.1016/j.is.2025.102537","DOIUrl":"10.1016/j.is.2025.102537","url":null,"abstract":"<div><div>Due to the popularity of JSON, several systems have been developed that store data in collections of JSON documents. Each system has developed its own query language, sometimes in an ad-hoc manner. This makes it difficult to formally define and analyze query optimization techniques. We propose an algebra tailored to JSON documents. First, we argue that JSON is different from nested relations and XML and therefore requires its own solution. Then, we propose an algebra on 3 levels: the first level defines operators to manipulate individual documents, providing an abstraction over different serializations. The second level provides operators over collections of JSON documents, while the third level also defines collection operators that are not primitive but that enable direct and efficient implementation of data manipulation operations. We provide a number of properties of the algebraic operators which provide a solid basis for query optimization.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102537"},"PeriodicalIF":3.0,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sedir Mohammed , Lukas Budach , Moritz Feuerpfeil , Nina Ihde , Andrea Nathansen , Nele Noack , Hendrik Patzlaff , Felix Naumann , Hazar Harmouch
{"title":"The effects of data quality on machine learning performance on tabular data","authors":"Sedir Mohammed , Lukas Budach , Moritz Feuerpfeil , Nina Ihde , Andrea Nathansen , Nele Noack , Hendrik Patzlaff , Felix Naumann , Hazar Harmouch","doi":"10.1016/j.is.2025.102549","DOIUrl":"10.1016/j.is.2025.102549","url":null,"abstract":"<div><div>Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high-quality training and test data along many quality dimensions, such as accuracy, completeness, and consistency.</div><div>We explore empirically the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102549"},"PeriodicalIF":3.0,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143642966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zihang Su , Tianshi Yu , Artem Polyvyanyy , Ying Tan , Nir Lipovetzky , Sebastian Sardiña , Nick van Beest , Alireza Mohammadi , Denny Oetomo
{"title":"Process mining over sensor data: Goal recognition for powered transhumeral prostheses","authors":"Zihang Su , Tianshi Yu , Artem Polyvyanyy , Ying Tan , Nir Lipovetzky , Sebastian Sardiña , Nick van Beest , Alireza Mohammadi , Denny Oetomo","doi":"10.1016/j.is.2025.102540","DOIUrl":"10.1016/j.is.2025.102540","url":null,"abstract":"<div><div>Process mining (PM)-based goal recognition (GR) techniques, which infer goals or targets based on sequences of observed actions, have shown efficacy in real-world engineering applications. This study explores the applicability of PM-based GR in identifying target poses for users employing powered transhumeral prosthetics. These prosthetics are designed to restore missing anatomical segments below the shoulder, including the hand. In this article, we aim to apply the GR techniques to identify the intended movements of users, enabling the motors on the powered transhumeral prosthesis to execute the desired motions precisely. In this way, a powered transhumeral prosthesis can assist individuals with disabilities in completing movement tasks. PM-based GR techniques were initially designed to infer goals from sequences of observed actions, where discrete event names represent actions. However, the electromyography electrodes and kinematic sensors on powered transhumeral prosthetic devices register sequences of continuous, real-valued data measurements. Therefore, we rely on methods to transform sensor data into discrete events and integrate these methods with the PM-based GR system to develop target pose recognition approaches. Two data transformation approaches are introduced. The first approach relies on the clustering of data measurements collected before the target pose is reached (the clustering approach). The second approach uses the time series of measurements collected during the dynamic user movement to perform linear discriminant analysis (LDA) classification and identify discrete events (the dynamic LDA approach). 
These methods are evaluated through offline and human-in-the-loop (online) experiments and compared with established techniques, such as static LDA, an LDA classification based on data collected at static target poses, and GR approaches based on neural networks. Real-time human-in-the-loop experiments further validate the effectiveness of the proposed methods, demonstrating that PM-based GR using the dynamic LDA classifier achieves superior <span><math><msub><mrow><mi>F</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span> score and balanced accuracy compared to state-of-the-art techniques.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102540"},"PeriodicalIF":3.0,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143611419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Resource allocation in business process executions—A systematic literature study","authors":"Luise Pufahl , Fabian Stiehle , Sven Ihde , Mathias Weske , Ingo Weber","doi":"10.1016/j.is.2025.102541","DOIUrl":"10.1016/j.is.2025.102541","url":null,"abstract":"<div><div>To achieve their goals, organizations execute business processes, which require effective allocation of resources to process activities. This results in the decision-making problem: Which resources should be allocated to which process activities? This problem significantly impacts both process efficiency and effectiveness. Over the past decades, various system-initiated (largely automated) resource allocation approaches have been developed. This study presents a comprehensive overview of this field by analyzing 61 primary studies identified through a rigorous, structured literature review covering publications from 1995 to 2023. We investigate resource allocation goals and cardinalities and describe how process models, execution data, and task attributes, as well as resource attributes, are used to specify the resource allocation problem. Additionally, the type of algorithmic solution and evaluation methods are discussed. This study shows that most approaches support 1-to-1 allocation cardinalities only, specify process-oriented goals, focus on process models, and utilize rule-based methods. 
Based on the results, we call for future research to define common terminology, support evidence-oriented resource allocation and adaptability, and improve reproducibility and comparability by performing benchmarking studies.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102541"},"PeriodicalIF":3.0,"publicationDate":"2025-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143601203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-aware automated ICD coding: A semantic-driven approach","authors":"O.K. Reshma, N. Saleena, K.A. Abdul Nazeer","doi":"10.1016/j.is.2025.102539","DOIUrl":"10.1016/j.is.2025.102539","url":null,"abstract":"<div><div>Identifying the exact International Classification of Diseases (ICD) codes describing a patient’s health condition is essential in classifying patients with similar disease conditions. Numerous studies have devised automated approaches to retrieve the ICD codes from patients’ health records. However, the majority of these methodologies have considered ICD codes solely as alphanumeric codes, overlooking their descriptions and thus neglecting the inherent semantics. Also, these methodologies overlook the one-to-many semantic relationships between diagnosis and assigned ICD code descriptions. Consequently, this constrains these approaches from effectively assigning ICD codes with meaningful context. This work addresses these limitations by capturing the semantic similarity between the diagnosis and ICD code descriptions, while utilising the inherent one-to-many relationships between them, to accurately assign ICD codes. For this, we formulate the ICD coding problem as a Semantic Text Similarity task. The proposed approach uses a siamese stacked Bi-LSTM network to learn context-aware representations of diagnoses and ICD code descriptions. We transform each patient-visit data into sentence pairs by considering the one-to-many relationships between diagnosis and assigned ICD code descriptions. Further, we compute their semantic similarity and classify them as similar or dissimilar. The proposed approach was evaluated using 5-fold cross-validation on the MIMIC-III dataset and achieved the highest evaluation metric scores (F1-score 0.66, precision 0.67, recall 0.84) compared with other sequential models. The per-label evaluation demonstrates the performance of the proposed approach for each ICD code. 
Furthermore, the proposed approach outperformed several existing attention-based models, demonstrating the potential use of semantics in automated ICD coding.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102539"},"PeriodicalIF":3.0,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143579980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yongwang Yuan , Xianwen Fang , Ke Lu , ZhenHu Zhang
{"title":"An interpretable deep fusion framework for event log repair","authors":"Yongwang Yuan , Xianwen Fang , Ke Lu , ZhenHu Zhang","doi":"10.1016/j.is.2025.102548","DOIUrl":"10.1016/j.is.2025.102548","url":null,"abstract":"<div><div>In executing business processes, issues like information system failures or manual recording errors may lead to data loss in event logs, resulting in missing event logs. Utilizing such missing logs could seriously impact the quality of business process analysis results. To address this scenario, current advanced repair methods rely primarily on deep learning technology to provide intelligent solutions for business processes. However, deep learning technology is often considered a \"black-box\" model, lacking sufficient interpretability. No method is currently available to provide particular interpretability, especially in repairing specific missing values within the logs. This paper proposes the deep fusion interpretability framework based on artificial intelligence technology to address this issue. In the task of event log repair, this framework gradually transitions from the overall framework's local to global interpretability. It provides local interpretability from the attribute-level data flow perspective, semi-local interpretability from the event-level behavioral control-flow perspective, and global interpretability from the trace-level perspective. Next, we present various modes of multi-head attention within the framework and visualize the process of attention distribution calculation to explain how the framework repairs missing values through the profound combination of multi-head attention mode and context. 
Finally, experimental results on real public event logs show that the DFI framework can effectively repair the missing values in event logs and explain the missing value repair process.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102548"},"PeriodicalIF":3.0,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143562706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Errikos Streviniotis , Nikos Giatrakos , Yannis Kotidis , Thaleia Ntiniakou , Miguel Ponce de Leon
{"title":"RATS: A resource allocator for optimizing the execution of tumor simulations over HPC infrastructures","authors":"Errikos Streviniotis , Nikos Giatrakos , Yannis Kotidis , Thaleia Ntiniakou , Miguel Ponce de Leon","doi":"10.1016/j.is.2025.102538","DOIUrl":"10.1016/j.is.2025.102538","url":null,"abstract":"<div><div>In this work, we introduce RATS (<u>R</u>esource <u>A</u>llocator for <u>T</u>umor <u>S</u>imulations), the first optimizer for the execution of tumor simulations over HPC infrastructures. Given a set of drug therapies under in-silico study, the optimization framework of RATS can: <em>(i)</em> devise the optimal number of cores and prescribe the required number of core hours; and <em>(ii)</em> under core capacity constraints, RATS schedules the execution of simulations minimizing the overall number of core hours, simultaneously prioritizing the execution of expectedly promising in-silico trials higher compared to unpromising ones. RATS is deployed by life scientists at the Barcelona Supercomputing Center to remove the burden of blindly guessing the core hours needing to be reserved from HPC admins to study various tumor treatment methodologies, as well as to rapidly distinguish effective drug combinations, thus, potentially cutting time to market for new cancer therapies. The latter is further elevated by the RATS+ extension we plug into the initial framework. 
RATS+ employs a Transfer Learning approach to leverage optimization models and decisions from prior in-silico studies, thereby reducing the optimization effort required for new studies in this domain.</div><div>Our experimental evaluation, on real-world data derived from the execution of more than 2500 tumor simulations on the MareNostrum4 supercomputer, confirms the effectiveness of both RATS and RATS+ across the aforementioned performance dimensions.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102538"},"PeriodicalIF":3.0,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143579985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingxia Tang , Yanxuan Wei , Teng Li , Xiangwei Zheng , Cun Ji
{"title":"A hierarchical transformer-based network for multivariate time series classification","authors":"Yingxia Tang , Yanxuan Wei , Teng Li , Xiangwei Zheng , Cun Ji","doi":"10.1016/j.is.2025.102536","DOIUrl":"10.1016/j.is.2025.102536","url":null,"abstract":"<div><div>In recent years, Transformer has demonstrated considerable potential in multivariate time series classification due to its exceptional strength in capturing global dependencies. However, as a generalized approach, it still faces challenges in processing time series data, such as insufficient temporal sensitivity and inadequate ability to capture local features. In this paper, a hierarchical Transformer-based network (Hformer) is innovatively proposed to address these problems. Hformer handles time series progressively at various stages to aggregate multi-scale representations. At the start of each stage, Hformer segments the input sequence and extracts features independently on each temporal slice. Furthermore, to better accommodate multivariate time series data, a more efficient absolute position encoding as well as relative position encoding are employed by Hformer. Experimental results on 30 multivariate time series datasets of the UEA archive demonstrate that the proposed method outperforms most state-of-the-art methods.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"132 ","pages":"Article 102536"},"PeriodicalIF":3.0,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143562707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Claudio Di Ciccio, Remco Dijkman, Adela del Río Ortega, Stefanie Rinderle-Ma, Manfred Reichert
{"title":"Special issue: BPM 2022 Selected papers in Foundations and Engineering","authors":"Claudio Di Ciccio, Remco Dijkman, Adela del Río Ortega, Stefanie Rinderle-Ma, Manfred Reichert","doi":"10.1016/j.is.2025.102535","DOIUrl":"10.1016/j.is.2025.102535","url":null,"abstract":"","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"131 ","pages":"Article 102535"},"PeriodicalIF":3.0,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143510294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}