{"title":"Turning Conceptual Modeling Institutional – The prescriptive role of conceptual models in transforming institutional reality","authors":"Owen Eriksson , Paul Johannesson , Maria Bergholtz , Pär Ågerfalk","doi":"10.1016/j.datak.2024.102404","DOIUrl":"10.1016/j.datak.2024.102404","url":null,"abstract":"<div><div>It has traditionally been assumed that information systems describe physical reality. However, this assumption is becoming obsolete as digital infrastructures are increasingly part of real-world experiences. Digital infrastructures (ubiquitous and scalable information systems) no longer merely map physical reality representations onto digital objects but increasingly assume an active role in creating, shaping, and governing physical reality. We currently witness an “ontological reversal”, where conceptual models and digital infrastructures change physical reality. Still, the fundamental assumption remains that physical reality is the only real world. However, to fully embrace the implications of the ontological reversal, conceptual modeling needs an “institutional turn” that abandons the idea that physical reality always takes priority. Institutional reality, which includes, for example, institutional entities such as organizations, contracts, and payment transactions, is not simply part of physical reality detached from digital infrastructures. Digital infrastructures are part of institutional reality. Accordingly, the research question we address is: What are the fundamental constructs in the design of digital infrastructures that constitute and transform institutional reality? In answering this question, we develop a foundation for conceptual modeling, which we illustrate by modeling the institution of open banking and its associated digital infrastructure. In the article, we identify digital institutional entities, digital agents regulated by software, and digital institutional actions as critical constructs for modeling digital infrastructures in institutional contexts. In so doing, we show how conceptual modeling can improve our understanding of the digital transformation of institutional reality and the prescriptive role of conceptual modeling. We also generate theoretical insights about the need for legitimacy and liability that advance the study and practice of digital infrastructure design and its consequences.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102404"},"PeriodicalIF":2.7,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing the intelligibility of decision trees with concise and reliable probabilistic explanations","authors":"Louenas Bounia, Insaf Setitra","doi":"10.1016/j.datak.2024.102394","DOIUrl":"10.1016/j.datak.2024.102394","url":null,"abstract":"<div><div>This work deals with explainable artificial intelligence (XAI), specifically focusing on improving the intelligibility of decision trees through reliable and concise probabilistic explanations. Decision trees are popular because they are considered highly interpretable. Due to cognitive limitations, abductive explanations can be too large to be interpretable by human users. When this happens, decision trees are far from being easily interpretable. In this context, our goal is to enhance the intelligibility of decision trees by using probabilistic explanations. Drawing inspiration from previous work on approximating probabilistic explanations, we propose a greedy algorithm that enables us to derive concise and reliable probabilistic explanations for decision trees. We provide a detailed description of this algorithm and compare it to the state-of-the-art SAT encoding. In the order to highlight the gains in intelligibility while emphasizing its empirical effectiveness, we will conduct in-depth experiments on binary decision trees as well as on cases of multi-class classification. We expect significant gains in intelligibility. Finally, to demonstrate the usefulness of such an approach in a practical context, we chose to carry out additional experiments focused on text classification, in particular the detection of emotions in tweets. Our objective is to determine the set of words explaining the emotion predicted by the decision tree.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102394"},"PeriodicalIF":2.7,"publicationDate":"2024-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coupling MDL and Markov chain Monte Carlo to sample diverse pattern sets","authors":"François Camelin , Samir Loudni , Gilles Pesant , Charlotte Truchet","doi":"10.1016/j.datak.2024.102393","DOIUrl":"10.1016/j.datak.2024.102393","url":null,"abstract":"<div><div>Exhaustive methods of pattern extraction in a database face real obstacles to speed and output control of patterns: a large number of patterns are extracted, many of which are redundant. Pattern extraction methods through sampling, which allow for controlling the size of the outputs while ensuring fast response times, provide a solution to these two problems. However, these methods do not provide high-quality patterns: they return patterns that are very infrequent in the database. Furthermore, they do not scale. To ensure more frequent and diversified patterns in the output, we propose integrating compression methods into sampling to select the most representative patterns from the sampled transactions. We demonstrate that our approach improves the state of the art in terms of diversity of produced patterns.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102393"},"PeriodicalIF":2.7,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models","authors":"Mathieu Chartier , Nabil Dakkoune , Guillaume Bourgeois , Stéphane Jean","doi":"10.1016/j.datak.2024.102383","DOIUrl":"10.1016/j.datak.2024.102383","url":null,"abstract":"<div><div>Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public’s attention with their ability to generate customized responses across various topics. In this paper, we analyze the capabilities of different LLMs to generate responses related to historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance for direct usability or extraction. To accomplish this, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with issues related to verbosity and inconsistency in the responses generated by LLMs.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102383"},"PeriodicalIF":2.7,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of timed sequences – Application to the analysis of care pathways","authors":"Thomas Guyet , Pierre Pinson , Enoal Gesny","doi":"10.1016/j.datak.2024.102401","DOIUrl":"10.1016/j.datak.2024.102401","url":null,"abstract":"<div><div>Improving the future of healthcare starts by better understanding the current actual practices in . This motivates the objective of discovering typical care pathways from patient data. Revealing care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms.</div><div>In this article, we adapt two methods developed for time series to the clustering of timed sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences. This approach is experimented with and evaluated on synthetic and .</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102401"},"PeriodicalIF":2.7,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic vs. LLM-based approach: A case study of KOnPoTe vs. Claude for ontology population from French advertisements","authors":"Aya Sahbi , Céline Alec , Pierre Beust","doi":"10.1016/j.datak.2024.102392","DOIUrl":"10.1016/j.datak.2024.102392","url":null,"abstract":"<div><div>Automatic ontology population is the process of identifying, extracting, and integrating relevant information from diverse sources to instantiate the classes and properties specified in an ontology, thereby creating a Knowledge Graph (KG) for a particular domain. In this study, we evaluate two approaches for ontology population from text: KOnPoTe, a semantic technique that employs textual and domain knowledge analysis, and a generative AI method leveraging Claude, a Large Language Model (LLM). We conduct comparative experiments on three French advertisement domains: real estate, boats, and restaurants to assess the performance of these techniques. Our analysis highlights the respective strengths and limitations of the semantic approach and the LLM-based one in the context of the ontology population process.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102392"},"PeriodicalIF":2.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Curvature constrained MPNNs: Improving message passing with local structural properties","authors":"Hugo Attali, Davide Buscaldi, Nathalie Pernelle","doi":"10.1016/j.datak.2024.102382","DOIUrl":"10.1016/j.datak.2024.102382","url":null,"abstract":"<div><div>Graph neural networks operate through an iterative process that involves updating node representations by aggregating information from neighboring nodes, a concept commonly referred to as the message passing paradigm. Despite their widespread usage, a recognized issue with these networks is the tendency to over-squash, leading to diminished efficiency. Recent studies have highlighted that this bottleneck phenomenon is often associated with specific regions within graphs, that can be identified through a measure of edge curvature. In this paper, we present a novel framework designed for any Message Passing Neural Network (MPNN) architecture, wherein information distribution is guided by the curvature of the graph’s edges. Our approach aims to address the over-squashing problem by strategically considering the geometric properties of the underlying graph. The experiments carried out show that our method demonstrates significant improvements in mitigating over-squashing, surpassing the performance of existing graph rewiring techniques across multiple node classification datasets.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102382"},"PeriodicalIF":2.7,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving multi-view ensemble learning with Round-Robin feature set partitioning","authors":"Aditya Kumar , Jainath Yadav","doi":"10.1016/j.datak.2024.102380","DOIUrl":"10.1016/j.datak.2024.102380","url":null,"abstract":"<div><div>Multi-view Ensemble Learning (MEL) techniques have shown remarkable success in improving the accuracy and resilience of classification algorithms by combining multiple base classifiers trained over different perspectives of a dataset, known as views. One crucial factor affecting ensemble performance is the selection of diverse and informative feature subsets. Feature Set Partitioning (FSP) methods address this challenge by creating distinct views of features for each base classifier. In this context, we propose the Round-Robin Feature Set Partitioning (<span><math><mi>RR</mi></math></span>-FSP) technique, which introduces a novel approach to feature allocation among views. This novel approach evenly distributes highly correlated features across views, thereby enhancing ensemble diversity, promoting balanced feature utilization, and encouraging the more equitable distribution of correlated features, <span><math><mi>RR</mi></math></span>-FSP contributes to the advancement of MEL techniques. Through experiments on various datasets, we demonstrate that <span><math><mi>RR</mi></math></span>-FSP offers improved classification accuracy and robustness, making it a valuable addition to the arsenal of FSP techniques for MEL.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102380"},"PeriodicalIF":2.7,"publicationDate":"2024-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"White box specification of intervention policies for prescriptive process monitoring","authors":"Mahmoud Shoush, Marlon Dumas","doi":"10.1016/j.datak.2024.102379","DOIUrl":"10.1016/j.datak.2024.102379","url":null,"abstract":"<div><div>Prescriptive process monitoring methods seek to enhance business process performance by triggering real-time interventions, such as offering discounts to increase the likelihood of a positive outcome (e.g., a purchase). At the core of a prescriptive process monitoring method lies an intervention policy, which determines under which conditions and when to trigger an intervention. While state-of-the-art prescriptive process monitoring approaches rely on black-box intervention policies derived through reinforcement learning, algorithmic decision-making requirements sometimes dictate that the business stakeholders must be able to understand, justify, and adjust these intervention policies manually. To address this requirement, this article proposes <em>WB-PrPM</em> (White-Box Prescriptive Process Monitoring), a framework that enables stakeholders to define intervention policies in business processes. WB-PrPM is a rule-based system that helps decision-makers balance the demand for effective interventions with the imperatives of limited resource capacity. The framework incorporates an automated method for tuning the parameters of the intervention policies to optimize a total gain function. An evaluation is presented using real-life datasets to examine the tradeoffs among various parameters. The evaluation reveals that different variants of the proposed framework outperform existing baselines in terms of total gain, even when default parameter values are used. Additionally, the automated parameter optimization approach further enhances the total gain.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"155 ","pages":"Article 102379"},"PeriodicalIF":2.7,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142721292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A goal-oriented document-grounded dialogue based on evidence generation","authors":"Yong Song , Hongjie Fan , Junfei Liu , Yunxin Liu , Xiaozhou Ye , Ye Ouyang","doi":"10.1016/j.datak.2024.102378","DOIUrl":"10.1016/j.datak.2024.102378","url":null,"abstract":"<div><div>Goal-oriented Document-grounded Dialogue (DGD) is used for retrieving specific domain documents, assisting users in document content retrieval, question answering, and document management. Existing methods typically employ keyword extraction and vector space models to understand the content of documents, identify the intent of questions, and generate answers based on the capabilities of generation models. However, challenges remain in semantic understanding, long text processing, and context understanding. The emergence of Large Language Models (LLMs) has brought new capabilities in context learning and step-by-step reasoning. These models, combined with Retrieval Augmented Generation(RAG) methods, have made significant breakthroughs in text comprehension, intent detection, language organization, offering exciting prospects for DGD research. However, the “hallucination” issue arising from LLMs requires complementary methods to ensure the credibility of their outputs. In this paper we propose a goal-oriented document-grounded dialogue approach based on evidence generation using LLMs. It designs and implements methods for document content retrieval & reranking, fine-tuning and inference, and evidence generation. Through experiments, the method of combining LLMs with vector space model, or with key information matching technique is used as a comparison, the accuracy of the proposed method is improved by 21.91% and 12.81%, while the comprehensiveness is increased by 10.89% and 69.83%, coherence is enhanced by 38.98% and 53.27%, and completeness is boosted by 16.13% and 36.97%, respectively, on average. Additional, ablation analysis conducted reveals that the evidence generation method also contributes significantly to the comprehensiveness and completeness.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"155 ","pages":"Article 102378"},"PeriodicalIF":2.7,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142705366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}