{"title":"A survey on big data classification","authors":"Keerthana G , Sherly Puspha Annabel L","doi":"10.1016/j.datak.2025.102408","DOIUrl":"10.1016/j.datak.2025.102408","url":null,"abstract":"<div><div>Big data refers to vast volumes of structured and unstructured data that are too large or complex for traditional data-processing methods to handle efficiently. The importance of big data lies in its ability to provide actionable insights and drive decision-making across various industries, such as healthcare, finance, marketing, and government, by enabling more accurate predictions, and personalized services. Moreover, traditional big data classification approaches, often struggle with big data's complexity. They failed to manage high-dimensionality, deal with non-linearity, or process data in real time. For effective big data classification, robust computing infrastructure, scalable storage solutions, and advanced algorithms are required. This survey provides a thorough assessment of 50 research papers based on big data classification, by identifying the struggle faced by current big data classification techniques to process and classify data efficiently without substantial computational resources. The analysis is enabled on a variety of scenarios and key points. In this case, this survey will enable the classification of the techniques utilized for big data classification that is made based on the rule-based, deep learning-based, optimization-based, machine learning-based techniques and so on. Furthermore, the classification of techniques, tools used, published year, used software tool, and performance metrics are contemplated for the analysis in big data classification. 
At last, the research gaps and technical problems of the techniques in a way that makes the motivations for creating an efficient model of enabling big data classification optimal.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102408"},"PeriodicalIF":2.7,"publicationDate":"2025-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Textual data augmentation using generative approaches - Impact on named entity recognition tasks","authors":"Danrun Cao , Nicolas Béchet , Pierre-François Marteau , Oussama Ahmia","doi":"10.1016/j.datak.2024.102403","DOIUrl":"10.1016/j.datak.2024.102403","url":null,"abstract":"<div><div>Industrial applications of Named Entity Recognition (NER) are usually confronted with small and imbalanced corpora. This could harm the performance of trained and finetuned recognition models, especially when they encounter unknown data. In this study we develop three generation-based data enrichment approaches, in order to increase the number of examples of underrepresented entities. We compare the impact of enriched corpora on NER models, using both non-contextual (fastText) and contextual (Bert-like) embedding models to provide discriminant features to a biLSTM-CRF used as an entity classifier. The approach is evaluated on a contract renewal detection task applied to a corpus of calls for tenders. The results show that the proposed data enrichment procedure effectively improves the NER model’s effectiveness when applied on both known and unknown data.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102403"},"PeriodicalIF":2.7,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated mapping between SDG indicators and open data: An LLM-augmented knowledge graph approach","authors":"Wissal Benjira , Faten Atigui , Bénédicte Bucher , Malika Grim-Yefsah , Nicolas Travers","doi":"10.1016/j.datak.2024.102405","DOIUrl":"10.1016/j.datak.2024.102405","url":null,"abstract":"<div><div>Meeting the Sustainable Development Goals (SDGs) presents a large-scale challenge for all countries. SDGs established by the United Nations provide a comprehensive framework for addressing global issues. To monitor progress towards these goals, we need to develop key performance indicators and integrate and analyze heterogeneous datasets. The definition of these indicators requires the use of existing data and metadata. However, the diversity of data sources and formats raises major issues in terms of structuring and integration. Despite the abundance of open data and metadata, its exploitation remains limited, leaving untapped potential for guiding urban policies towards sustainability. Thus, this paper introduces a novel approach for SDG indicator computation, leveraging the capabilities of Large Language Models (LLMs) and Knowledge Graphs (KGs). We propose a method that combines rule-based filtering with LLM-powered schema mapping to establish semantic correspondences between diverse data sources and SDG indicators, including disaggregation. Our approach integrates these mappings into a KG, which enables indicator computation by querying graph’s topology. We evaluate our method through a case study focusing on the SDG Indicator 11.7.1 about accessibility of public open spaces. 
Our experimental results show significant improvements in accuracy, precision, recall, and F1-score compared to traditional schema mapping techniques.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102405"},"PeriodicalIF":2.7,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Augmenting post-hoc explanations for predictive process monitoring with uncertainty quantification via conformalized Monte Carlo dropout","authors":"Nijat Mehdiyev, Maxim Majlatow, Peter Fettke","doi":"10.1016/j.datak.2024.102402","DOIUrl":"10.1016/j.datak.2024.102402","url":null,"abstract":"<div><div>This study presents a novel approach to improve the transparency and reliability of deep learning models in predictive process monitoring (PPM) by integrating uncertainty quantification (UQ) and explainable artificial intelligence (XAI) techniques. We introduce the conformalized Monte Carlo dropout method, which combines Monte Carlo dropout for uncertainty estimation with conformal prediction (CP) to generate reliable prediction intervals. Additionally, we enhance post-hoc explanation techniques such as individual conditional expectation (ICE) plots and partial dependence plots (PDP) with uncertainty information, including credible and conformal predictive intervals. Our empirical evaluation in the manufacturing industry demonstrates the effectiveness of these approaches in refining strategic and operational decisions. This research contributes to advancing PPM and machine learning by bridging the gap between model transparency and high-stakes decision-making scenarios.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102402"},"PeriodicalIF":2.7,"publicationDate":"2024-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Turning Conceptual Modeling Institutional – The prescriptive role of conceptual models in transforming institutional reality","authors":"Owen Eriksson , Paul Johannesson , Maria Bergholtz , Pär Ågerfalk","doi":"10.1016/j.datak.2024.102404","DOIUrl":"10.1016/j.datak.2024.102404","url":null,"abstract":"<div><div>It has traditionally been assumed that information systems describe physical reality. However, this assumption is becoming obsolete as digital infrastructures are increasingly part of real-world experiences. Digital infrastructures (ubiquitous and scalable information systems) no longer merely map physical reality representations onto digital objects but increasingly assume an active role in creating, shaping, and governing physical reality. We currently witness an “ontological reversal”, where conceptual models and digital infrastructures change physical reality. Still, the fundamental assumption remains that physical reality is the only real world. However, to fully embrace the implications of the ontological reversal, conceptual modeling needs an “institutional turn” that abandons the idea that physical reality always takes priority. Institutional reality, which includes, for example, institutional entities such as organizations, contracts, and payment transactions, is not simply part of physical reality detached from digital infrastructures. Digital infrastructures are part of institutional reality. Accordingly, the research question we address is: What are the fundamental constructs in the design of digital infrastructures that constitute and transform institutional reality? In answering this question, we develop a foundation for conceptual modeling, which we illustrate by modeling the institution of open banking and its associated digital infrastructure. 
In the article, we identify digital institutional entities, digital agents regulated by software, and digital institutional actions as critical constructs for modeling digital infrastructures in institutional contexts. In so doing, we show how conceptual modeling can improve our understanding of the digital transformation of institutional reality and the prescriptive role of conceptual modeling. We also generate theoretical insights about the need for legitimacy and liability that advance the study and practice of digital infrastructure design and its consequences.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102404"},"PeriodicalIF":2.7,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing the intelligibility of decision trees with concise and reliable probabilistic explanations","authors":"Louenas Bounia, Insaf Setitra","doi":"10.1016/j.datak.2024.102394","DOIUrl":"10.1016/j.datak.2024.102394","url":null,"abstract":"<div><div>This work deals with explainable artificial intelligence (XAI), specifically focusing on improving the intelligibility of decision trees through reliable and concise probabilistic explanations. Decision trees are popular because they are considered highly interpretable. Due to cognitive limitations, abductive explanations can be too large to be interpretable by human users. When this happens, decision trees are far from being easily interpretable. In this context, our goal is to enhance the intelligibility of decision trees by using probabilistic explanations. Drawing inspiration from previous work on approximating probabilistic explanations, we propose a greedy algorithm that enables us to derive concise and reliable probabilistic explanations for decision trees. We provide a detailed description of this algorithm and compare it to the state-of-the-art SAT encoding. In the order to highlight the gains in intelligibility while emphasizing its empirical effectiveness, we will conduct in-depth experiments on binary decision trees as well as on cases of multi-class classification. We expect significant gains in intelligibility. Finally, to demonstrate the usefulness of such an approach in a practical context, we chose to carry out additional experiments focused on text classification, in particular the detection of emotions in tweets. 
Our objective is to determine the set of words explaining the emotion predicted by the decision tree.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102394"},"PeriodicalIF":2.7,"publicationDate":"2024-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coupling MDL and Markov chain Monte Carlo to sample diverse pattern sets","authors":"François Camelin , Samir Loudni , Gilles Pesant , Charlotte Truchet","doi":"10.1016/j.datak.2024.102393","DOIUrl":"10.1016/j.datak.2024.102393","url":null,"abstract":"<div><div>Exhaustive methods of pattern extraction in a database face real obstacles to speed and output control of patterns: a large number of patterns are extracted, many of which are redundant. Pattern extraction methods through sampling, which allow for controlling the size of the outputs while ensuring fast response times, provide a solution to these two problems. However, these methods do not provide high-quality patterns: they return patterns that are very infrequent in the database. Furthermore, they do not scale. To ensure more frequent and diversified patterns in the output, we propose integrating compression methods into sampling to select the most representative patterns from the sampled transactions. We demonstrate that our approach improves the state of the art in terms of diversity of produced patterns.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102393"},"PeriodicalIF":2.7,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiBenchLLM: Historical Inquiry Benchmarking for Large Language Models","authors":"Mathieu Chartier , Nabil Dakkoune , Guillaume Bourgeois , Stéphane Jean","doi":"10.1016/j.datak.2024.102383","DOIUrl":"10.1016/j.datak.2024.102383","url":null,"abstract":"<div><div>Large Language Models (LLMs) such as ChatGPT or Bard have significantly transformed information retrieval and captured the public’s attention with their ability to generate customized responses across various topics. In this paper, we analyze the capabilities of different LLMs to generate responses related to historical facts in French. Our objective is to evaluate their reliability, comprehensiveness, and relevance for direct usability or extraction. To accomplish this, we propose a benchmark consisting of numerous historical questions covering various types, themes, and difficulty levels. Our evaluation of responses provided by 14 selected LLMs reveals several limitations in both content and structure. In addition to an overall insufficient precision rate, we observe uneven treatment of the French language, along with issues related to verbosity and inconsistency in the responses generated by LLMs.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102383"},"PeriodicalIF":2.7,"publicationDate":"2024-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering of timed sequences – Application to the analysis of care pathways","authors":"Thomas Guyet , Pierre Pinson , Enoal Gesny","doi":"10.1016/j.datak.2024.102401","DOIUrl":"10.1016/j.datak.2024.102401","url":null,"abstract":"<div><div>Improving the future of healthcare starts by better understanding the current actual practices in . This motivates the objective of discovering typical care pathways from patient data. Revealing care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms.</div><div>In this article, we adapt two methods developed for time series to the clustering of timed sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences. This approach is experimented with and evaluated on synthetic and .</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102401"},"PeriodicalIF":2.7,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic vs. LLM-based approach: A case study of KOnPoTe vs. Claude for ontology population from French advertisements","authors":"Aya Sahbi , Céline Alec , Pierre Beust","doi":"10.1016/j.datak.2024.102392","DOIUrl":"10.1016/j.datak.2024.102392","url":null,"abstract":"<div><div>Automatic ontology population is the process of identifying, extracting, and integrating relevant information from diverse sources to instantiate the classes and properties specified in an ontology, thereby creating a Knowledge Graph (KG) for a particular domain. In this study, we evaluate two approaches for ontology population from text: KOnPoTe, a semantic technique that employs textual and domain knowledge analysis, and a generative AI method leveraging Claude, a Large Language Model (LLM). We conduct comparative experiments on three French advertisement domains: real estate, boats, and restaurants to assess the performance of these techniques. Our analysis highlights the respective strengths and limitations of the semantic approach and the LLM-based one in the context of the ontology population process.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"156 ","pages":"Article 102392"},"PeriodicalIF":2.7,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143133538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}