{"title":"Explainable influenza forecasting scheme using DCC-based feature selection","authors":"Sungwoo Park , Jaeuk Moon , Seungwon Jung , Seungmin Rho , Eenjun Hwang","doi":"10.1016/j.datak.2023.102256","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102256","url":null,"abstract":"<div><p>As influenza is easily converted to another type of virus and spreads very quickly from person to person, it is more likely to develop into a pandemic. Even though vaccines are the most effective way to prevent influenza, it takes a lot of time to produce them. Due to this, there has been an imbalance in the supply and demand of influenza vaccines every year. For a smooth vaccine supply, it is necessary to accurately forecast vaccine demand at least three to six months in advance. So far, many machine learning-based predictive models have shown excellent performance. However, their use has been limited by performance deterioration caused by inappropriate training data and by their inability to explain the results. To solve these problems, in this paper, we propose an explainable influenza forecasting model. In particular, the model selects highly related data based on the distance correlation coefficient for effective training and explains the prediction results using Shapley additive explanations. We evaluated its performance through extensive experiments. We report some of the results.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102256"},"PeriodicalIF":2.5,"publicationDate":"2023-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138471983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A-MKMC: An effective adaptive-based multilevel K-means clustering with optimal centroid selection using hybrid heuristic approach for handling the incomplete data","authors":"Hima Vijayan , Subramaniam M , Sathiyasekar K","doi":"10.1016/j.datak.2023.102243","DOIUrl":"10.1016/j.datak.2023.102243","url":null,"abstract":"<div><p><span><span>In general, clustering is defined as partitioning similar and dissimilar objects into several groups. It has been widely used in applications like pattern recognition, image processing, and data analysis. When the dataset contains some missing data or value, it is termed incomplete data. In such implications, the incomplete dataset issue is untreatable while validating the data. Due to these flaws, the quality or standard level of the data gets an impact. Hence, the handling of missing values is done by influencing the clustering mechanisms for sorting out the missing data. Yet, the traditional </span>clustering algorithms<span> fail to combat the issues as it is not supposed to maintain large dimensional data. It is also caused by errors of human intervention or inaccurate outcomes. To alleviate the challenging issue of incomplete data, a novel clustering algorithm is proposed. Initially, incomplete or mixed data is garnered from the five different standard data sources. Once the data is to be collected, it is undergone the pre-processing phase, which is accomplished using data normalization. Subsequently, the final step is processed by the new clustering algorithm that is termed Adaptive centroid based Multilevel K-Means Clustering (A-MKMC), in which the cluster centroid is optimized by integrating the two conventional algorithms such as Border Collie Optimization (BCO) and </span></span>Whale Optimization Algorithm<span> (WOA) named as Hybrid Border Collie Whale Optimization (HBCWO). Therefore, the validation of the novel clustering model is estimated using various measures and compared against traditional mechanisms. 
From the overall result analysis, the accuracy and precision of the designed HBCWO-A-MKMC method attain 93% and 95%, respectively. Hence, the adaptive clustering process exploits the higher performance that aids in sorting out the missing data issue compared to the other conventional methods.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102243"},"PeriodicalIF":2.5,"publicationDate":"2023-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138534224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Global and item-by-item reasoning fusion-based multi-hop KGQA","authors":"Tongzhao Xu, Turdi Tohti, Askar Hamdulla","doi":"10.1016/j.datak.2023.102244","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102244","url":null,"abstract":"<div><p><span><span><span>Existing embedded multi-hop Question Answering over Knowledge Graph (KGQA) methods attempted to handle Knowledge Graph (KG) sparsity using Knowledge Graph Embedding (KGE) to improve KGQA. However, they almost ignore the intermediate path reasoning process of answer prediction, do not consider the information interaction between the question and the KG, and rarely consider the problem that the triple scoring reasoning mechanism is inadequate in extracting deep features. To address the above issues, this paper proposes Global and Item-by-item Reasoning Fusion-based Multi-hop KGQA (GIRFM-KGQA). In global reasoning, a convolutional attention reasoning mechanism is proposed and fused with the triple scoring reasoning mechanism to jointly implement global reasoning, thus enhancing the long-chain reasoning ability of the global reasoning model. In item-by-item reasoning, the reasoning path is formed by serially predicting relations, and then the answer is predicted, which effectively solves the problem that the embedded multi-hop KGQA method lacks the intermediate path reasoning ability. In addition, we introduce an information interaction method between the question and the KG to improve the accuracy of the answer prediction. Finally, we merge the global reasoning score with the item-by-item reasoning score to jointly predict the answer. Our model, compared to the </span>baseline model (EmbedKGQA), achieves an accuracy improvement of 0.5% and 2.7% on two-hop questions, and 6.2% and 4.6% on three-hop questions for the MetaQA_Full and MetaQA_Half datasets, and 1.7% on the WebQuestionSP dataset, respectively. 
The experimental results show that the proposed model can effectively improve the accuracy of the multi-hop KGQA model and enhance the </span>interpretability<span> of the model. We have made our model’s source code available on GitHub: </span></span><span>https://github.com/feixiongfeixiong/GIRFM</span>.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102244"},"PeriodicalIF":2.5,"publicationDate":"2023-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138430916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The power and potentials of Flexible Query Answering Systems: A critical and comprehensive analysis","authors":"Troels Andreasen , Gloria Bordogna , Guy De Tré , Janusz Kacprzyk , Henrik Legind Larsen , Sławomir Zadrożny","doi":"10.1016/j.datak.2023.102246","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102246","url":null,"abstract":"<div><p>The popularity of chatbots, such as ChatGPT, has brought research attention to question answering systems, capable to generate natural language answers to user’s natural language queries. However, also in other kinds of systems, flexibility of querying, including but also going beyond the use of natural language, is an important feature. With this consideration in mind the paper presents a critical and comprehensive analysis of recent developments, trends and challenges of Flexible Query Answering Systems (FQASs). Flexible query answering is a multidisciplinary research field that is not limited to question answering in natural language, but comprises other query forms and interaction modalities, which aim to provide powerful means and techniques for better reflecting human preferences and intentions to retrieve relevant information. It adopts methods at the crossroad of several disciplines among which Information Retrieval (IR), databases, knowledge based systems, knowledge and data engineering, Natural Language Processing (NLP) and the semantic web may be mentioned. The analysis principles are inspired by the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) framework, characterized by a top-down process, starting with relevant keywords for the topic of interest to retrieve relevant articles from meta-sources and complementing these articles with other relevant articles from seed sources identified by a bottom-up process. To mine the retrieved publication data, a network analysis is performed, which allows to present in a synthetic way intrinsic topics of the selected publications. 
Issues dealt with are related to query answering methods, both model-based and data-driven (the latter based on either machine learning or deep learning), and to their needs for explainability and fairness to deal with big data, notably by taking into account data veracity. Conclusions point out trends and challenges to help better shaping the future of the FQAS field.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102246"},"PeriodicalIF":2.5,"publicationDate":"2023-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23001064/pdfft?md5=a520b95a7109e1b8dddc31cb9594841b&pid=1-s2.0-S0169023X23001064-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138471982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CALEB: A Conditional Adversarial Learning Framework to enhance bot detection","authors":"Ilias Dimitriadis, George Dialektakis, Athena Vakali","doi":"10.1016/j.datak.2023.102245","DOIUrl":"10.1016/j.datak.2023.102245","url":null,"abstract":"<div><p><span>The high growth of Online Social Networks<span> (OSNs) over the last few years has allowed automated accounts, known as social bots, to gain ground. As highlighted by other researchers, many of these bots have malicious purposes and tend to mimic human behavior, posing high-level security threats on OSN platforms. Moreover, recent studies have shown that social bots evolve over time by reforming and reinventing unforeseen and sophisticated characteristics, making them capable of evading the current machine learning<span> state-of-the-art bot detection systems. This work is motivated by the critical need to establish adaptive bot detection methods in order to proactively capture unseen evolved bots towards healthier OSNs interactions. In contrast with most earlier supervised ML approaches which are limited by the inability to effectively detect new types of bots, this paper proposes CALEB, a robust end-to-end proactive framework based on the Conditional </span></span></span>Generative Adversarial Network<span><span> (CGAN) and its extension, Auxiliary Classifier GAN (AC-GAN), to simulate bot evolution by creating realistic synthetic instances of different bot types. These simulated evolved bots augment existing bot datasets and therefore enhance the detection of emerging generations of bots before they even appear. Furthermore, we show that our augmentation approach overpasses other earlier augmentation techniques which fail at simulating evolving bots. Extensive experimentation on well established public bot datasets, show that our approach offers a performance boost of up to 10% regarding the detection of new unseen bots. 
Finally, the use of the AC-GAN </span>Discriminator as a bot detector, has outperformed former ML approaches, showcasing the efficiency of our end to end framework.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102245"},"PeriodicalIF":2.5,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135763694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What is the business value of your data? A multi-perspective empirical study on monetary valuation factors and methods for data governance","authors":"Frank Bodendorf, Jörg Franke","doi":"10.1016/j.datak.2023.102242","DOIUrl":"10.1016/j.datak.2023.102242","url":null,"abstract":"<div><p><span>Digitalization has greatly increased the importance of data in recent years, making data an indispensable resource for value creation in our time. There is currently still a lack of theories as well as practicable methods and techniques for the monetary valuation of data, and data is therefore not yet sufficiently managed in terms of business management principles. In this context, this research is intended to design theory ingrained principles for a multidimensional conceptual approach to the monetary valuation of data as assets. We draw on the theory of dynamic capabilities as a further development of resource theory as well as value theory. To this end, the research conducts a qualitative field study followed by a quantitative survey study. Literature analysis is used to explain different dimensions in the qualitative field study. Structural equation modeling is used to analyze empirical data collected in the quantitative study. The results show that data value determination is a multidimensional and hierarchical construct consisting of three primary dimensions. These are the benefit-oriented, cost-oriented, and quality-oriented dimensions. 
The results also confirm that institutional pressures (coercive, normative, mimetic) that influence </span>organizational behaviors lead to a greater intention for organizations to adapt a monetary data value determination.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102242"},"PeriodicalIF":2.5,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135714069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for approximate product search using faceted navigation and user preference ranking","authors":"Damir Vandic , Lennart J. Nederstigt , Flavius Frasincar , Uzay Kaymak , Enzo Ido","doi":"10.1016/j.datak.2023.102241","DOIUrl":"10.1016/j.datak.2023.102241","url":null,"abstract":"<div><p>One of the problems that e-commerce users face is that the desired products are sometimes not available and Web shops fail to provide similar products due to their exclusive reliance on Boolean faceted search. User preferences are also often not taken into account. In order to address these problems, we present a novel framework specifically geared towards approximate faceted search within the product catalog of a Web shop. It is based on adaptations to the p-norm extended Boolean model, to account for the domain-specific characteristics of faceted search in an e-commerce environment. These e-commerce specific characteristics are, for example, the use of quantitative properties and the presence of user preferences. Our approach explores the concept of facet similarity functions in order to better match products to queries. In addition, the user preferences are used to assign importance weights to the query terms. Using a large-scale experimental setup based on real-world data, we conclude that the proposed algorithm outperforms the considered benchmark algorithms. 
Last, we have performed a user-based study in which we found that users who use our approach find more relevant products with less effort.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"149 ","pages":"Article 102241"},"PeriodicalIF":2.5,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135664518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering and evaluating organizational knowledge from textual data: Application to crisis management","authors":"Dhouha Grissa, Eric Andonoff, Chihab Hanachi","doi":"10.1016/j.datak.2023.102237","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102237","url":null,"abstract":"<div><p><span><span>Crisis management effectiveness relies mainly on the quality of the distributed human organization deployed for saving lives, limiting damage and reducing risks. Organizations set up in this context are not always predefined and static; they could evolve and new forms could emerge since actors, such as volunteers or NGO, could join dynamically to collaborate. To improve crisis resolution effectiveness, it is first important to understand, analyze and evaluate such dynamic organizations in order to adjust crisis management plans and ease coordination among actors. Giving a textual experience feedback from past crisis, the objective of this paper is to discover the organizational structure deployed in the considered crisis and then evaluate it according to a set of criteria. For that purpose, we combine in a coherent framework text and </span>association rule mining for pattern discovery and annotation, and multi-agent system models and techniques for formally building and evaluating organizational structures. We present the </span><em>OSminer</em> algorithm that discovers association rules based on relevant textual patterns and then builds an organizational structure including three main relations between actors: power, control and coordination. A real-life case study, a flood crisis hitting the south west of France, serves as a basis for testing/experimenting our solution. The organizational structure, discovered in this case study, has 24 actors. Its evaluation indicates its efficiency, but shows that it is neither robust nor flexible. 
Our findings highlight the potential of our approach to discover and evaluate organizational structures from a text recording interactions between stakeholders in a crisis context.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"148 ","pages":"Article 102237"},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92046670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User-generated short-text classification using cograph editing-based network clustering with an application in invoice categorization","authors":"Dewan F. Wahid , Elkafi Hassini","doi":"10.1016/j.datak.2023.102238","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102238","url":null,"abstract":"<div><p>Rapid adaptation of online business platforms in every sector creates an enormous amount of user-generated textual data related to providing product or service descriptions, reviewing, marketing, invoicing and bookkeeping. These data are often short in size, noisy (e.g., misspellings, abbreviations), and do not have accurate classifying labels (line-item categories). Classifying these user-generated short-text data with appropriate line-item categories is crucial for corresponding platforms to understand users’ needs. This paper proposed a framework for user-generated short-text classification based on identified line-item categories. In the line-item identification phase, we used cograph editing (CoE)-based clustering on keywords network, which can be formulated from users’ generated short-texts. We also proposed integer linear programming (ILP) formulations for CoE on weighted networks and designed a heuristic algorithm to identify clusters in large-scale networks. Finally, we outlined an application of this framework to categorize invoices in an empirical setting. 
Our framework showed promising results in identifying invoice line-item categories for large-scale data.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"148 ","pages":"Article 102238"},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92136240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges of a Data Ecosystem for scientific data","authors":"Edoardo Ramalli, Barbara Pernici","doi":"10.1016/j.datak.2023.102236","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102236","url":null,"abstract":"<div><p>Data Ecosystems (DE) are used across various fields and applications. They facilitate collaboration between organizations, such as companies or research institutions, enabling them to share data and services. A DE can boost research outcomes by managing and extracting value from the increasing volume of generated and shared data in the last decades. However, the adoption of DE solutions for scientific data by R&D departments and scientific communities is still difficult. Scientific data are challenging to manage, and, as a result, a considerable part of this information still needs to be annotated and organized in order to be shared. This work discusses the challenges of employing DE in scientific domains and the corresponding potential mitigations. First, scientific data and their typologies are contextualized, then their unique characteristics are discussed. Typical properties regarding their high heterogeneity and uncertainty make assessing their consistency and accuracy problematic. In addition, this work discusses the specific requirements expressed by the scientific communities when it comes to integrating a DE solution into their workflow. The unique properties of scientific data and domain-specific requirements create a challenging setting for adopting DEs. The challenges are expressed as general research questions, and this work explores the corresponding solutions in terms of data management aspects. 
Finally, the paper presents a real-world scenario with more technical details.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"148 ","pages":"Article 102236"},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23000964/pdfft?md5=98e3f9c9e5690c131b72c032eddd9253&pid=1-s2.0-S0169023X23000964-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92046668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}