{"title":"An explainable machine learning approach for automated medical decision support of heart disease","authors":"Francisco Mesquita, Gonçalo Marques","doi":"10.1016/j.datak.2024.102339","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102339","url":null,"abstract":"<div><p>Coronary Heart Disease (CHD) is the dominant cause of mortality around the world. Every year, it causes about 3.9 million deaths in Europe and 1.8 million in the European Union (EU). It is responsible for 45 % and 37 % of all deaths in Europe and the European Union, respectively. Using machine learning (ML) to predict heart diseases is one of the most promising research topics, as it can improve healthcare and consequently increase the longevity of people's lives. However, although the ability to interpret the results of the predictive model is essential, most of the related studies do not propose explainable methods. To address this problem, this paper presents a classification method that not only exhibits reliable performance but is also interpretable, ensuring transparency in its decision-making process. SHapley Additive exPlanations, known as the SHAP method was chosen for model interpretability. This approach presents a comparison between different classifiers and parameter tuning techniques, providing all the details necessary to replicate the experiment and help future researchers working in the field. The proposed model achieves similar performance to those proposed in the literature, and its predictions are fully interpretable.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102339"},"PeriodicalIF":2.7,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000636/pdfft?md5=9bdfa8117c5ce50d0508986a80981671&pid=1-s2.0-S0169023X24000636-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141592969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive methodology to construct standardised datasets for Science and Technology Parks","authors":"Olga Francés, Javi Fernández, José Abreu-Salas, Yoan Gutiérrez, Manuel Palomar","doi":"10.1016/j.datak.2024.102338","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102338","url":null,"abstract":"<div><p>This work presents a standardised approach to create datasets for Science and Technology Parks (STPs), facilitating future analysis of STP characteristics, trends and performance. STPs are the most representative examples of innovation ecosystems. The ETL (extraction-transformation-load) structure was adapted to a global field study of STPs. A selection stage and quality check were incorporated, and the methodology was applied to Spanish STPs. This study applies diverse techniques such as expert labelling and information extraction which uses language technologies. A novel methodology for building quality and standardised STP datasets was designed and applied to a Spanish STP case study with 49 STPs. An updatable dataset and a list of the main features impacting STPs are presented. Twenty-one (<em>n</em> = 21) core features were refined and selected, with fifteen of them (71.4 %) being robust enough for developing further quality analysis. The methodology presented integrates different sources with heterogeneous information that is often decentralised, disaggregated and in different formats: excel files, and unstructured information in HTML or PDF format. The existence of this updatable dataset and the defined methodology will enable powerful AI tools to be applied that focus on more sophisticated analysis, such as taxonomy, monitoring, and predictive and prescriptive analytics in the innovation ecosystems field.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102338"},"PeriodicalIF":2.7,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141542542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Providing healthcare shopping advice through knowledge-based virtual agents","authors":"Claire Deventer, Pietro Zidda","doi":"10.1016/j.datak.2024.102336","DOIUrl":"10.1016/j.datak.2024.102336","url":null,"abstract":"<div><p>Knowledge-based virtual shopping agents, that advise their users about which products to buy, are well used in technical markets such as healthcare e-commerce. To ensure the proper adoption of this technology, it is important to consider aspects of users’ psychology early in the software design process. When traditional adoption models such as UTAUT-2 work well for many technologies, they overlook important specificities of the healthcare e-commerce domain and of knowledge-based virtual agents technology. Drawing upon health information technology and virtual agent literature, we propose a complementary adoption model incorporating new predictors and moderators reflecting these domains’ specificities. The model is tested using 903 observations gathered through an online survey conducted in collaboration with a major actor in the herbal medicine market. Our model can serve as a basis for many phases of the knowledge-based agents software development. We propose actionable recommendations for practitioners and ideas for further research.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102336"},"PeriodicalIF":2.7,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGT: A Clause Graph Transformer Structure for aspect-based sentiment analysis","authors":"Zelong Su , Bin Gao , Xiaoou Pan , Zhengjun Liu , Yu Ji , Shutian Liu","doi":"10.1016/j.datak.2024.102332","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102332","url":null,"abstract":"<div><p>In the realm of natural language processing (NLP), aspect-based sentiment analysis plays a pivotal role. Recently, there has been a growing emphasis on techniques leveraging Graph Convolutional Neural Network (GCN). However, there are several challenges associated with current approaches: (1) Due to the inherent transitivity of CGN, training inevitably entails the acquisition of irrelevant semantic information. (2) Existing methodologies heavily depend on the dependency tree, neglecting to consider the contextual structure of the sentence. (3) Another limitation of the majority of methods is their failure to account for the interactions occurring between different aspects. In this study, we propose a Clause Graph Transformer Structure (CGT) to alleviate these limitations. Specifically, CGT comprises three modules. The preprocessing module extracts aspect clauses from each sentence by bi-directionally traversing the constituent tree, reducing reliance on syntax trees and extracting semantic information from the perspective of clauses. Additionally, we assert that a word’s vector direction signifies its underlying attitude in the semantic space, a feature often overlooked in recent research. Without the necessity for additional parameters, we introduce the Clause Attention encoder (CA-encoder) to the clause module to effectively capture the directed cross-correlation coefficient between the clause and the target aspect. To enhance the representation of the target component, we propose capturing the connections between various aspects. In the inter-aspect module, we intricately design a Balanced Attention encoder (BA-encoder) that forms an aspect sequence by navigating the extracted phrase tree. To effectively capture the emotion of implicit components, we introduce a Top-K Attention Graph Convolutional Network (KA-GCN). Our proposed method has showcased state-of-the-art (SOTA) performance through experiments conducted on four widely used datasets. Furthermore, our model demonstrates a significant improvement in the robustness of datasets subjected to disturbances.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102332"},"PeriodicalIF":2.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141328532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A design theory for data quality tools in data ecosystems: Findings from three industry cases","authors":"Marcel Altendeitering , Tobias Moritz Guggenberger , Frederik Möller","doi":"10.1016/j.datak.2024.102333","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102333","url":null,"abstract":"<div><p>Data ecosystems are a novel inter-organizational form of cooperation. They require at least one data provider and one or more data consumers. Existing research mainly addresses generativity mechanisms in this relationship, such as business models or role models for data ecosystems. However, an essential prerequisite for thriving data ecosystems is high data quality in the shared data. Without sufficient data quality, sharing data might lead to negative business consequences, given that the information drawn from them or services built on them might be incorrect or produce fraudulent results. We tackle this issue precisely since we report on a multi-case study deploying data quality tools in data ecosystem scenarios. From these cases, we derive generalized prescriptive design knowledge as a design theory to make the knowledge available for others designing data quality tools for data sharing. Subsequently, our study contributes to integrating the issue of data quality in data ecosystem research and provides practitioners with actionable guidelines inferred from three real-world cases.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102333"},"PeriodicalIF":2.5,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000570/pdfft?md5=c13245062cdefc052035d38866a21318&pid=1-s2.0-S0169023X24000570-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141324527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus","authors":"Liang-Ching Chen","doi":"10.1016/j.datak.2024.102322","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102322","url":null,"abstract":"<div><p>Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102322"},"PeriodicalIF":2.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic requirements construction using ontologies and boilerplates","authors":"Christina Antoniou, Kalliopi Kravari, Nick Bassiliades","doi":"10.1016/j.datak.2024.102323","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102323","url":null,"abstract":"<div><p>This paper presents a combination of an ontology and boilerplates, which are requirements templates for the syntactic structure of individual requirements that try to alleviate the problem of ambiguity caused using natural language and make it easier for inexperienced engineers to create requirements. However, still the use of boilerplates restricts the use of natural language only syntactically and not semantically. Boilerplates consists of fixed and attributes elements. Using ontologies, restricts the vocabulary of the words used in the requirements boilerplates to entities, their properties and entity relationships that are semantically meaningful to the application domain, leading thus to fewer errors. In this work we combine the advantages of boilerplates and ontologies. Usually, the attributes of boilerplates are completed with the help of the ontology. The contribution of this paper is that the whole boilerplates are stored in the ontology, based on the fact that RDF triples have similar syntax to the boilerplate syntax, so that attributes and fixed elements are part of the ontology. This combination helps to construct semantically and syntactically correct requirements. The contribution and novelty of our method is that we exploit the natural language syntax of boilerplates mapping them to Resource Description Framework triples which have also a linguistic nature. In this paper we created and present the development of a domain-specific ontology as well as a minimal set of boilerplates for a specific application domain, namely that of engineering software for an ATM, while maintaining flexibility on the one hand and generality on the other.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102323"},"PeriodicalIF":2.5,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141289760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large language models: Expectations for semantics-driven systems engineering","authors":"Robert Buchmann , Johann Eder , Hans-Georg Fill , Ulrich Frank , Dimitris Karagiannis , Emanuele Laurenzi , John Mylopoulos , Dimitris Plexousakis , Maribel Yasmina Santos","doi":"10.1016/j.datak.2024.102324","DOIUrl":"10.1016/j.datak.2024.102324","url":null,"abstract":"<div><p>The hype of Large Language Models manifests in disruptions, expectations or concerns in scientific communities that have focused for a long time on design-oriented research. The current experiences with Large Language Models and associated products (e.g. ChatGPT) lead to diverse positions regarding the foreseeable evolution of such products from the point of view of scholars who have been working with designed abstractions for most of their careers - typically relying on deterministic design decisions to ensure systems and automation reliability. Such expectations are collected in this paper in relation to a flavor of systems engineering that relies on explicit knowledge structures, introduced here as “semantics-driven systems engineering”.</p><p>The paper was motivated by the panel discussion that took place at CAiSE 2023 in Zaragoza, Spain, during the workshop on Knowledge Graphs for Semantics-driven Systems Engineering (KG4SDSE). The workshop brought together Conceptual Modeling researchers with an interest in specific applications of Knowledge Graphs and the semantic enrichment benefits they can bring to systems engineering. The panel context and consensus are summarized at the end of the paper, preceded by a proposed research agenda considering the expressed positions.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102324"},"PeriodicalIF":2.5,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141134203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hermes, a low-latency transactional storage for binary data streams from remote devices","authors":"Gabriele Scaffidi Militone, Daniele Apiletti, Giovanni Malnati","doi":"10.1016/j.datak.2024.102315","DOIUrl":"10.1016/j.datak.2024.102315","url":null,"abstract":"<div><p>In many contexts where data is streamed on a large scale, such as video surveillance systems, there is a dual requirement: secure data storage and continuous access to audio and video content by third parties, such as human operators or specific business logic, even while the media files are still being collected. However, using transactions to ensure data persistence often limits system throughput and latency. This paper presents a solution that enables both high ingestion rates with transactional data persistence and near real-time, low-latency access to the stream during collection. This immediate access enables the prompt application of specialized data engineering algorithms during data acquisition. The proposed solution is particularly suitable for binary data sources such as audio and video recordings in surveillance systems, and it can be extended to various big data scenarios via well-defined general interfaces. The scalability of the approach is based on the microservice architecture. Preliminary results obtained with Apache Kafka and MongoDB replica sets show that the proposed solution provides up to 3 times higher throughput and 2.2 times lower latency compared to standard multi-document transactions.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102315"},"PeriodicalIF":2.5,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141042236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing fuzzy semantics of reviews for multi-criteria recommendations","authors":"Navreen Kaur Boparai , Himanshu Aggarwal , Rinkle Rani","doi":"10.1016/j.datak.2024.102314","DOIUrl":"10.1016/j.datak.2024.102314","url":null,"abstract":"<div><p>Hotel reviews play a vital role in tourism recommender system. They should be analyzed effectively to enhance the accuracy of recommendations which can be generated either from crisp ratings on a fixed scale or real sentiments of reviews. But crisp ratings cannot represent the actual feelings of reviewers. Existing tourism recommender systems mostly recommend hotels on the basis of vague and sparse ratings resulting in inaccurate recommendations or preferences for online users. This paper presents a semantic approach to analyze the online reviews being crawled from tripadvisor.in. It discovers the underlying fuzzy semantics of reviews with respect to the multiple criteria of hotels rather than using the crisp ratings. The crawled reviews are preprocessed via data cleaning such as stopword and punctuation removal, tokenization, lemmatization, pos tagging to understand the semantics efficiently. Nouns representing frequent features of hotels are extracted from pre-processed reviews which are further used to identify opinion phrases. Fuzzy weights are derived from normalized frequency of frequent nouns and combined with sentiment score of all the synonyms of adjectives in the identified opinion phrases. This results in fuzzy semantics which form an ideal representation of reviews for a multi-criteria tourism recommender system. The proposed work is implemented in python by crawling the recent reviews of Jaipur hotels from TripAdvisor and analyzing their semantics. The resultant fuzzy semantics form a manually tagged dataset of reviews tagged with sentiments of identified aspects, respectively. Experimental results show improved sentiment score while considering all the synonyms of adjectives. The results are further used to fine-tune BERT models to form encodings for a query-based recommender system. The proposed approach can help tourism and hospitality service providers to take advantage of such sentiment analysis to examine the negative comments or unpleasant experiences of tourists and making appropriate improvements. Moreover, it will help online users to get better recommendations while planning their trips.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102314"},"PeriodicalIF":2.5,"publicationDate":"2024-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141034319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}