{"title":"An interactive approach to semantic enrichment with geospatial data","authors":"Flavio De Paoli , Michele Ciavotta , Roberto Avogadro , Emil Hristov , Milena Borukova , Dessislava Petrova-Antonova , Iva Krasteva","doi":"10.1016/j.datak.2024.102341","DOIUrl":"10.1016/j.datak.2024.102341","url":null,"abstract":"<div><p>The ubiquitous availability of datasets has spurred the utilization of Artificial Intelligence methods and models to extract valuable insights, unearth hidden patterns, and predict future trends. However, the current process of data collection and linking heavily relies on expert knowledge and domain-specific understanding, which engenders substantial costs in terms of both time and financial resources. Therefore, streamlining the data acquisition, harmonization, and enrichment procedures to deliver high-fidelity datasets readily usable for analytics is paramount. This paper explores the capabilities of <em>SemTUI</em>, a comprehensive framework designed to support the enrichment of tabular data by leveraging semantics and user interaction. Utilizing SemTUI, an iterative and interactive approach is proposed to enhance the flexibility, usability and efficiency of geospatial data enrichment. The approach is evaluated through a pilot case study focused on urban planning, with a particular emphasis on geocoding. Using a real-world scenario involving the analysis of kindergarten accessibility within walking distance, the study demonstrates the proficiency of SemTUI in generating precise and semantically enriched location data. 
The incorporation of human feedback in the enrichment process successfully enhances the quality of the resulting dataset, highlighting SemTUI’s potential for broader applications in geospatial analysis and its usability for users with limited expertise in manipulating geospatial data.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102341"},"PeriodicalIF":2.7,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X2400065X/pdfft?md5=969535621599adcaa2ec5e5d12e392b3&pid=1-s2.0-S0169023X2400065X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141698385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explanation, semantics, and ontology","authors":"Giancarlo Guizzardi , Nicola Guarino","doi":"10.1016/j.datak.2024.102325","DOIUrl":"10.1016/j.datak.2024.102325","url":null,"abstract":"<div><p>The terms ‘semantics’ and ‘ontology’ are increasingly appearing together with ‘explanation’, not only in the scientific literature, but also in everyday social interactions, in particular, within organizations. Ontologies have been shown to play a key role in supporting the semantic interoperability of data and knowledge representation structures used by information systems. With the proliferation of applications of Artificial Intelligence (AI) in different settings and the increasing need to guarantee their explainability (but also their interoperability) in critical contexts, the term ‘explanation’ has also become part of the scientific and technical jargon of modern information systems engineering. However, all of these terms are also significantly overloaded. In this paper, we address several interpretations of these notions, with an emphasis on their strong connection. Specifically, we discuss a notion of explanation termed <em>ontological unpacking</em>, which aims at explaining symbolic domain descriptions (e.g., conceptual models, knowledge graphs, logical specifications) by revealing their <em>ontological commitment</em> in terms of their so-called <em>truthmakers</em>, i.e., the entities in one’s ontology that are responsible for the truth of a description. To illustrate this methodology, we employ an ontological theory of relations to explain a symbolic model encoded in the <em>de facto</em> standard modeling language UML. We also discuss the essential role played by ontology-driven conceptual models (resulting from this form of explanation processes) in supporting semantic interoperability tasks. Furthermore, we revisit a proposal for quality criteria for explanations from philosophy of science to assess our approach. 
Finally, we discuss the relation between ontological unpacking and other forms of explanation in philosophy and science, as well as in the subarea of Artificial Intelligence known as Explainable AI (XAI).</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102325"},"PeriodicalIF":2.7,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000491/pdfft?md5=79cddbdaff8702c03d78a624d5f422a3&pid=1-s2.0-S0169023X24000491-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topological querying of music scores","authors":"Philippe Rigaux, Virginie Thion","doi":"10.1016/j.datak.2024.102340","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102340","url":null,"abstract":"<div><p>For centuries, <em>sheet music scores</em> have been the traditional way to preserve and disseminate Western music works. Nowadays, their content can be encoded in digital formats, making possible to store music score data in digital score libraries (DSL). To supply intelligent services (extracting and analysing relevant information from data), the new generation of DSL has to rely on digital representations of the score content as structured objects apt at being manipulated by high-level operators. In the present paper, we propose the <em>Muster</em> model, a graph-based data model for representing the music content of a digital score, and we discuss the querying of such data through graph pattern queries. We then present a proof-of-concept of this approach, which allows storing graph-based representations of music scores in the Neo4j database, and performing musical pattern searches through graph pattern queries with the Cypher query language. 
A benchmark study, using (real) datasets stemming from the <span>Neuma</span> Digital Score Library, complements this implementation.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102340"},"PeriodicalIF":2.7,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000648/pdfft?md5=422689552133a28488b6610063f13879&pid=1-s2.0-S0169023X24000648-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141484752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Entity type inference based on path walking and inter-types relationships","authors":"Yi Gan , Zhihui Su , Gaoyong Lu , Pengju Zhang , Aixiang Cui , Jiawei Jiang , Duanbing Chen","doi":"10.1016/j.datak.2024.102337","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102337","url":null,"abstract":"<div><p>As a crucial task for knowledge graphs (KGs), knowledge graph entity type inference (KGET) has garnered increasing attention in recent years. However, recent methods overlook the long-distance information pertaining to entities and the inter-types relationships. The neglect of long-distance information results in the omission of crucial entity relationships and neighbors, consequently leading to the loss of path information associated with missing types. To address this, a path-walking strategy is utilized to identify two-hop triplet paths of the crucial entity for encoding long-distance entity information. Moreover, the absence of inter-types relationships can lead to the loss of the neighborhood information of types, such as co-occurrence information. To ensure a comprehensive understanding of inter-types relationships, we consider interactions not only with the types of single entity but also with different types of entities. Finally, in order to comprehensively represent entities for missing types, considering both the dimensions of path information and neighborhood information, we propose an entity type inference model based on path walking and inter-types relationships, denoted as “ET-PT”. This model effectively extracts comprehensive entity information, thereby obtaining the most complete semantic representation of entities. 
The experimental results on publicly available datasets demonstrate that the proposed method outperforms state-of-the-art approaches.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102337"},"PeriodicalIF":2.7,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000612/pdfft?md5=3856b1f399f41f93c93401f8aea9503b&pid=1-s2.0-S0169023X24000612-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141484693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An explainable machine learning approach for automated medical decision support of heart disease","authors":"Francisco Mesquita, Gonçalo Marques","doi":"10.1016/j.datak.2024.102339","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102339","url":null,"abstract":"<div><p>Coronary Heart Disease (CHD) is the dominant cause of mortality around the world. Every year, it causes about 3.9 million deaths in Europe and 1.8 million in the European Union (EU). It is responsible for 45 % and 37 % of all deaths in Europe and the European Union, respectively. Using machine learning (ML) to predict heart diseases is one of the most promising research topics, as it can improve healthcare and consequently increase the longevity of people's lives. However, although the ability to interpret the results of the predictive model is essential, most of the related studies do not propose explainable methods. To address this problem, this paper presents a classification method that not only exhibits reliable performance but is also interpretable, ensuring transparency in its decision-making process. SHapley Additive exPlanations, known as the SHAP method was chosen for model interpretability. This approach presents a comparison between different classifiers and parameter tuning techniques, providing all the details necessary to replicate the experiment and help future researchers working in the field. 
The proposed model achieves similar performance to those proposed in the literature, and its predictions are fully interpretable.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102339"},"PeriodicalIF":2.7,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000636/pdfft?md5=9bdfa8117c5ce50d0508986a80981671&pid=1-s2.0-S0169023X24000636-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141592969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive methodology to construct standardised datasets for Science and Technology Parks","authors":"Olga Francés, Javi Fernández, José Abreu-Salas, Yoan Gutiérrez, Manuel Palomar","doi":"10.1016/j.datak.2024.102338","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102338","url":null,"abstract":"<div><p>This work presents a standardised approach to create datasets for Science and Technology Parks (STPs), facilitating future analysis of STP characteristics, trends and performance. STPs are the most representative examples of innovation ecosystems. The ETL (extraction-transformation-load) structure was adapted to a global field study of STPs. A selection stage and quality check were incorporated, and the methodology was applied to Spanish STPs. This study applies diverse techniques such as expert labelling and information extraction which uses language technologies. A novel methodology for building quality and standardised STP datasets was designed and applied to a Spanish STP case study with 49 STPs. An updatable dataset and a list of the main features impacting STPs are presented. Twenty-one (<em>n</em> = 21) core features were refined and selected, with fifteen of them (71.4 %) being robust enough for developing further quality analysis. The methodology presented integrates different sources with heterogeneous information that is often decentralised, disaggregated and in different formats: excel files, and unstructured information in HTML or PDF format. 
The existence of this updatable dataset and the defined methodology will enable powerful AI tools to be applied that focus on more sophisticated analysis, such as taxonomy, monitoring, and predictive and prescriptive analytics in the innovation ecosystems field.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102338"},"PeriodicalIF":2.7,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141542542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Providing healthcare shopping advice through knowledge-based virtual agents","authors":"Claire Deventer, Pietro Zidda","doi":"10.1016/j.datak.2024.102336","DOIUrl":"10.1016/j.datak.2024.102336","url":null,"abstract":"<div><p>Knowledge-based virtual shopping agents, that advise their users about which products to buy, are well used in technical markets such as healthcare e-commerce. To ensure the proper adoption of this technology, it is important to consider aspects of users’ psychology early in the software design process. When traditional adoption models such as UTAUT-2 work well for many technologies, they overlook important specificities of the healthcare e-commerce domain and of knowledge-based virtual agents technology. Drawing upon health information technology and virtual agent literature, we propose a complementary adoption model incorporating new predictors and moderators reflecting these domains’ specificities. The model is tested using 903 observations gathered through an online survey conducted in collaboration with a major actor in the herbal medicine market. Our model can serve as a basis for many phases of the knowledge-based agents software development. We propose actionable recommendations for practitioners and ideas for further research.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102336"},"PeriodicalIF":2.7,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CGT: A Clause Graph Transformer Structure for aspect-based sentiment analysis","authors":"Zelong Su , Bin Gao , Xiaoou Pan , Zhengjun Liu , Yu Ji , Shutian Liu","doi":"10.1016/j.datak.2024.102332","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102332","url":null,"abstract":"<div><p>In the realm of natural language processing (NLP), aspect-based sentiment analysis plays a pivotal role. Recently, there has been a growing emphasis on techniques leveraging Graph Convolutional Neural Network (GCN). However, there are several challenges associated with current approaches: (1) Due to the inherent transitivity of CGN, training inevitably entails the acquisition of irrelevant semantic information. (2) Existing methodologies heavily depend on the dependency tree, neglecting to consider the contextual structure of the sentence. (3) Another limitation of the majority of methods is their failure to account for the interactions occurring between different aspects. In this study, we propose a Clause Graph Transformer Structure (CGT) to alleviate these limitations. Specifically, CGT comprises three modules. The preprocessing module extracts aspect clauses from each sentence by bi-directionally traversing the constituent tree, reducing reliance on syntax trees and extracting semantic information from the perspective of clauses. Additionally, we assert that a word’s vector direction signifies its underlying attitude in the semantic space, a feature often overlooked in recent research. Without the necessity for additional parameters, we introduce the Clause Attention encoder (CA-encoder) to the clause module to effectively capture the directed cross-correlation coefficient between the clause and the target aspect. To enhance the representation of the target component, we propose capturing the connections between various aspects. 
In the inter-aspect module, we intricately design a Balanced Attention encoder (BA-encoder) that forms an aspect sequence by navigating the extracted phrase tree. To effectively capture the emotion of implicit components, we introduce a Top-K Attention Graph Convolutional Network (KA-GCN). Our proposed method has showcased state-of-the-art (SOTA) performance through experiments conducted on four widely used datasets. Furthermore, our model demonstrates a significant improvement in the robustness of datasets subjected to disturbances.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102332"},"PeriodicalIF":2.5,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141328532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A design theory for data quality tools in data ecosystems: Findings from three industry cases","authors":"Marcel Altendeitering , Tobias Moritz Guggenberger , Frederik Möller","doi":"10.1016/j.datak.2024.102333","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102333","url":null,"abstract":"<div><p>Data ecosystems are a novel inter-organizational form of cooperation. They require at least one data provider and one or more data consumers. Existing research mainly addresses generativity mechanisms in this relationship, such as business models or role models for data ecosystems. However, an essential prerequisite for thriving data ecosystems is high data quality in the shared data. Without sufficient data quality, sharing data might lead to negative business consequences, given that the information drawn from them or services built on them might be incorrect or produce fraudulent results. We tackle this issue precisely since we report on a multi-case study deploying data quality tools in data ecosystem scenarios. From these cases, we derive generalized prescriptive design knowledge as a design theory to make the knowledge available for others designing data quality tools for data sharing. 
Subsequently, our study contributes to integrating the issue of data quality in data ecosystem research and provides practitioners with actionable guidelines inferred from three real-world cases.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102333"},"PeriodicalIF":2.5,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000570/pdfft?md5=c13245062cdefc052035d38866a21318&pid=1-s2.0-S0169023X24000570-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141324527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus","authors":"Liang-Ching Chen","doi":"10.1016/j.datak.2024.102322","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102322","url":null,"abstract":"<div><p>Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. 
The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102322"},"PeriodicalIF":2.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141285881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}