{"title":"What is the business value of your data? A multi-perspective empirical study on monetary valuation factors and methods for data governance","authors":"Frank Bodendorf, Jörg Franke","doi":"10.1016/j.datak.2023.102242","DOIUrl":"10.1016/j.datak.2023.102242","url":null,"abstract":"<div><p><span>Digitalization has greatly increased the importance of data in recent years, making data an indispensable resource for value creation in our time. There is currently still a lack of theories as well as practicable methods and techniques for the monetary valuation of data, and data is therefore not yet sufficiently managed in terms of business management principles. In this context, this research is intended to design theory ingrained principles for a multidimensional conceptual approach to the monetary valuation of data as assets. We draw on the theory of dynamic capabilities as a further development of resource theory as well as value theory. To this end, the research conducts a qualitative field study followed by a quantitative survey study. Literature analysis is used to explain different dimensions in the qualitative field study. Structural equation modeling is used to analyze empirical data collected in the quantitative study. The results show that data value determination is a multidimensional and hierarchical construct consisting of three primary dimensions. These are the benefit-oriented, cost-oriented, and quality-oriented dimensions. 
The results also confirm that institutional pressures (coercive, normative, mimetic) that influence </span>organizational behaviors lead to a greater intention for organizations to adapt a monetary data value determination.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135714069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A framework for approximate product search using faceted navigation and user preference ranking","authors":"Damir Vandic , Lennart J. Nederstigt , Flavius Frasincar , Uzay Kaymak , Enzo Ido","doi":"10.1016/j.datak.2023.102241","DOIUrl":"10.1016/j.datak.2023.102241","url":null,"abstract":"<div><p>One of the problems that e-commerce users face is that the desired products are sometimes not available and Web shops fail to provide similar products due to their exclusive reliance on Boolean faceted search. User preferences are also often not taken into account. In order to address these problems, we present a novel framework specifically geared towards approximate faceted search within the product catalog of a Web shop. It is based on adaptations to the p-norm extended Boolean model, to account for the domain-specific characteristics of faceted search in an e-commerce environment. These e-commerce specific characteristics are, for example, the use of quantitative properties and the presence of user preferences. Our approach explores the concept of facet similarity functions in order to better match products to queries. In addition, the user preferences are used to assign importance weights to the query terms. Using a large-scale experimental setup based on real-world data, we conclude that the proposed algorithm outperforms the considered benchmark algorithms. 
Last, we have performed a user-based study in which we found that users who use our approach find more relevant products with less effort.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135664518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering and evaluating organizational knowledge from textual data: Application to crisis management","authors":"Dhouha Grissa, Eric Andonoff, Chihab Hanachi","doi":"10.1016/j.datak.2023.102237","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102237","url":null,"abstract":"<div><p><span><span>Crisis management effectiveness relies mainly on the quality of the distributed human organization deployed for saving lives, limiting damage and reducing risks. Organizations set up in this context are not always predefined and static; they could evolve and new forms could emerge since actors, such as volunteers or NGO, could join dynamically to collaborate. To improve crisis resolution effectiveness, it is first important to understand, analyze and evaluate such dynamic organizations in order to adjust crisis management plans and ease coordination among actors. Giving a textual experience feedback from past crisis, the objective of this paper is to discover the organizational structure deployed in the considered crisis and then evaluate it according to a set of criteria. For that purpose, we combine in a coherent framework text and </span>association rule mining for pattern discovery and annotation, and multi-agent system models and techniques for formally building and evaluating organizational structures. We present the </span><em>OSminer</em> algorithm that discovers association rules based on relevant textual patterns and then builds an organizational structure including three main relations between actors: power, control and coordination. A real-life case study, a flood crisis hitting the south west of France, serves as a basis for testing/experimenting our solution. The organizational structure, discovered in this case study, has 24 actors. Its evaluation indicates its efficiency, but shows that it is neither robust nor flexible. 
Our findings highlight the potential of our approach to discover and evaluate organizational structures from a text recording interactions between stakeholders in a crisis context.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92046670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"User-generated short-text classification using cograph editing-based network clustering with an application in invoice categorization","authors":"Dewan F. Wahid , Elkafi Hassini","doi":"10.1016/j.datak.2023.102238","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102238","url":null,"abstract":"<div><p>Rapid adaptation of online business platforms in every sector creates an enormous amount of user-generated textual data related to providing product or service descriptions, reviewing, marketing, invoicing and bookkeeping. These data are often short in size, noisy (e.g., misspellings, abbreviations), and do not have accurate classifying labels (line-item categories). Classifying these user-generated short-text data with appropriate line-item categories is crucial for corresponding platforms to understand users’ needs. This paper proposed a framework for user-generated short-text classification based on identified line-item categories. In the line-item identification phase, we used cograph editing (CoE)-based clustering on keywords network, which can be formulated from users’ generated short-texts. We also proposed integer linear programming (ILP) formulations for CoE on weighted networks and designed a heuristic algorithm to identify clusters in large-scale networks. Finally, we outlined an application of this framework to categorize invoices in an empirical setting. 
Our framework showed promising results in identifying invoice line-item categories for large-scale data.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92136240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Challenges of a Data Ecosystem for scientific data","authors":"Edoardo Ramalli, Barbara Pernici","doi":"10.1016/j.datak.2023.102236","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102236","url":null,"abstract":"<div><p>Data Ecosystems (DE) are used across various fields and applications. They facilitate collaboration between organizations, such as companies or research institutions, enabling them to share data and services. A DE can boost research outcomes by managing and extracting value from the increasing volume of generated and shared data in the last decades. However, the adoption of DE solutions for scientific data by R&D departments and scientific communities is still difficult. Scientific data are challenging to manage, and, as a result, a considerable part of this information still needs to be annotated and organized in order to be shared. This work discusses the challenges of employing DE in scientific domains and the corresponding potential mitigations. First, scientific data and their typologies are contextualized, then their unique characteristics are discussed. Typical properties regarding their high heterogeneity and uncertainty make assessing their consistency and accuracy problematic. In addition, this work discusses the specific requirements expressed by the scientific communities when it comes to integrating a DE solution into their workflow. The unique properties of scientific data and domain-specific requirements create a challenging setting for adopting DEs. The challenges are expressed as general research questions, and this work explores the corresponding solutions in terms of data management aspects. 
Finally, the paper presents a real-world scenario with more technical details.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23000964/pdfft?md5=98e3f9c9e5690c131b72c032eddd9253&pid=1-s2.0-S0169023X23000964-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92046668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards comparable ratings: Exploring bias in German physician reviews","authors":"Joschka Kersting, Falk Maoro, Michaela Geierhos","doi":"10.1016/j.datak.2023.102235","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102235","url":null,"abstract":"<div><p>In this study, we evaluate the impact of gender-biased data from German-language physician reviews on the fairness of fine-tuned language models. For two different downstream tasks, we use data reported to be gender biased and aggregate it with annotations. First, we propose a new approach to aspect-based sentiment analysis that allows identifying, extracting, and classifying implicit and explicit aspect phrases and their polarity within a single model. The second task we present is grade prediction, where we predict the overall grade of a review on the basis of the review text. For both tasks, we train numerous transformer models and evaluate their performance. The aggregation of sensitive attributes, such as a physician’s gender and migration background, with individual text reviews allows us to measure the performance of the models with respect to these sensitive groups. These group-wise performance measures act as extrinsic bias measures for our downstream tasks. In addition, we translate several gender-specific templates of the intrinsic bias metrics into the German language and evaluate our fine-tuned models. Based on this set of tasks, fine-tuned models, and intrinsic and extrinsic bias measures, we perform correlation analyses between intrinsic and extrinsic bias measures. In terms of sensitive groups and effect sizes, our bias measure results show different directions. Furthermore, correlations between measures of intrinsic and extrinsic bias can be observed in different directions. This leads us to conclude that gender-biased data does not inherently lead to biased models. 
Other variables, such as template dependency for intrinsic measures and label distribution in the data, must be taken into account as they strongly influence the metric results. Therefore, we suggest that metrics and templates should be chosen according to the given task and the biases to be assessed.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23000952/pdfft?md5=035f0e2eec55531089e125433a25b2bc&pid=1-s2.0-S0169023X23000952-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92046250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient and scalable SPARQL query processing framework for big data using MapReduce and hybrid optimum load balancing","authors":"V. Naveen Kumar , Ashok Kumar P.S.","doi":"10.1016/j.datak.2023.102239","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102239","url":null,"abstract":"<div><p>The increasing RDF (Resource Description Framework) data volume requires a Hadoop<span> platform for processing queries over large datasets. In this work, SPARQL (SPARQL Protocol and RDF Query Language) queries are evaluated with Hadoop based on the objective of minimizing the number of joins through data partitioning for performing map/reduce jobs. The query evaluation time and the number of cross node joins are minimized with the proposed partitioning techniques. Extended vertical partitioning is proposed for distributed data stores based on objects’ explicit information for splitting predicates. For accessing the RDF data, hybrid monarch butterfly with beetle swarm load balancing optimization with Map-reduce (Hybrid Optimum Load Balancing) is applied. The proposed SPARQL query processing is evaluated over large RDF datasets. The proposed approach’s evaluation results are analyzed with the existing approaches, indicating the proposed framework’s efficiency. By using the proposed approach, an accuracy of 97 % is obtained.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92067826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical framework for interpretable and specialized deep reinforcement learning-based predictive maintenance","authors":"Ammar N. Abbas , Georgios C. Chasparis , John D. Kelleher","doi":"10.1016/j.datak.2023.102240","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102240","url":null,"abstract":"<div><p>Deep reinforcement learning holds significant potential for application in industrial decision-making, offering a promising alternative to traditional physical models. However, its black-box learning approach presents challenges for real-world and safety-critical systems, as it lacks interpretability and explanations for the derived actions. Moreover, a key research question in deep reinforcement learning is how to focus policy learning on critical decisions within sparse domains. This paper introduces a novel approach that combines probabilistic modeling and reinforcement learning, providing interpretability and addressing these challenges in the context of safety-critical predictive maintenance. The methodology is activated in specific situations identified through the input–output hidden Markov model, such as critical conditions or near-failure scenarios. To mitigate the challenges associated with deep reinforcement learning in safety-critical predictive maintenance, the approach is initialized with a baseline policy using behavioral cloning, requiring minimal interactions with the environment. The effectiveness of this framework is demonstrated through a case study on predictive maintenance for turbofan engines, outperforming previous approaches and baselines, while also providing the added benefit of interpretability. 
Importantly, while the framework is applied to a specific use case, this paper aims to present a general methodology that can be applied to diverse predictive maintenance applications.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92059899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SkiQL: A unified schema query language","authors":"Carlos J. Fernández Candel, Jesús J. García-Molina, Diego Sevilla Ruiz","doi":"10.1016/j.datak.2023.102234","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102234","url":null,"abstract":"<div><p>Most NoSQL systems are schema-on-read: data can be stored without first having to declare a schema that imposes a structure. This schemaless feature offers flexibility to evolve data-intensive applications when data change frequently. However, freeing from declaring schemas does not mean their absence, but rather that they are implicit in data and code. Therefore, diagramming tools similar to those available for relational systems are also needed to help developers and administrators to design and to understand NoSQL schemas.</p><p>Visualizing diagrams is not practical if schemas contain hundreds of database entities, so exploration or query facilities are then needed. In schemaless NoSQL stores, data of the same entity can be stored with different structure (e.g., non-uniform types and optional fields), which can increase the difficulty of having readable diagrams.</p><p>NoSQL schema management tools should therefore have three main components: schema extraction, schema visualization, and schema query. As there are four main NoSQL data models, it is convenient for such tools to be built on a generic data model so that they provide platform-independence (of data models and data stores) to query and visualize schemas. With the aim of favoring the creation of generic database tools, the authors of this paper defined the U-Schema unified data model that integrates the four main NoSQL data models as well as the relational model.</p><p>This paper is focused on querying NoSQL and relational schemas which are represented as U-Schema models. We present the SkiQL language designed on U-Schema to achieve a platform-independent schema query service. SkiQL provides two constructs: schema-query and relationship-query. 
The former allows to obtain information of entity or relationship types, and the latter that of the aggregations or references (relations among types). We will show how SkiQL was evaluated by calculating well-known metrics for languages as well as using a survey with developers with experience in NoSQL.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49749816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective healthcare service recommendation with network representation learning: A recursive neural network approach","authors":"Mouhamed Gaith Ayadi , Haithem Mezni , Rana Alnashwan , Hela Elmannai","doi":"10.1016/j.datak.2023.102233","DOIUrl":"https://doi.org/10.1016/j.datak.2023.102233","url":null,"abstract":"<div><p>Recently, recommender systems have been combined with healthcare systems to recommend needed healthcare items for both patients and medical staff. By monitoring the patients’ states, healthcare services and their consumed smart medical objects can be recommended to a medical team according to the patient’s critical situation and requirements. However, a common drawback of the few existing solutions lies in the limited modeling of the healthcare information network. In addition, current solutions do not consider the typed nature of healthcare items. Moreover, existing healthcare recommender systems lack flexibility, and none of them offers re-configurable healthcare workflows to medical staff. In this paper, we take advantage of collaborative filtering and representation learning principles, by proposing a method for the recommendation of healthcare services. These latter follow a predefined execution pattern, i.e. treatment/medication workflow, that is determined by our framework depending on the patient’s state. To achieve this goal, we model the healthcare information network as a <em>knowledge graph</em>. This latter, based on an <em>incremental learning</em> method, is then transformed into a cuboid space to facilitate its processing. That is by learning latent representations of its content (e.g., smart objects, healthcare services, patients symptoms, etc.). Finally, a <em>collaborative recommendation</em> method is defined to select the high-quality healthcare services that will be composed and executed according to a determined workflow model. 
Experimental results have proven the efficiency of our solution in terms of recommended services’ quality.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":null,"pages":null},"PeriodicalIF":2.5,"publicationDate":"2023-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49749644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}