{"title":"Improving the identification of relevant variants in genome information systems: A methodological approach with a case study on early onset Alzheimer's disease","authors":"Mireia Costa, Ana León, Óscar Pastor","doi":"10.1016/j.datak.2024.102284","DOIUrl":"https://doi.org/10.1016/j.datak.2024.102284","url":null,"abstract":"<div><p>Alzheimer's disease is the most common type of dementia in the elderly. Nevertheless, there is an early onset form that is difficult to diagnose precisely. As the genetic component is the most critical factor in developing this disease, identifying relevant genetic variants is key to obtaining a more reliable and straightforward diagnosis. The information about these variants is stored in an extensive number of data sources, which must be carefully analyzed to select only the information with sufficient quality to be used in a clinical setting. This selection has become complex due to the increasing volume of available genomic information. The SILE method was designed to systematize identifying relevant variants for a disease in this challenging context. However, several problems in how SILE identifies relevant variants were discovered when applying the method to the early onset form of Alzheimer's disease. More specifically, the method failed to address specific features of this disease such as its low incidence and familial component. This paper proposes an improvement of the identification process defined by the SILE method to make it applicable to a broader spectrum of diseases. Details of how the proposed solution has been applied are also reported. As a result of this improvement, a set of 29 variants has been identified (25 variants Accepted with Limited Evidence and 4 Accepted with Moderate Evidence). 
This constitutes a valuable result that facilitates and reinforces the genetic diagnosis of the disease.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"151 ","pages":"Article 102284"},"PeriodicalIF":2.5,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000089/pdfft?md5=571739f0b90877da191a9d94a852f178&pid=1-s2.0-S0169023X24000089-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139738034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fuzzy-Ontology based knowledge driven disease risk level prediction with optimization assisted ensemble classifier","authors":"Huma Parveen , Syed Wajahat Abbas Rizvi , Raja Sarath Kumar Boddu","doi":"10.1016/j.datak.2024.102278","DOIUrl":"10.1016/j.datak.2024.102278","url":null,"abstract":"<div><p>Modern medicinal analysis is a complex procedure, requiring precise patient data, scientific knowledge obtained over numerous years and a theoretical understanding of related medical literature. To improve the accuracy and to reduce the time for diagnosis, clinical decision support systems (DSS) were introduced, which incorporate data mining schemes to enhance diagnostic accuracy. This work proposes a new disease-predicting model that involves 3 stages. Initially, “improved stemming and tokenization” are carried out in the pre-processing stage. Then, the “Fuzzy ontology, improved mutual information (MI), and correlation features” are extracted. Then, prediction is carried out via ensemble classifiers that include “improved Fuzzy logic, Long Short Term Memory (LSTM), Deep Convolution Neural Network (DCNN), and Bidirectional Gated Recurrent Unit (Bi-GRU)”. The outcomes from improved fuzzy logic, LSTM, and DCNN are further classified via Bi-GRU, which produces the final results. Specifically, Bi-GRU weights are optimally tuned using Deer Hunting Update Explored Arithmetic Optimization (DHUEAO). 
Finally, the efficiency of the proposed work is determined concerning a variety of metrics.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"151 ","pages":"Article 102278"},"PeriodicalIF":2.5,"publicationDate":"2024-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139677918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fusion learning of preference and bias from ratings and reviews for item recommendation","authors":"Junrui Liu , Tong Li , Zhen Yang , Di Wu , Huan Liu","doi":"10.1016/j.datak.2024.102283","DOIUrl":"10.1016/j.datak.2024.102283","url":null,"abstract":"<div><p>Recommendation methods improve rating prediction performance by learning the selection bias phenomenon: users tend to rate items they like. These methods model selection bias by calculating the propensities of ratings, but inaccurate propensities can introduce more noise, fail to model selection bias, and reduce prediction performance. We argue that learning interaction features can effectively model selection bias and improve model performance, as interaction features explain the reason for the trend. Reviews can be used to model interaction features because they have a strong intrinsic correlation with user interests and item interactions. In this study, we propose a preference- and bias-oriented fusion learning model (PBFL) that models the interaction features based on reviews and user preferences to make rating predictions. Our proposal both embeds traditional user preferences in reviews, interactions, and ratings and considers word distribution bias and review quoting to model interaction features. Six real-world datasets are used to demonstrate effectiveness and performance. 
PBFL achieves an average improvement of 4.46% in root-mean-square error (RMSE) and 3.86% in mean absolute error (MAE) over the best baseline.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102283"},"PeriodicalIF":2.5,"publicationDate":"2024-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139677949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multivariate hierarchical DBSCAN model for enhanced maritime data analytics","authors":"Nitin Newaliya, Yudhvir Singh","doi":"10.1016/j.datak.2024.102282","DOIUrl":"10.1016/j.datak.2024.102282","url":null,"abstract":"<div><p>Clustering is an important data analytics technique and has numerous use cases. It leads to the determination of insights and knowledge which would not be readily discernible on routine examination of the data. Enhancement of clustering techniques is an active field of research, with various optimisation models being proposed. Such enhancements are also undertaken to address particular issues being faced in specific applications. This paper looks at a particular use case in the maritime domain and how an enhancement of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering results in the apt use of data analytics to solve a real-life issue. Passage of vessels over water is one of the significant utilisations of maritime regions. Trajectory analysis of these vessels helps provide valuable information, thus, maritime movement data and the knowledge extracted from manipulation of this data play an essential role in various applications, viz., assessing traffic densities, identifying traffic routes, reducing collision risks, etc. Optimised trajectory information would help enable safe and energy-efficient green operations at sea and assist autonomous operations of maritime systems and vehicles. Many studies focus on determining trajectory densities but miss out on individual trajectory granularities. Determining trajectories by using unique identities of the vessels may also lead to errors. Using an unsupervised DBSCAN method of identifying trajectories could help overcome these limitations. Further, to enhance outcomes and insights, the inclusion of temporal information along with additional parameters of Automatic Identification System (AIS) data in DBSCAN is proposed. 
Towards this, a new design and implementation for data analytics called the Multivariate Hierarchical DBSCAN method for better clustering of Maritime movement data, such as AIS, has been developed, which helps determine granular information and individual trajectories in an unsupervised manner. It is seen from the evaluation metrics that the performance of this method is better than other data clustering techniques.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102282"},"PeriodicalIF":2.5,"publicationDate":"2024-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139667962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AI system architecture design methodology based on IMO (Input-AI Model-Output) structure for successful AI adoption in organizations","authors":"Seungkyu Park , Joong yoon Lee , Jooyeoun Lee","doi":"10.1016/j.datak.2023.102264","DOIUrl":"10.1016/j.datak.2023.102264","url":null,"abstract":"<div><p>With the advancement of AI technology, the successful AI adoption in organizations has become a top priority in modern society. However, many organizations still struggle to articulate the necessary AI, and AI experts have difficulties understanding the problems faced by these organizations. This knowledge gap makes it difficult for organizations to identify the technical requirements, such as necessary data and algorithms, for adopting AI. To overcome this problem, we propose a new AI system architecture design methodology based on the IMO (Input-AI Model-Output) structure. The IMO structure enables effective identification of the technical requirements necessary to develop real AI models. While previous research has identified the importance and challenges of technical requirements, such as data and AI algorithms, for AI adoption, there has been little research on methodology to concretize them. Our methodology is composed of three stages: problem definition, system AI solution, and AI technical solution to design the AI technology and requirements that organizations need at a system level. 
The effectiveness of our methodology is demonstrated through a case study, logical comparative analysis with other studies, and expert reviews, which show that our methodology can support successful AI adoption in organizations.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102264"},"PeriodicalIF":2.5,"publicationDate":"2024-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23001246/pdfft?md5=e0d3a91ff85a9662d7d0a2bed8c5acfd&pid=1-s2.0-S0169023X23001246-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139588883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new sentence embedding framework for the education and professional training domain with application to hierarchical multi-label text classification","authors":"Guillaume Lefebvre , Haytham Elghazel , Theodore Guillet , Alexandre Aussem , Matthieu Sonnati","doi":"10.1016/j.datak.2024.102281","DOIUrl":"10.1016/j.datak.2024.102281","url":null,"abstract":"<div><p><span>In recent years, Natural Language Processing<span> (NLP) has made significant advances through advanced general language embeddings, allowing breakthroughs in NLP tasks such as semantic similarity and text classification<span>. However, complexity increases with hierarchical multi-label classification (HMC), where a single entity can belong to several hierarchically organized classes. In such complex situations, applied on specific-domain texts, such as the Education and professional training domain, general language embedding models often inadequately represent the unique terminologies and contextual nuances of a specialized domain. To tackle this problem, we present HMCCCProbT, a novel hierarchical multi-label text classification approach. This innovative framework chains multiple classifiers<span>, where each individual classifier is built using a novel sentence-embedding method BERTEPro based on existing Transformer models, whose pre-training has been extended on education and professional training texts, before being fine-tuned on several NLP tasks. Each individual classifier is responsible for the predictions of a given hierarchical level and propagates local probability predictions augmented with the input feature vectors to the classifier in charge of the subsequent level. HMCCCProbT tackles issues of model scalability and semantic interpretation, offering a powerful solution to the challenges of domain-specific hierarchical multi-label classification. 
Experiments over three domain-specific textual HMC datasets indicate the effectiveness of </span></span></span></span><span>HMCCCProbT</span><span> to compare favorably to state-of-the-art HMC algorithms<span> in terms of classification accuracy and also the ability of </span></span><span>BERTEPro</span> to obtain better probability predictions, well suited to <span>HMCCCProbT</span><span>, than three other vector representation techniques.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102281"},"PeriodicalIF":2.5,"publicationDate":"2024-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139500576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Issues in inter-organizational data sharing: Findings from practice and research challenges","authors":"Ilka Jussen , Frederik Möller , Julia Schweihoff , Anna Gieß , Giulia Giussani , Boris Otto","doi":"10.1016/j.datak.2024.102280","DOIUrl":"10.1016/j.datak.2024.102280","url":null,"abstract":"<div><p>Sharing data is highly potent in assisting companies in internal optimization and designing new products and services. While the benefits seem obvious, sharing data is accompanied by a spectrum of concerns ranging from fears of sharing something of value, unawareness of what will happen to the data, or simply a lack of understanding of the short- and mid-term benefits. The article analyzes data sharing in inter-organizational relationships by examining 13 cases in a qualitative interview study and through public data analysis. Given the importance of inter-organizational data sharing as indicated by large research initiatives such as Gaia-X and Catena-X, we explore issues arising in this process and formulate research challenges. We use the theoretical lens of Actor-Network Theory to analyze our data and entangle its constructs with concepts in data sharing.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102280"},"PeriodicalIF":2.5,"publicationDate":"2024-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000041/pdfft?md5=8cca34784bb0ed03de222b7dc6fbfc47&pid=1-s2.0-S0169023X24000041-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139412627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A deep learning model for predicting the number of stores and average sales in commercial district","authors":"Suan Lee , Sangkeun Ko , Arousha Haghighian Roudsari , Wookey Lee","doi":"10.1016/j.datak.2024.102277","DOIUrl":"10.1016/j.datak.2024.102277","url":null,"abstract":"<div><p>This paper presents a plan for preparing for changes in the business environment by analyzing and predicting business district data in Seoul. The COVID-19 pandemic and economic crisis caused by inflation have led to an increase in store closures and a decrease in sales, which has had a significant impact on commercial districts. The number of stores and sales are critical factors that directly affect the business environment and can help prepare for changes. This study conducted correlation analysis to extract factors related to the commercial district’s environment in Seoul and estimated the number of stores and sales based on these factors. Using the Kendall tau correlation coefficient, the study found that existing population and working population were the most influential factors. Linear regression, tensor decomposition, Factorization Machine, and deep neural network models were used to estimate the number of stores and sales, with the deep neural network model showing the best performance in RMSE and evaluation indicators. This study also predicted the number of stores and sales of the service industry in a specific area using the population prediction results of the neural prophet model. 
The study’s findings can help identify commercial district information and predict the number of stores and sales based on location, industry, and influencing factors, contributing to the revitalization of commercial districts.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102277"},"PeriodicalIF":2.5,"publicationDate":"2024-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000016/pdfft?md5=399d90f81e8f5fbe38aeaa5e86a26560&pid=1-s2.0-S0169023X24000016-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139095414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A transformer-based neural network framework for full names prediction with abbreviations and contexts","authors":"Ziming Ye , Shuangyin Li","doi":"10.1016/j.datak.2023.102275","DOIUrl":"10.1016/j.datak.2023.102275","url":null,"abstract":"<div><p>With the rapid spread of information, abbreviations are used more and more commonly because they are convenient. However, the duplication of abbreviations can lead to confusion in many cases, such as information management and information retrieval. The resultant confusion annoys users. Thus, inferring a full name from an abbreviation has practical and significant advantages. The bulk of studies in the literature mainly inferred full names based on rule-based methods, statistical models, the similarity of representation, etc. However, these methods are unable to use various grained contexts properly. In this paper, we propose a flexible framework of Multi-attention mask Abbreviation Context and Full name language model<span>, named MACF, to address the problem. With the abbreviation and contexts as the inputs, the MACF can automatically predict a full name by generation, where the contexts can be variously grained. That is, different grained contexts ranging from coarse to fine can be selected to perform such complicated tasks in which contexts include paragraphs, several sentences, or even just a few keywords. A novel multi-attention mask mechanism is also proposed, which allows the model to learn the relationships among abbreviations, contexts, and full names, a process that makes the most of various grained contexts. Three corpora of different languages and fields were analyzed and measured with seven metrics in various aspects to evaluate the proposed framework. According to the experimental results, the MACF yielded more significant and consistent outputs than other baseline methods. 
Moreover, we discuss the significance and findings, and give the case studies to show the performance in real applications.</span></p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"150 ","pages":"Article 102275"},"PeriodicalIF":2.5,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139069387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}