Yongkang Xiao , Sinian Zhang , Huixue Zhou , Mingchen Li , Han Yang , Rui Zhang
{"title":"FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs","authors":"Yongkang Xiao , Sinian Zhang , Huixue Zhou , Mingchen Li , Han Yang , Rui Zhang","doi":"10.1016/j.jbi.2024.104730","DOIUrl":"10.1016/j.jbi.2024.104730","url":null,"abstract":"<div><h3>Objective</h3><div>To develop the FuseLinker, a novel link prediction framework for biomedical knowledge graphs (BKGs), which fully exploits the graph’s structural, textual and domain knowledge information. We evaluated the utility of FuseLinker in the graph-based drug repurposing task through detailed case studies.</div></div><div><h3>Methods</h3><div>FuseLinker leverages fused pre-trained text embedding and domain knowledge embedding to enhance the graph neural network (GNN)-based link prediction model tailored for BKGs. This framework includes three parts: a) obtain text embeddings for BKGs using embedding-visible large language models (LLMs), b) learn the representations of medical ontology as domain knowledge information by employing the Poincaré graph embedding method, and c) fuse these embeddings and further learn the graph structure representations of BKGs by applying a GNN-based link prediction model. We evaluated FuseLinker against traditional knowledge graph embedding models and a conventional GNN-based link prediction model across four public BKG datasets. Additionally, we examined the impact of using different embedding-visible LLMs on FuseLinker’s performance. Finally, we investigated FuseLinker’s ability to generate medical hypotheses through two drug repurposing case studies for Sorafenib and Parkinson’s disease.</div></div><div><h3>Results</h3><div>By comparing FuseLinker with baseline models on four BKGs, our method demonstrates superior performance. 
The Mean Reciprocal Rank (MRR) and Area Under receiver operating characteristic Curve (AUROC) for KEGG50k, Hetionet, SuppKG and ADInt are 0.969 and 0.987, 0.548 and 0.903, 0.739 and 0.928, and 0.831 and 0.890, respectively.</div></div><div><h3>Conclusion</h3><div>Our study demonstrates that FuseLinker is an effective novel link prediction framework that integrates multiple graph information and shows significant potential for practical applications in biomedical and clinical tasks. Source code and data are available at https://github.com/YKXia0/FuseLinker.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104730"},"PeriodicalIF":4.0,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiaqi Sun , Chen Zhang , Linlin Xing , Longbo Zhang , Hongzhen Cai , Maozu Guo
{"title":"BAMRE: Joint extraction model of Chinese medical entities and relations based on Biaffine transformation with relation attention","authors":"Jiaqi Sun , Chen Zhang , Linlin Xing , Longbo Zhang , Hongzhen Cai , Maozu Guo","doi":"10.1016/j.jbi.2024.104733","DOIUrl":"10.1016/j.jbi.2024.104733","url":null,"abstract":"<div><div>Electronic Health Records (EHRs) contain various valuable medical entities and their relationships. Although the extraction of biomedical relationships has achieved good results in the mining of electronic health records and the construction of biomedical knowledge bases, there are still some problems. There may be implied complex associations between entities and relationships in overlapping triplets, and ignoring these interactions may lead to a decrease in the accuracy of entity extraction. To address this issue, a joint extraction model for medical entity relations based on a relation attention mechanism is proposed. The relation extraction module identifies candidate relationships within a sentence. The attention mechanism based on these relationships assigns weights to contextual words in the sentence that are associated with different relationships. Additionally, it extracts the subject and object entities. Under a specific relationship, entity vector representations are utilized to construct a global entity matching matrix based on Biaffine transformations. This matrix is designed to enhance the semantic dependencies and relational representations between entities, enabling triplet extraction. 
This allows the two subtasks of named entity recognition and relation extraction to be interrelated, fully utilizing contextual information within the sentence, and effectively addresses the issue of overlapping triplets.</div><div>Experiments on the CMeIE Chinese medical relation extraction dataset and the Baidu2019 Chinese dataset confirm that our approach achieves a higher <span><math><mrow><mi>F</mi><mn>1</mn></mrow></math></span> score than all cutting-edge baselines. Moreover, it offers substantial performance improvements in complex situations involving diverse overlapping patterns, large numbers of triplets, and cross-sentence triplets.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104733"},"PeriodicalIF":4.0,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142377853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaohong Li , Guoheng Huang , Lianglun Cheng , Guo Zhong , Weihuang Liu , Xuhang Chen , Muyan Cai
{"title":"Cross-domain visual prompting with spatial proximity knowledge distillation for histological image classification","authors":"Xiaohong Li , Guoheng Huang , Lianglun Cheng , Guo Zhong , Weihuang Liu , Xuhang Chen , Muyan Cai","doi":"10.1016/j.jbi.2024.104728","DOIUrl":"10.1016/j.jbi.2024.104728","url":null,"abstract":"<div><h3>Objective:</h3><div>Histological classification is a challenging task due to the diverse appearances, unpredictable variations, and blurry edges of histological tissues. Recently, many approaches based on large networks have achieved satisfactory performance. However, most of these methods rely heavily on substantial computational resources and large high-quality datasets, limiting their practical application. Knowledge Distillation (KD) offers a promising solution by enabling smaller networks to achieve performance comparable to that of larger networks. Nonetheless, KD is hindered by the problem of high-dimensional characteristics, which makes it difficult to capture tiny scattered features and often leads to the loss of edge feature relationships.</div></div><div><h3>Methods:</h3><div>A novel cross-domain visual prompting distillation approach is proposed, compelling the teacher network to facilitate the extraction of significant high-dimensional features into low-dimensional feature maps, thereby aiding the student network in achieving superior performance. Additionally, a dynamic learnable temperature module based on novel vector-based spatial proximity is introduced to further encourage the student to imitate the teacher.</div></div><div><h3>Results:</h3><div>Experiments conducted on widely accepted histological datasets, NCT-CRC-HE-100K and LC25000, demonstrate the effectiveness of the proposed method and validate its robustness on the popular dermoscopic dataset ISIC-2019. 
Compared to state-of-the-art knowledge distillation methods, the proposed method achieves better performance and greater robustness with optimal domain adaptation.</div></div><div><h3>Conclusion:</h3><div>A novel distillation architecture, termed VPSP, tailored for histological classification, is proposed. This architecture achieves superior performance with optimal domain adaptation, enhancing the clinical application of histological classification. The source code will be released at https://github.com/xiaohongji/VPSP.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104728"},"PeriodicalIF":4.0,"publicationDate":"2024-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142288115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junrong Song , Zhiming Song , Yuanli Gong , Lichang Ge , Wenlu Lou
{"title":"Advancing cancer driver gene identification through an integrative network and pathway approach","authors":"Junrong Song , Zhiming Song , Yuanli Gong , Lichang Ge , Wenlu Lou","doi":"10.1016/j.jbi.2024.104729","DOIUrl":"10.1016/j.jbi.2024.104729","url":null,"abstract":"<div><h3>Objective</h3><div>Cancer is a complex genetic disease characterized by the accumulation of various mutations, with driver genes playing a crucial role in cancer initiation and progression. Distinguishing driver genes from passenger mutations is essential for understanding cancer biology and discovering therapeutic targets. However, the majority of existing methods ignore the mutational heterogeneity and commonalities among patients, which hinders more effective identification of driver genes.</div></div><div><h3>Methods</h3><div>This study introduces MCSdriver, a novel computational model that integrates network and pathway information to prioritize the identification of cancer driver genes. MCSdriver employs a bidirectional random walk algorithm to quantify the mutual exclusivity and functional relationships between mutated genes within patient cohorts. It calculates similarity scores based on a mutual exclusivity-weighted network and pathway coverage patterns, accounting for patient-specific heterogeneity and molecular profile similarity.</div></div><div><h3>Results</h3><div>This approach enhances the accuracy and quality of driver gene identification. MCSdriver demonstrates superior performance in identifying cancer driver genes across four cancer types from The Cancer Genome Atlas, showing higher F-score, recall, and precision than existing ranking list-based and module-based models.</div></div><div><h3>Conclusion</h3><div>The MCSdriver model not only outperforms other models in identifying known cancer driver genes but also effectively identifies novel driver genes involved in cancer-related biological processes. 
The model’s consideration of patient-specific heterogeneity and similarity in molecular profiles significantly enhances the accuracy and quality of driver gene identification. Validation through Gene Ontology enrichment analysis and literature mining further underscores its potential application value in personalized cancer therapy, offering a promising tool for advancing our understanding and treatment of cancer.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104729"},"PeriodicalIF":4.0,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142288114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maxim Kryukov , Kathleen P. Moriarty , Macarena Villamea , Ingrid O’Dwyer , Ohn Chow , Flavio Dormont , Ramon Hernandez , Ziv Bar-Joseph , Brandon Rufino
{"title":"Proxy endpoints — bridging clinical trials and real world data","authors":"Maxim Kryukov , Kathleen P. Moriarty , Macarena Villamea , Ingrid O’Dwyer , Ohn Chow , Flavio Dormont , Ramon Hernandez , Ziv Bar-Joseph , Brandon Rufino","doi":"10.1016/j.jbi.2024.104723","DOIUrl":"10.1016/j.jbi.2024.104723","url":null,"abstract":"<div><h3>Objective:</h3><p>Disease severity scores, or endpoints, are routinely measured during Randomized Controlled Trials (RCTs) to closely monitor the effect of treatment. In real-world clinical practice, although a larger set of patients is observed, the specific RCT endpoints are often not captured, which makes it hard to utilize real-world data (RWD) to evaluate drug efficacy in larger populations.</p></div><div><h3>Methods:</h3><p>To overcome this challenge, we developed an ensemble technique which learns proxy models of disease endpoints in RWD. Using a multi-stage learning framework applied to RCT data, we first identify features considered significant drivers of disease available within RWD. To create endpoint proxy models, we use Explainable Boosting Machines (EBMs) which allow for both end-user interpretability and modeling of non-linear relationships.</p></div><div><h3>Results:</h3><p>We demonstrate our approach on two diseases, rheumatoid arthritis (RA) and atopic dermatitis (AD). As we show, our combined feature selection and prediction method achieves good results for both disease areas, improving upon prior methods proposed for predictive disease severity scoring.</p></div><div><h3>Conclusion:</h3><p>Having disease severity over time for a patient is important to further disease understanding and management. Our results open the door to more use cases in the space of RA and AD such as treatment effect estimates or prognostic scoring on RWD. 
Our framework may be extended beyond RA and AD to other diseases where the severity score is not well measured in electronic health records.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104723"},"PeriodicalIF":4.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1532046424001412/pdfft?md5=7711cb401e9e3526c4adf1c9e025c587&pid=1-s2.0-S1532046424001412-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142274254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Md Mushfiqur Rahman , Mohammad Sabik Irbaz , Kai North, Michelle S. Williams, Marcos Zampieri, Kevin Lybarger
{"title":"Health text simplification: An annotated corpus for digestive cancer education and novel strategies for reinforcement learning","authors":"Md Mushfiqur Rahman , Mohammad Sabik Irbaz , Kai North, Michelle S. Williams, Marcos Zampieri, Kevin Lybarger","doi":"10.1016/j.jbi.2024.104727","DOIUrl":"10.1016/j.jbi.2024.104727","url":null,"abstract":"<div><h3>Objective:</h3><p>The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass widely accepted standards for reading level and complexity. There is a critical need for high-performing text simplification models for health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality.</p></div><div><h3>Methods:</h3><p>We introduce <em>Simplified Digestive Cancer</em> (SimpleDC), a parallel corpus of cancer education materials tailored for health text simplification research, comprising educational content from the American Cancer Society, Centers for Disease Control and Prevention, and National Cancer Institute. The corpus includes 31 web pages with the corresponding manually simplified versions. It consists of 1183 annotated sentence pairs (361 train, 294 development, and 528 test). Utilizing SimpleDC and the existing Med-EASi corpus, we explore Large Language Model (LLM)-based simplification methods, including fine-tuning, reinforcement learning (RL), reinforcement learning with human feedback (RLHF), domain adaptation, and prompt-based approaches. Our experimentation encompasses Llama 2, Llama 3, and GPT-4. 
We introduce a novel RLHF reward function featuring a lightweight model adept at distinguishing between original and simplified texts, which enables training on unlabeled data.</p></div><div><h3>Results:</h3><p>Fine-tuned Llama models demonstrated high performance across various metrics. Our RLHF reward function outperformed existing RL text simplification reward functions. The results underscore that RL/RLHF can achieve performance comparable to fine-tuning and improve the performance of fine-tuned models. Additionally, these methods effectively adapt out-of-domain text simplification models to a target domain. The best-performing RL-enhanced Llama models outperformed GPT-4 in both automatic metrics and manual evaluation by subject matter experts.</p></div><div><h3>Conclusion:</h3><p>The newly developed SimpleDC corpus will serve as a valuable asset to the research community, particularly in patient education simplification. The RL/RLHF methodologies presented herein enable effective training of simplification models on unlabeled text and the utilization of out-of-domain simplification corpora.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104727"},"PeriodicalIF":4.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142243177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven A. De La Torre , Mohamed El Mistiri , Eric Hekler , Predrag Klasnja , Benjamin Marlin , Misha Pavel , Donna Spruijt-Metz , Daniel E. Rivera
{"title":"Modeling engagement with a digital behavior change intervention (HeartSteps II): An exploratory system identification approach","authors":"Steven A. De La Torre , Mohamed El Mistiri , Eric Hekler , Predrag Klasnja , Benjamin Marlin , Misha Pavel , Donna Spruijt-Metz , Daniel E. Rivera","doi":"10.1016/j.jbi.2024.104721","DOIUrl":"10.1016/j.jbi.2024.104721","url":null,"abstract":"<div><h3>Objective</h3><p>Digital behavior change interventions (DBCIs) are feasibly effective tools for addressing physical activity. However, in-depth understanding of participants’ long-term engagement with DBCIs remains sparse. Since the effectiveness of DBCIs to impact behavior change depends, in part, upon participant engagement, there is a need to better understand engagement as a dynamic process in response to an individual’s ever-changing biological, psychological, social, and environmental context.</p></div><div><h3>Methods</h3><p>The year-long micro-randomized trial (MRT) <em>HeartSteps II</em> provides an unprecedented opportunity to investigate DBCI engagement among ethnically diverse participants. We combined data streams from wearable sensors (Fitbit Versa, i.e., walking behavior), the <em>HeartSteps II</em> app (i.e. page views), and ecological momentary assessments (EMAs, i.e. perceived intrinsic and extrinsic motivation) to build the idiographic models. A system identification approach and a fluid analogy model were used to conduct autoregressive with exogenous input (ARX) analyses that tested hypothesized relationships between these variables inspired by Self-Determination Theory (SDT) with DBCI engagement through time.</p></div><div><h3>Results</h3><p>Data from 11 <em>HeartSteps II</em> participants was used to test aspects of the hypothesized SDT dynamic model. The average age was 46.33 (SD=7.4) years, and the average steps per day at baseline was 5,507 steps (SD=6,239). 
The hypothesized 5-input SDT-inspired ARX model for app engagement resulted in a 31.75 % weighted RMSEA (31.50 % on validation and 31.91 % on estimation), indicating that the model predicted app page views almost 32 % better relative to the mean of the data. Among Hispanic/Latino participants, the average overall model fit across inventories of the SDT fluid analogy was 34.22 % (SD=10.53) compared to 22.39 % (SD=6.36) among non-Hispanic/Latino Whites, a difference of 11.83 %. Across individuals, the number of daily notification prompts received by the participant was positively associated with increased app page views. The weekend/weekday indicator and perceived daily busyness were also found to be key predictors of the number of daily application page views.</p></div><div><h3>Conclusions</h3><p>This novel approach has significant implications for both personalized and adaptive DBCIs by identifying factors that foster or undermine engagement in an individual’s respective context. Once identified, these factors can be tailored to promote engagement and support sustained behavior change over time.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104721"},"PeriodicalIF":4.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1532046424001394/pdfft?md5=4f63dda9bba243570e4ff38291614e5d&pid=1-s2.0-S1532046424001394-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142243167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Magdalena Wysocka , Oskar Wysocki , Maxime Delmas , Vincent Mutel , André Freitas
{"title":"Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation","authors":"Magdalena Wysocka , Oskar Wysocki , Maxime Delmas , Vincent Mutel , André Freitas","doi":"10.1016/j.jbi.2024.104724","DOIUrl":"10.1016/j.jbi.2024.104724","url":null,"abstract":"<div><h3>Objective:</h3><p>The paper introduces a framework for the evaluation of the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially define a step change in biomedical discovery, reducing the barriers for accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery.</p></div><div><h3>Methods:</h3><p>The framework involves three evaluation steps, each assessing different aspects sequentially: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment on the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks: chemical compound definition generation and chemical compound–fungus relation determination.</p></div><div><h3>Results:</h3><p>Although recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. 
The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted.</p></div><div><h3>Conclusion:</h3><p>While LLMs are currently not fit for purpose to be used as biomedical factual knowledge bases in a zero-shot setting, there is a promising emerging property in the direction of factuality as the models become domain specialised, scale up in size and level of human feedback.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104724"},"PeriodicalIF":4.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1532046424001424/pdfft?md5=ac0ecdf9dc0e6bc7bc1738a6853505c0&pid=1-s2.0-S1532046424001424-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142243180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Zhao , Danushka Bollegala , Shunsuke Hirose , Yingzi Jin , Tomotake Kozu
{"title":"Community knowledge graph abstraction for enhanced link prediction: A study on PubMed knowledge graph","authors":"Yang Zhao , Danushka Bollegala , Shunsuke Hirose , Yingzi Jin , Tomotake Kozu","doi":"10.1016/j.jbi.2024.104725","DOIUrl":"10.1016/j.jbi.2024.104725","url":null,"abstract":"<div><h3>Objective:</h3><p>As new knowledge is produced at a rapid pace in the biomedical field, existing biomedical Knowledge Graphs (KGs) cannot be manually updated in a timely manner. Previous work in Natural Language Processing (NLP) has leveraged link prediction to infer the missing knowledge in general-purpose KGs. Inspired by this, we propose to apply link prediction to existing biomedical KGs to infer missing knowledge. Although Knowledge Graph Embedding (KGE) methods are effective in link prediction tasks, they are less capable of capturing relations between communities of entities with specific attributes (Fanourakis et al., 2023).</p></div><div><h3>Methods:</h3><p>To address this challenge, we proposed an entity distance-based method for abstracting a Community Knowledge Graph (CKG) from a simplified version of the pre-existing PubMed Knowledge Graph (PKG) (Xu et al., 2020). For link prediction on the abstracted CKG, we proposed an extension approach for the existing KGE models by linking the information in the PKG to the abstracted CKG. The applicability of this extension was proved by employing six well-known KGE models: TransE, TransH, DistMult, ComplEx, SimplE, and RotatE. Evaluation metrics including Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@<span><math><mi>k</mi></math></span> were used to assess the link prediction performance. In addition, we presented a backtracking process that traces the results of CKG link prediction back to the PKG scale for further comparison.</p></div><div><h3>Results:</h3><p>Six different CKGs were abstracted from the PKG by using embeddings of the six KGE methods. 
The results of link prediction in these abstracted CKGs indicate that our proposed extension can improve the existing KGE methods, achieving a top-10 accuracy of 0.69 compared to 0.5 for TransE, 0.7 compared to 0.54 for TransH, 0.67 compared to 0.6 for DistMult, 0.73 compared to 0.57 for ComplEx, 0.73 compared to 0.63 for SimplE, and 0.85 compared to 0.76 for RotatE on their CKGs, respectively. These improved performances also highlight the wide applicability of the extension approach.</p></div><div><h3>Conclusion:</h3><p>This study offers novel insights into abstracting CKGs from the PKG. The extension approach improved the performance of the existing KGE methods and is widely applicable. As an interesting future extension, we plan to conduct link prediction for entities that are newly introduced to the PKG.</p></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"158 ","pages":"Article 104725"},"PeriodicalIF":4.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1532046424001436/pdfft?md5=1241f4473bb8cac3c0c3666b4968750a&pid=1-s2.0-S1532046424001436-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142243176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}