Journal of Biomedical Informatics: Latest Articles

FuseLinker: Leveraging LLM’s pre-trained text embeddings and domain knowledge to enhance GNN-based link prediction on biomedical knowledge graphs
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-10-01 DOI: 10.1016/j.jbi.2024.104730
Objective: To develop the FuseLinker, a novel link prediction framework for biomedical knowledge graphs (BKGs), which fully exploits the graph’s structural, textual and domain knowledge information. We evaluated the utility of FuseLinker in the graph-based drug repurposing task through detailed case studies.

Methods: FuseLinker leverages fused pre-trained text embedding and domain knowledge embedding to enhance the graph neural network (GNN)-based link prediction model tailored for BKGs. This framework includes three parts: a) obtain text embeddings for BKGs using embedding-visible large language models (LLMs), b) learn the representations of medical ontology as domain knowledge information by employing the Poincaré graph embedding method, and c) fuse these embeddings and further learn the graph structure representations of BKGs by applying a GNN-based link prediction model. We evaluated FuseLinker against traditional knowledge graph embedding models and a conventional GNN-based link prediction model across four public BKG datasets. Additionally, we examined the impact of using different embedding-visible LLMs on FuseLinker’s performance. Finally, we investigated FuseLinker’s ability to generate medical hypotheses through two drug repurposing case studies for Sorafenib and Parkinson’s disease.

Results: By comparing FuseLinker with baseline models on four BKGs, our method demonstrates superior performance. The Mean Reciprocal Rank (MRR) and Area Under receiver operating characteristic Curve (AUROC) for KEGG50k, Hetionet, SuppKG and ADInt are 0.969 and 0.987, 0.548 and 0.903, 0.739 and 0.928, and 0.831 and 0.890, respectively.

Conclusion: Our study demonstrates that FuseLinker is an effective novel link prediction framework that integrates multiple graph information and shows significant potential for practical applications in biomedical and clinical tasks. Source code and data are available at https://github.com/YKXia0/FuseLinker.
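The two metrics reported in the Results above can be illustrated with a minimal sketch. It is not taken from the FuseLinker repository: the function names and the toy inputs are assumptions made for illustration, and AUROC is computed here by the plain pairwise definition (the probability that a true link scores above a false one), which suits small examples only.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank, where ranks[i] is the 1-based rank of the true entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of scores.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy data: ranks of the true tail entity for three test triples,
# and model scores for three true links and two sampled false links.
toy_ranks = [1, 2, 4]
toy_pos = [0.9, 0.8, 0.4]
toy_neg = [0.7, 0.3]

mrr_value = mean_reciprocal_rank(toy_ranks)
auroc_value = auroc(toy_pos, toy_neg)
```

MRR rewards placing the true entity near the top of the candidate list, while AUROC measures how well true and false links are separated overall, which is one reason the two numbers can differ widely on the same dataset.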
Citations: 0
Clinical outcome-guided deep temporal clustering for disease progression subtyping.
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-30 DOI: 10.1016/j.jbi.2024.104732
Dulin Wang, Xiaotian Ma, Paul E Schulz, Xiaoqian Jiang, Yejin Kim
Objective: Complex diseases exhibit heterogeneous progression patterns, necessitating effective capture and clustering of longitudinal changes to identify disease subtypes for personalized treatments. However, existing studies often fail to design clustering-specific representations or neglect clinical outcomes, thereby limiting the interpretability and clinical utility.

Method: We design a unified framework for subtyping longitudinal progressive diseases. We focus on effectively integrating all data from disease progressions and improving patient representation for downstream clustering. Specifically, we propose a clinical Outcome-Guided Deep Temporal Clustering (OG-DTC) that generates representations informed by clustering and clinical outcomes. A GRU-based seq2seq architecture captures the temporal dynamics, and the model integrates k-means clustering and outcome regression to facilitate the formation of clustering structures and the integration of clinical outcomes. The learned representations are clustered using a Gaussian mixture model to identify distinct subtypes. The clustering results are extensively validated through reproducibility, stability, and significance tests.

Results: We demonstrated the efficacy of our framework by applying it to three Alzheimer's Disease (AD) clinical trials. Through the AD case study, we identified three distinct subtypes with unique patterns associated with differentiated clinical declines across multiple measures. The ablation study revealed the contributions of each component in the model and showed that jointly optimizing the full model improved patient representations for clustering. Extensive validations showed that the derived clustering is reproducible, stable, and significant.

Conclusion: Our temporal clustering framework can derive robust clustering applicable for subtyping longitudinal progressive diseases and has the potential to account for subtype variability in clinical outcomes.
Citations: 0
Cross-domain visual prompting with spatial proximity knowledge distillation for histological image classification
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-21 DOI: 10.1016/j.jbi.2024.104728
Objective: Histological classification is a challenging task due to the diverse appearances, unpredictable variations, and blurry edges of histological tissues. Recently, many approaches based on large networks have achieved satisfactory performance. However, most of these methods rely heavily on substantial computational resources and large high-quality datasets, limiting their practical application. Knowledge Distillation (KD) offers a promising solution by enabling smaller networks to achieve performance comparable to that of larger networks. Nonetheless, KD is hindered by the problem of high-dimensional characteristics, which makes it difficult to capture tiny scattered features and often leads to the loss of edge feature relationships.

Methods: A novel cross-domain visual prompting distillation approach is proposed, compelling the teacher network to facilitate the extraction of significant high-dimensional features into low-dimensional feature maps, thereby aiding the student network in achieving superior performance. Additionally, a dynamic learnable temperature module based on novel vector-based spatial proximity is introduced to further encourage the student to imitate the teacher.

Results: Experiments conducted on widely accepted histological datasets, NCT-CRC-HE-100K and LC25000, demonstrate the effectiveness of the proposed method and validate its robustness on the popular dermoscopic dataset ISIC-2019. Compared to state-of-the-art knowledge distillation methods, the proposed method achieves better performance and greater robustness with optimal domain adaptation.

Conclusion: A novel distillation architecture, termed VPSP, tailored for histological classification, is proposed. This architecture achieves superior performance with optimal domain adaptation, enhancing the clinical application of histological classification. The source code will be released at https://github.com/xiaohongji/VPSP.
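For readers unfamiliar with the role temperature plays in knowledge distillation, the sketch below shows the standard fixed-temperature distillation loss (KL divergence between softened teacher and student distributions, scaled by the squared temperature). This is the generic baseline formulation, not the VPSP method: the abstract's temperature is dynamic and learned from spatial proximity, whereas here `temperature` is a plain constant, and all names and logits are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: Sequence[float],
            teacher_logits: Sequence[float],
            temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a three-class tissue classifier.
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = kd_loss(student, teacher, temperature=4.0)
```

A learnable temperature would replace the constant with a parameter updated during training; how VPSP derives that parameter is described only at a high level in the abstract and is not reproduced here.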
Citations: 0
Advancing cancer driver gene identification through an integrative network and pathway approach
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-19 DOI: 10.1016/j.jbi.2024.104729
Objective: Cancer is a complex genetic disease characterized by the accumulation of various mutations, with driver genes playing a crucial role in cancer initiation and progression. Distinguishing driver genes from passenger mutations is essential for understanding cancer biology and discovering therapeutic targets. However, the majority of existing methods ignore the mutational heterogeneity and commonalities among patients, which hinders the identification of driver genes more effectively.

Methods: This study introduces MCSdriver, a novel computational model that integrates network and pathway information to prioritize the identification of cancer driver genes. MCSdriver employs a bidirectional random walk algorithm to quantify the mutual exclusivity and functional relationships between mutated genes within patient cohorts. It calculates similarity scores based on a mutual exclusivity-weighted network and pathway coverage patterns, accounting for patient-specific heterogeneity and molecular profile similarity.

Results: This approach enhances the accuracy and quality of driver gene identification. MCSdriver demonstrates superior performance in identifying cancer driver genes across four cancer types from The Cancer Genome Atlas, showing a higher F-score, Recall and Precision compared to existing ranking list-based and module-based models.

Conclusion: The MCSdriver model not only outperforms other models in identifying known cancer driver genes but also effectively identifies novel driver genes involved in cancer-related biological processes. The model’s consideration of patient-specific heterogeneity and similarity in molecular profiles significantly enhances the accuracy and quality of driver gene identification. Validation through Gene Ontology enrichment analysis and literature mining further underscores its potential application value in personalized cancer therapy, offering a promising tool for advancing our understanding and treatment of cancer.
Citations: 0
Proxy endpoints — bridging clinical trials and real world data
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-17 DOI: 10.1016/j.jbi.2024.104723
Objective: Disease severity scores, or endpoints, are routinely measured during Randomized Controlled Trials (RCTs) to closely monitor the effect of treatment. In real-world clinical practice, although a larger set of patients is observed, the specific RCT endpoints are often not captured, which makes it hard to utilize real-world data (RWD) to evaluate drug efficacy in larger populations.

Methods: To overcome this challenge, we developed an ensemble technique which learns proxy models of disease endpoints in RWD. Using a multi-stage learning framework applied to RCT data, we first identify features considered significant drivers of disease available within RWD. To create endpoint proxy models, we use Explainable Boosting Machines (EBMs) which allow for both end-user interpretability and modeling of non-linear relationships.

Results: We demonstrate our approach on two diseases, rheumatoid arthritis (RA) and atopic dermatitis (AD). As we show, our combined feature selection and prediction method achieves good results for both disease areas, improving upon prior methods proposed for predictive disease severity scoring.

Conclusion: Having disease severity over time for a patient is important to further disease understanding and management. Our results open the door to more use cases in the space of RA and AD such as treatment effect estimates or prognostic scoring on RWD. Our framework may be extended beyond RA and AD to other diseases where the severity score is not well measured in electronic health records.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1532046424001412/pdfft?md5=7711cb401e9e3526c4adf1c9e025c587&pid=1-s2.0-S1532046424001412-main.pdf
Citations: 0
Health text simplification: An annotated corpus for digestive cancer education and novel strategies for reinforcement learning
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-16 DOI: 10.1016/j.jbi.2024.104727
Objective: The reading level of health educational materials significantly influences the understandability and accessibility of the information, particularly for minoritized populations. Many patient educational resources surpass widely accepted standards for reading level and complexity. There is a critical need for high-performing text simplification models for health information to enhance dissemination and literacy. This need is particularly acute in cancer education, where effective prevention and screening education can substantially reduce morbidity and mortality.

Methods: We introduce Simplified Digestive Cancer (SimpleDC), a parallel corpus of cancer education materials tailored for health text simplification research, comprising educational content from the American Cancer Society, Centers for Disease Control and Prevention, and National Cancer Institute. The corpus includes 31 web pages with the corresponding manually simplified versions. It consists of 1183 annotated sentence pairs (361 train, 294 development, and 528 test). Utilizing SimpleDC and the existing Med-EASi corpus, we explore Large Language Model (LLM)-based simplification methods, including fine-tuning, reinforcement learning (RL), reinforcement learning with human feedback (RLHF), domain adaptation, and prompt-based approaches. Our experimentation encompasses Llama 2, Llama 3, and GPT-4. We introduce a novel RLHF reward function featuring a lightweight model adept at distinguishing between original and simplified texts, which enables training on unlabeled data.

Results: Fine-tuned Llama models demonstrated high performance across various metrics. Our RLHF reward function outperformed existing RL text simplification reward functions. The results underscore that RL/RLHF can achieve performance comparable to fine-tuning and improve the performance of fine-tuned models. Additionally, these methods effectively adapt out-of-domain text simplification models to a target domain. The best-performing RL-enhanced Llama models outperformed GPT-4 in both automatic metrics and manual evaluation by subject matter experts.

Conclusion: The newly developed SimpleDC corpus will serve as a valuable asset to the research community, particularly in patient education simplification. The RL/RLHF methodologies presented herein enable effective training of simplification models on unlabeled text and the utilization of out-of-domain simplification corpora.
Citations: 0
Modeling engagement with a digital behavior change intervention (HeartSteps II): An exploratory system identification approach
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-13 DOI: 10.1016/j.jbi.2024.104721
Objective: Digital behavior change interventions (DBCIs) are feasibly effective tools for addressing physical activity. However, in-depth understanding of participants’ long-term engagement with DBCIs remains sparse. Since the effectiveness of DBCIs to impact behavior change depends, in part, upon participant engagement, there is a need to better understand engagement as a dynamic process in response to an individual’s ever-changing biological, psychological, social, and environmental context.

Methods: The year-long micro-randomized trial (MRT) HeartSteps II provides an unprecedented opportunity to investigate DBCI engagement among ethnically diverse participants. We combined data streams from wearable sensors (Fitbit Versa, i.e., walking behavior), the HeartSteps II app (i.e., page views), and ecological momentary assessments (EMAs, i.e., perceived intrinsic and extrinsic motivation) to build the idiographic models. A system identification approach and a fluid analogy model were used to conduct autoregressive with exogenous input (ARX) analyses that tested hypothesized relationships between these variables inspired by Self-Determination Theory (SDT) with DBCI engagement through time.

Results: Data from 11 HeartSteps II participants were used to test aspects of the hypothesized SDT dynamic model. The average age was 46.33 (SD=7.4) years, and the average steps per day at baseline was 5,507 steps (SD=6,239). The hypothesized 5-input SDT-inspired ARX model for app engagement resulted in a 31.75% weighted RMSEA (31.50% on validation and 31.91% on estimation), indicating that the model predicted app page views almost 32% better relative to the mean of the data. Among Hispanic/Latino participants, the average overall model fit across inventories of the SDT fluid analogy was 34.22% (SD=10.53) compared to 22.39% (SD=6.36) among non-Hispanic/Latino Whites, a difference of 11.83%. Across individuals, the number of daily notification prompts received by the participant was positively associated with increased app page views. The weekend/weekday indicator and perceived daily busyness were also found to be key predictors of the number of daily application page views.

Conclusions: This novel approach has significant implications for both personalized and adaptive DBCIs by identifying factors that foster or undermine engagement in an individual’s respective context. Once identified, these factors can be tailored to promote engagement and support sustained behavior change over time.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1532046424001394/pdfft?md5=4f63dda9bba243570e4ff38291614e5d&pid=1-s2.0-S1532046424001394-main.pdf
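The Results above describe model fit as improvement "relative to the mean of the data". The abstract does not give the formula, so the sketch below is an assumption: it uses the normalized root-mean-square fit percentage common in system identification, under which predicting the mean scores 0% and a perfect prediction scores 100%. The first-order, single-input ARX predictor and its coefficients `a` and `b` are likewise hypothetical and far simpler than the 5-input model in the study.

```python
import math
from typing import List, Sequence


def arx_one_step(y: Sequence[float], u: Sequence[float],
                 a: float, b: float) -> List[float]:
    """One-step-ahead ARX(1,1) predictions: y_hat[t] = a*y[t-1] + b*u[t-1], t >= 1."""
    return [a * y[t - 1] + b * u[t - 1] for t in range(1, len(y))]


def fit_percent(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Normalized fit: 100 * (1 - ||y - y_hat|| / ||y - mean(y)||)."""
    mean = sum(y_true) / len(y_true)
    error = math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)))
    spread = math.sqrt(sum((yt - mean) ** 2 for yt in y_true))
    return 100.0 * (1.0 - error / spread)


# Hypothetical daily data: app page views (output) and notification prompts (input).
page_views = [3.0, 4.0, 6.0, 5.0, 7.0]
prompts = [1.0, 2.0, 1.0, 3.0, 2.0]

predicted = arx_one_step(page_views, prompts, a=0.8, b=1.0)
# Predictions start at t = 1, so compare against page_views[1:].
fit = fit_percent(page_views[1:], predicted)
```

Under this reading, a fit near 32% would mean the model's prediction error is about a third smaller than that of a constant predictor equal to the mean.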
Citations: 0
Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-12 DOI: 10.1016/j.jbi.2024.104724
Objective: The paper introduces a framework for the evaluation of the encoding of factual scientific knowledge, designed to streamline the manual evaluation process typically conducted by domain experts. Inferring over and extracting information from Large Language Models (LLMs) trained on a large corpus of scientific literature can potentially define a step change in biomedical discovery, reducing the barriers for accessing and integrating existing medical evidence. This work explores the potential of LLMs for dialoguing with biomedical background knowledge, using the context of antibiotic discovery.

Methods: The framework involves three evaluation steps, each assessing different aspects sequentially: fluency, prompt alignment, semantic coherence, factual knowledge, and specificity of the generated responses. By splitting these tasks between non-experts and experts, the framework reduces the effort required from the latter. The work provides a systematic assessment on the ability of eleven state-of-the-art LLMs, including ChatGPT, GPT-4 and Llama 2, in two prompting-based tasks: chemical compound definition generation and chemical compound–fungus relation determination.

Results: Although recent models have improved in fluency, factual accuracy is still low and models are biased towards over-represented entities. The ability of LLMs to serve as biomedical knowledge bases is questioned, and the need for additional systematic evaluation frameworks is highlighted.

Conclusion: While LLMs are currently not fit for purpose to be used as biomedical factual knowledge bases in a zero-shot setting, there is a promising emerging property in the direction of factuality as the models become domain specialised, scale up in size and level of human feedback.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1532046424001424/pdfft?md5=ac0ecdf9dc0e6bc7bc1738a6853505c0&pid=1-s2.0-S1532046424001424-main.pdf
Citations: 0
Community knowledge graph abstraction for enhanced link prediction: A study on PubMed knowledge graph
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-10 DOI: 10.1016/j.jbi.2024.104725
Objective: As new knowledge is produced at a rapid pace in the biomedical field, existing biomedical Knowledge Graphs (KGs) cannot be manually updated in a timely manner. Previous work in Natural Language Processing (NLP) has leveraged link prediction to infer the missing knowledge in general-purpose KGs. Inspired by this, we propose to apply link prediction to existing biomedical KGs to infer missing knowledge. Although Knowledge Graph Embedding (KGE) methods are effective in link prediction tasks, they are less capable of capturing relations between communities of entities with specific attributes (Fanourakis et al., 2023).

Methods: To address this challenge, we proposed an entity distance-based method for abstracting a Community Knowledge Graph (CKG) from a simplified version of the pre-existing PubMed Knowledge Graph (PKG) (Xu et al., 2020). For link prediction on the abstracted CKG, we proposed an extension approach for the existing KGE models by linking the information in the PKG to the abstracted CKG. The applicability of this extension was proved by employing six well-known KGE models: TransE, TransH, DistMult, ComplEx, SimplE, and RotatE. Evaluation metrics including Mean Rank (MR), Mean Reciprocal Rank (MRR), and Hits@k were used to assess the link prediction performance. In addition, we presented a backtracking process that traces the results of CKG link prediction back to the PKG scale for further comparison.

Results: Six different CKGs were abstracted from the PKG by using embeddings of the six KGE methods. The results of link prediction in these abstracted CKGs indicate that our proposed extension can improve the existing KGE methods, achieving a top-10 accuracy of 0.69 compared to 0.5 for TransE, 0.7 compared to 0.54 for TransH, 0.67 compared to 0.6 for DistMult, 0.73 compared to 0.57 for ComplEx, 0.73 compared to 0.63 for SimplE, and 0.85 compared to 0.76 for RotatE on their CKGs, respectively. These improved performances also highlight the wide applicability of the extension approach.

Conclusion: This study proposed novel insights into abstracting CKGs from the PKG. The extension approach indicated enhanced performance of the existing KGE methods and has applicability. As an interesting future extension, we plan to conduct link prediction for entities that are newly introduced to the PKG.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1532046424001436/pdfft?md5=1241f4473bb8cac3c0c3666b4968750a&pid=1-s2.0-S1532046424001436-main.pdf
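The rank-based metrics named in the Methods above (Mean Rank and Hits@k) can be sketched as follows. The sketch assumes that the "top-10 accuracy" in the Results corresponds to Hits@10, which the abstract does not state outright; the function names, the scoring convention (higher score means more plausible) and the toy values are illustrative and not taken from the paper.

```python
from typing import Sequence


def rank_of_true(scores: Sequence[float], true_index: int) -> int:
    """1-based rank of the true candidate: one plus the number of higher-scoring candidates."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average rank of the true entity; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical candidate scores for one test triple; the true entity is at index 2.
candidate_scores = [0.2, 0.9, 0.7, 0.1]
single_rank = rank_of_true(candidate_scores, true_index=2)

# Hypothetical ranks collected over four test triples.
toy_ranks = [1, 3, 12, 7]
mr = mean_rank(toy_ranks)
hits10 = hits_at_k(toy_ranks, k=10)
```

Mean Rank is sensitive to a few badly ranked triples, whereas Hits@k only asks whether the true entity made the shortlist, so the two are usually reported together.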
Citations: 0
Extracting lung cancer staging descriptors from pathology reports: A generative language model approach
IF 4, Zone 2 (Medicine)
Journal of Biomedical Informatics Pub Date : 2024-09-01 DOI: 10.1016/j.jbi.2024.104720
Background: In oncology, electronic health records contain textual key information for the diagnosis, staging, and treatment planning of patients with cancer. However, text data processing requires a lot of time and effort, which limits the utilization of these data. Recent advances in natural language processing (NLP) technology, including large language models, can be applied to cancer research. Particularly, extracting the information required for the pathological stage from surgical pathology reports can be utilized to update cancer staging according to the latest cancer staging guidelines.

Objectives: This study has two main objectives. The first objective is to evaluate the performance of extracting information from text-based surgical pathology reports and determining pathological stages based on the extracted information using fine-tuned generative language models (GLMs) for patients with lung cancer. The second objective is to determine the feasibility of utilizing relatively small GLMs for information extraction in a resource-constrained computing environment.

Methods: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard with validation by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs which then were used to extract information required for staging from pathology reports.

Results: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best performance overall, with an exact match ratio of 92.24% in the information extraction problem and an accuracy of 0.9876 (predicting T and N classification concurrently) in classification.

Conclusion: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and GLMs with a relatively small number of parameters at approximately seven billion can achieve high performance in this problem. The proposed GLM-based information extraction method is expected to be useful in clinical decision-making support, lung cancer staging and research.

Open-access PDF: https://www.sciencedirect.com/science/article/pii/S1532046424001382/pdfft?md5=a07a39b7bc41fc8621f04b2757525870&pid=1-s2.0-S1532046424001382-main.pdf
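The exact match ratio reported in the Results above can be illustrated with a short sketch. The abstract does not say whether a match is counted per descriptor or per report; the version below assumes the stricter per-report reading, where every extracted descriptor must equal the gold standard. The descriptor names (`tumor_size_cm`, `pleural_invasion`) and values are hypothetical placeholders, not the study's 42 descriptors.

```python
from typing import Dict, Sequence

Report = Dict[str, str]


def exact_match_ratio(predictions: Sequence[Report], gold: Sequence[Report]) -> float:
    """Fraction of reports whose extracted descriptors all equal the gold standard."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    matches = sum(1 for pred, ref in zip(predictions, gold) if pred == ref)
    return matches / len(gold)


# Hypothetical gold-standard annotations and model outputs for three reports.
gold_reports = [
    {"tumor_size_cm": "2.3", "pleural_invasion": "absent"},
    {"tumor_size_cm": "4.1", "pleural_invasion": "present"},
    {"tumor_size_cm": "1.0", "pleural_invasion": "absent"},
]
model_reports = [
    {"tumor_size_cm": "2.3", "pleural_invasion": "absent"},
    {"tumor_size_cm": "4.1", "pleural_invasion": "absent"},  # one descriptor wrong
    {"tumor_size_cm": "1.0", "pleural_invasion": "absent"},
]

ratio = exact_match_ratio(model_reports, gold_reports)
```

Because one wrong descriptor fails the whole report under this definition, it is a demanding measure; a per-descriptor variant would simply compare individual key-value pairs instead of whole dictionaries.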
Citations: 0