Learning to Rank Complex Biomedical Hypotheses for Accelerating Scientific Discovery.

IEEE International Conference on Healthcare Informatics. Pub Date: 2024-06-01. Epub Date: 2024-08-22. DOI: 10.1109/ichi61247.2024.00044

Juncheng Ding, Shailesh Dahal, Bijaya Adhikari, Kishlay Jha

Volume 2024, pp. 285-293. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11920884/pdf/

Abstract

Hypothesis generation (HG) is a fundamental problem in biomedical text mining that uncovers plausible implicit links ($B$ terms) between two disjoint concepts of interest ($A$ and $C$ terms). Over the past decade, many HG approaches based on distributional statistics, graph-theoretic measures, and supervised machine learning have been proposed. Despite significant advances, existing approaches have two major limitations. First, they mainly focus on enumerating hypotheses and often neglect to rank them in a semantically meaningful way. This wastes time and resources, because researchers may pursue hypotheses that are ultimately not supported by experimental evidence. Second, existing approaches are designed to rank hypotheses with only one intermediate (evidence) term, referred to as simple hypotheses, and thus cannot handle hypotheses with multiple intermediate terms, referred to as complex hypotheses. This is limiting because recent research has shown that complex hypotheses could be of greater practical value than simple ones, especially in the early stages of scientific discovery. To address these issues, we propose a new HG ranking approach that leverages the expressive power of Graph Neural Networks (GNNs) coupled with a domain-knowledge-guided Noise-Contrastive Estimation (NCE) strategy to effectively rank both simple and complex biomedical hypotheses. Specifically, the message-passing capabilities of GNNs allow our approach to capture the rich interactions between biomedical entities and to handle complex hypotheses with a variable number of intermediate terms. Moreover, the proposed domain-knowledge-guided NCE strategy enables the ranking of complex hypotheses based on their coherence with established biomedical knowledge. Extensive experimental results on five recognized biomedical datasets show that the proposed approach consistently outperforms existing baselines and prioritizes hypotheses worthy of potential clinical trials.
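As a rough illustration of the two ingredients named in the abstract, the sketch below runs one round of mean-aggregation message passing over a toy concept graph, scores a hypothesis path of any length, and contrasts an observed path against a noise path with a logistic loss. It is not the authors' model: the node names, feature vectors, scoring function, and the hand-picked noise path are all hypothetical, and the loss is the negative-sampling simplification of NCE (it omits the noise-distribution correction term), with no learned parameters.

```python
import math

def message_pass(features, edges, rounds=1):
    """Mean-aggregate each node's own and neighbor features (assumed toy GNN layer)."""
    neighbors = {node: {node} for node in features}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    h = {node: list(vec) for node, vec in features.items()}
    for _ in range(rounds):
        h = {
            node: [sum(h[m][d] for m in nbrs) / len(nbrs) for d in range(len(h[node]))]
            for node, nbrs in neighbors.items()
        }
    return h

def score_hypothesis(h, path):
    """Mean dot product over consecutive nodes of a path A -> B1 -> ... -> Bk -> C.

    Averaging over links lets simple (one B term) and complex (several B terms)
    hypotheses share one scale. This scoring rule is an assumption of the sketch.
    """
    pairs = list(zip(path, path[1:]))
    return sum(sum(a * b for a, b in zip(h[u], h[v])) for u, v in pairs) / len(pairs)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_loss(pos_score, noise_scores):
    """Logistic contrastive loss: raise the observed score, lower the noise scores."""
    return -(log_sigmoid(pos_score) + sum(log_sigmoid(-s) for s in noise_scores))

# Hypothetical toy graph: A and C are linked only through B1 and B2; X is unrelated.
features = {
    "A": [1.0, 0.0], "B1": [1.0, 0.2], "B2": [0.8, 0.4],
    "C": [0.6, 0.6], "X": [-1.0, 0.5],
}
edges = [("A", "B1"), ("B1", "B2"), ("B2", "C")]

h = message_pass(features, edges, rounds=1)
complex_score = score_hypothesis(h, ["A", "B1", "B2", "C"])  # two intermediate terms
noise_score = score_hypothesis(h, ["A", "X", "C"])           # hand-picked noise path
loss = nce_loss(complex_score, [noise_score])

ranked = sorted(
    [("A-B1-B2-C", complex_score), ("A-X-C", noise_score)],
    key=lambda item: item[1],
    reverse=True,
)
print(ranked, loss)
```

In a trained system the embeddings would be learned by minimizing such a loss over many observed and noise hypotheses; here the noise path is chosen by hand, whereas the abstract describes noise selection guided by domain knowledge.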

