Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference.

IF 6.8 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics Pub Date : 2025-03-04 DOI:10.1093/bib/bbaf113

Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, Qi Dai, Huiyan Sun

{"title":"Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference.","authors":"Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, Qi Dai, Huiyan Sun","doi":"10.1093/bib/bbaf113","DOIUrl":null,"url":null,"abstract":"<p><p>Identifying genes causally linked to cancer from a multi-omics perspective is essential for understanding the mechanisms of cancer and improving therapeutic strategies. Traditional statistical and machine-learning methods that rely on generalized correlation approaches to identify cancer genes often produce redundant, biased predictions with limited interpretability, largely due to overlooking confounding factors, selection biases, and the nonlinear activation function in neural networks. In this study, we introduce a novel framework for identifying cancer genes across multiple omics domains, named ICGI (Integrative Causal Gene Identification), which leverages a large language model (LLM) prompted with causality contextual cues and prompts, in conjunction with data-driven causal feature selection. This approach demonstrates the effectiveness and potential of LLMs in uncovering cancer genes and comprehending disease mechanisms, particularly at the genomic level. However, our findings also highlight that current LLMs may not capture comprehensive information across all omics levels. By applying the proposed causal feature selection module to transcriptomic datasets from six cancer types in The Cancer Genome Atlas and comparing its performance with state-of-the-art methods, it demonstrates superior capability in identifying cancer genes that distinguish between cancerous and normal samples. Additionally, we have developed an online service platform that allows users to input a gene of interest and a specific cancer type. The platform provides automated results indicating whether the gene plays a significant role in cancer, along with clear and accessible explanations. Moreover, the platform summarizes the inference outcomes obtained from data-driven causal learning methods.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8000,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899576/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbaf113","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Identifying genes causally linked to cancer from a multi-omics perspective is essential for understanding the mechanisms of cancer and improving therapeutic strategies. Traditional statistical and machine-learning methods that rely on generalized correlation approaches to identify cancer genes often produce redundant, biased predictions with limited interpretability, largely due to overlooking confounding factors, selection biases, and the nonlinear activation function in neural networks. In this study, we introduce a novel framework for identifying cancer genes across multiple omics domains, named ICGI (Integrative Causal Gene Identification), which leverages a large language model (LLM) prompted with causality contextual cues and prompts, in conjunction with data-driven causal feature selection. This approach demonstrates the effectiveness and potential of LLMs in uncovering cancer genes and comprehending disease mechanisms, particularly at the genomic level. However, our findings also highlight that current LLMs may not capture comprehensive information across all omics levels. By applying the proposed causal feature selection module to transcriptomic datasets from six cancer types in The Cancer Genome Atlas and comparing its performance with state-of-the-art methods, it demonstrates superior capability in identifying cancer genes that distinguish between cancerous and normal samples. Additionally, we have developed an online service platform that allows users to input a gene of interest and a specific cancer type. The platform provides automated results indicating whether the gene plays a significant role in cancer, along with clear and accessible explanations. Moreover, the platform summarizes the inference outcomes obtained from data-driven causal learning methods.

查看原文本刊更多论文

整合因果提示大语言模型与组学数据驱动的因果推理的癌症基因鉴定。

从多组学的角度确定与癌症有因果关系的基因对于理解癌症的机制和改进治疗策略至关重要。传统的统计和机器学习方法依赖于广义相关方法来识别癌症基因，通常会产生冗余的、有偏差的预测，其可解释性有限，这主要是由于忽略了混杂因素、选择偏差和神经网络中的非线性激活函数。在这项研究中，我们引入了一个新的框架，用于识别跨多个组学域的癌症基因，称为ICGI（综合因果基因鉴定），它利用了一个由因果关系上下文线索和提示提示的大型语言模型（LLM），并结合数据驱动的因果特征选择。这种方法证明了llm在揭示癌症基因和理解疾病机制方面的有效性和潜力，特别是在基因组水平上。然而，我们的研究结果也强调，目前的法学硕士可能无法捕获所有组学水平的全面信息。通过将提出的因果特征选择模块应用于癌症基因组图谱中六种癌症类型的转录组数据集，并将其性能与最先进的方法进行比较，该模块显示了识别癌症基因以区分癌症和正常样本的卓越能力。此外，我们还开发了一个在线服务平台，允许用户输入感兴趣的基因和特定的癌症类型。该平台提供自动结果，表明该基因是否在癌症中起重要作用，并提供清晰易懂的解释。此外，该平台还总结了数据驱动的因果学习方法所获得的推理结果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.