Evaluation of large language models for discovery of gene set function

IF 36.1 · Zone 1 (Biology) · Q1 BIOCHEMICAL RESEARCH METHODS

Nature Methods Pub Date : 2024-11-28 DOI:10.1038/s41592-024-02525-x

Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, Dylan Fong, Kevin Smith, Robin Bachelder, Trey Ideker, Dexter Pratt

Nature Methods, volume 22, issue 1, pages 82-91

Citations: 0

Abstract

Evaluation of large language models for discovery of gene set function

Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants. Large language models show potential in suggesting common functions for a gene set.
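The two headline percentages in the abstract (the share of curated sets for which the suggested name resembles the curated name, and the share of random sets that receive zero confidence) are simple proportions over per-set results. The following is a minimal sketch of that bookkeeping, not the authors' pipeline. The abstract does not say how similarity is measured, so a word-level Jaccard overlap stands in for it here. The `Call` record, the 0.5 threshold, and all example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One model response for one gene set (hypothetical record layout)."""
    proposed: str      # function name suggested by the model
    curated: str       # curated name; empty string for a random gene set
    confidence: float  # model's self-reported confidence in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in, not the paper's measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fraction_similar(calls, threshold=0.5):
    """Share of curated sets whose proposed name passes the threshold."""
    scored = [c for c in calls if c.curated]
    if not scored:
        return 0.0
    hits = sum(jaccard(c.proposed, c.curated) >= threshold for c in scored)
    return hits / len(scored)


def fraction_zero_confidence(calls):
    """Share of sets for which the model reported zero confidence."""
    if not calls:
        return 0.0
    return sum(c.confidence == 0 for c in calls) / len(calls)


# Toy data, invented for illustration only.
curated_sets = [
    Call("DNA damage repair", "DNA repair", 0.90),
    Call("ribosome biogenesis", "ribosome biogenesis", 0.95),
    Call("lipid transport", "regulation of cell cycle", 0.30),
    Call("mitotic cell cycle", "cell cycle", 0.80),
]
random_sets = [
    Call("system of unrelated proteins", "", 0.0),
    Call("system of unrelated proteins", "", 0.0),
    Call("immune signaling", "", 0.4),
    Call("system of unrelated proteins", "", 0.0),
]

print(fraction_similar(curated_sets))         # share of similar names
print(fraction_zero_confidence(random_sets))  # share with zero confidence
```

A token-overlap score is crude for biological terms, where synonyms share no words; an embedding-based similarity would be a natural replacement for `jaccard` without changing the rest of the sketch.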


Source journal

Nature Methods (Biology: Biochemical Research Methods)

CiteScore

58.70

Self-citation rate

1.70%

Articles published

326

Review time

1 month

About the journal: Nature Methods is a monthly journal that focuses on publishing innovative methods and substantial enhancements to fundamental life sciences research techniques. Geared towards a diverse, interdisciplinary readership of researchers in academia and industry engaged in laboratory work, the journal offers new tools for research and emphasizes the immediate practical significance of the featured work. It publishes primary research papers and reviews recent technical and methodological advancements, with a particular interest in primary methods papers relevant to the biological and biomedical sciences. This includes methods rooted in chemistry with practical applications for studying biological problems.