Nature Methods. Pub Date: 2024-11-28. DOI: 10.1038/s41592-024-02526-w
{"title":"Enhancing functional gene set analysis with large language models","authors":"","doi":"10.1038/s41592-024-02526-w","DOIUrl":"10.1038/s41592-024-02526-w","url":null,"abstract":"Large language models (LLMs) demonstrate potential as assistants in functional genomics, offering a new avenue for gene set analysis. In our evaluation of five LLMs, GPT-4 was the top-performing model and generated common functions for gene sets with high specificity, reliable self-assessed confidence and supporting analysis, complementing traditional functional enrichment.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"22 1","pages":"22-23"},"PeriodicalIF":36.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142751277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-28. DOI: 10.1038/s41592-024-02523-z
Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, Thomas Pierrot
{"title":"Nucleotide Transformer: building and evaluating robust foundation models for human genomics.","authors":"Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago, Evan Trop, Bernardo P de Almeida, Hassan Sirelkhatim, Guillaume Richard, Marcin Skwark, Karim Beguir, Marie Lopez, Thomas Pierrot","doi":"10.1038/s41592-024-02523-z","DOIUrl":"https://doi.org/10.1038/s41592-024-02523-z","url":null,"abstract":"The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. 
The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":" ","pages":""},"PeriodicalIF":36.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142751283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-28. DOI: 10.1038/s41592-024-02525-x
Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, Dylan Fong, Kevin Smith, Robin Bachelder, Trey Ideker, Dexter Pratt
{"title":"Evaluation of large language models for discovery of gene set function","authors":"Mengzhou Hu, Sahar Alkhairy, Ingoo Lee, Rudolf T. Pillich, Dylan Fong, Kevin Smith, Robin Bachelder, Trey Ideker, Dexter Pratt","doi":"10.1038/s41592-024-02525-x","DOIUrl":"10.1038/s41592-024-02525-x","url":null,"abstract":"Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants. Large language models show potential in suggesting common functions for a gene set.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"22 1","pages":"82-91"},"PeriodicalIF":36.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142751279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-28. DOI: 10.1038/s41592-024-02524-y
{"title":"Generalized AI models for genomics applications.","authors":"","doi":"10.1038/s41592-024-02524-y","DOIUrl":"https://doi.org/10.1038/s41592-024-02524-y","url":null,"abstract":"","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":" ","pages":""},"PeriodicalIF":36.1,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142751281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction.","authors":"Duanhua Cao, Mingan Chen, Runze Zhang, Zhaokun Wang, Manlin Huang, Jie Yu, Xinyu Jiang, Zhehuan Fan, Wei Zhang, Hao Zhou, Xutong Li, Zunyun Fu, Sulin Zhang, Mingyue Zheng","doi":"10.1038/s41592-024-02516-y","DOIUrl":"https://doi.org/10.1038/s41592-024-02516-y","url":null,"abstract":"Accurately predicting protein-ligand interactions is crucial for understanding cellular processes. We introduce SurfDock, a deep-learning method that addresses this challenge by integrating protein sequence, three-dimensional structural graphs and surface-level features into an equivariant architecture. SurfDock employs a generative diffusion model on a non-Euclidean manifold, optimizing molecular translations, rotations and torsions to generate reliable binding poses. Our extensive evaluations across various benchmarks demonstrate SurfDock's superiority over existing methods in docking success rates and adherence to physical constraints. It also exhibits remarkable generalizability to unseen proteins and predicted apo structures, while achieving state-of-the-art performance in virtual screening tasks. In a real-world application, SurfDock identified seven novel hit molecules in a virtual screening project targeting aldehyde dehydrogenase 1B1, a key enzyme in cellular metabolism. This showcases SurfDock's ability to elucidate molecular mechanisms underlying cellular processes. 
These results highlight SurfDock's potential as a transformative tool in structural biology, offering enhanced accuracy, physical plausibility and practical applicability in understanding protein-ligand interactions.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":" ","pages":""},"PeriodicalIF":36.1,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142739944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-25. DOI: 10.1038/s41592-024-02507-z
Shouxiang Zhang, Tze Cin Owyong, Oana Sanislav, Lukas Englmaier, Xiaojing Sui, Geqing Wang, David W. Greening, Nicholas A. Williamson, Andreas Villunger, Jonathan M. White, Begoña Heras, Wallace W. H. Wong, Paul R. Fisher, Yuning Hong
{"title":"Global analysis of endogenous protein disorder in cells","authors":"Shouxiang Zhang, Tze Cin Owyong, Oana Sanislav, Lukas Englmaier, Xiaojing Sui, Geqing Wang, David W. Greening, Nicholas A. Williamson, Andreas Villunger, Jonathan M. White, Begoña Heras, Wallace W. H. Wong, Paul R. Fisher, Yuning Hong","doi":"10.1038/s41592-024-02507-z","DOIUrl":"10.1038/s41592-024-02507-z","url":null,"abstract":"Disorder and flexibility in protein structures are essential for biological function but can also contribute to diseases, such as neurodegenerative disorders. However, characterizing protein folding on a proteome-wide scale within biological matrices remains challenging. Here we present a method using a bifunctional chemical probe, named TME, to capture in situ, enrich and quantify endogenous protein disorder in cells. TME exhibits a fluorescence turn-on effect upon selective conjugation with proteins with free cysteines in surface-exposed and flexible environments—a distinctive signature of protein disorder. Using an affinity-based proteomic approach, we identify both basal disordered proteins and those whose folding status changes under stress, with coverage to proteins even of low abundance. In lymphoblastoid cells from individuals with Parkinson’s disease and healthy controls, our TME-based strategy distinguishes the two groups more effectively than lysate profiling methods. High-throughput TME fluorescence and proteomics further reveal a universal cellular quality-control mechanism in which cells adapt to proteostatic stress by adopting aggregation-prone distributions and sequestering disordered proteins, as illustrated in Huntington’s disease cell models. This article reports a method based on a bifunctional chemical probe called TME and a workflow named RUBICON to capture, enrich and profile endogenous disordered proteins in cells. 
The method enables a proteome-wide analysis of protein disorder via high-throughput fluorescence and mass spectrometry-based proteomics.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"22 1","pages":"124-134"},"PeriodicalIF":36.1,"publicationDate":"2024-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142716540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-22. DOI: 10.1038/s41592-024-02535-9
Yujie Zhang, Lu Bai, Xin Wang, Yuchen Zhao, Tianlei Zhang, Lichen Ye, Xufei Du, Zhe Zhang, Jiulin Du, Kai Wang
{"title":"Super-resolution imaging of fast morphological dynamics of neurons in behaving animals","authors":"Yujie Zhang, Lu Bai, Xin Wang, Yuchen Zhao, Tianlei Zhang, Lichen Ye, Xufei Du, Zhe Zhang, Jiulin Du, Kai Wang","doi":"10.1038/s41592-024-02535-9","DOIUrl":"10.1038/s41592-024-02535-9","url":null,"abstract":"Neurons are best studied in their native states in which their functional and morphological dynamics support animals’ natural behaviors. Super-resolution microscopy can potentially reveal these dynamics in higher details but has been challenging in behaving animals due to severe motion artifacts. Here we report multiplexed, line-scanning, structured illumination microscopy, which can tolerate motion of up to 50 μm s−1 while achieving 150-nm and 100-nm lateral resolutions in its linear and nonlinear forms, respectively. We continuously imaged the dynamics of spinules in dendritic spines and axonal boutons volumetrically over thousands of frames and tens of minutes in head-fixed mouse brains during sleep–wake cycles. Super-resolution imaging of axonal boutons revealed spinule dynamics on a scale of seconds. Simultaneous two-color imaging further enabled analyses of the spatial distributions of diverse PSD-95 clusters and opened up possibilities to study their correlations with the structural dynamics of dendrites in the brains of head-fixed awake mice. 
A variant of structured illumination microscopy called MLS-SIM allows super-resolution imaging of neuronal structures such as spinules, spines and boutons in awake mice.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"22 1","pages":"177-186"},"PeriodicalIF":36.1,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142693312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-22. DOI: 10.1038/s41592-024-02513-1
Minxing Pang, Tarun Kanti Roy, Xiaodong Wu, Kai Tan
{"title":"CelloType: a unified model for segmentation and classification of tissue images.","authors":"Minxing Pang, Tarun Kanti Roy, Xiaodong Wu, Kai Tan","doi":"10.1038/s41592-024-02513-1","DOIUrl":"10.1038/s41592-024-02513-1","url":null,"abstract":"Cell segmentation and classification are critical tasks in spatial omics data analysis. Here we introduce CelloType, an end-to-end model designed for cell segmentation and classification for image-based spatial omics data. Unlike the traditional two-stage approach of segmentation followed by classification, CelloType adopts a multitask learning strategy that integrates these tasks, simultaneously enhancing the performance of both. CelloType leverages transformer-based deep learning techniques for improved accuracy in object detection, segmentation and classification. It outperforms existing segmentation methods on a variety of multiplexed fluorescence and spatial transcriptomic images. In terms of cell type classification, CelloType surpasses a model composed of state-of-the-art methods for individual tasks and a high-performance instance segmentation model. Using multiplexed tissue images, we further demonstrate the utility of CelloType for multiscale segmentation and classification of both cellular and noncellular elements in a tissue. 
The enhanced accuracy and multitask learning ability of CelloType facilitate automated annotation of rapidly growing spatial omics data.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":" ","pages":""},"PeriodicalIF":36.1,"publicationDate":"2024-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142693311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-21. DOI: 10.1038/s41592-024-02487-0
Tao Shen, Zhihang Hu, Siqi Sun, Di Liu, Felix Wong, Jiuming Wang, Jiayang Chen, Yixuan Wang, Liang Hong, Jin Xiao, Liangzhen Zheng, Tejas Krishnamoorthi, Irwin King, Sheng Wang, Peng Yin, James J. Collins, Yu Li
{"title":"Accurate RNA 3D structure prediction using a language model-based deep learning approach","authors":"Tao Shen, Zhihang Hu, Siqi Sun, Di Liu, Felix Wong, Jiuming Wang, Jiayang Chen, Yixuan Wang, Liang Hong, Jin Xiao, Liangzhen Zheng, Tejas Krishnamoorthi, Irwin King, Sheng Wang, Peng Yin, James J. Collins, Yu Li","doi":"10.1038/s41592-024-02487-0","DOIUrl":"10.1038/s41592-024-02487-0","url":null,"abstract":"Accurate prediction of RNA three-dimensional (3D) structures remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to the scarcity of experimentally determined data, complicates computational prediction efforts. Here we present RhoFold+, an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences. By integrating an RNA language model pretrained on ~23.7 million RNA sequences and leveraging techniques to address data scarcity, RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction. Retrospective evaluations on RNA-Puzzles and CASP15 natural RNA targets demonstrate the superiority of RhoFold+ over existing methods, including human expert groups. Its efficacy and generalizability are further validated through cross-family and cross-type assessments, as well as time-censored benchmarks. Additionally, RhoFold+ predicts RNA secondary structures and interhelical angles, providing empirically verifiable features that broaden its applicability to RNA structure and function studies. 
RhoFold+ is an end-to-end language model-based deep learning method to predict RNA three-dimensional structures of single-chain RNAs from sequences.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"21 12","pages":"2287-2298"},"PeriodicalIF":36.1,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.nature.com/articles/s41592-024-02487-0.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142687584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nature Methods. Pub Date: 2024-11-21. DOI: 10.1038/s41592-024-02488-z
{"title":"Large language modeling and deep learning shed light on RNA structure prediction","authors":"","doi":"10.1038/s41592-024-02488-z","DOIUrl":"10.1038/s41592-024-02488-z","url":null,"abstract":"We present an RNA language model-based deep learning pipeline for accurate and rapid de novo RNA 3D structure prediction, demonstrating strong accuracy in modeling single-stranded RNAs and excellent generalization across RNA families and types while also being capable of capturing local features such as interhelical angles and secondary structures.","PeriodicalId":18981,"journal":{"name":"Nature Methods","volume":"21 12","pages":"2237-2238"},"PeriodicalIF":36.1,"publicationDate":"2024-11-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142687586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}