Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

Sully F. Chen, Robert J. Steele, Beakal Lemeneh, Shivanand P. Lad, Eric Oermann

arXiv - QuanBio - Biomolecules, 2024-08-29. arXiv:2408.16245
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually nucleotides or peptides. These models have seen incredible success in downstream tasks in each domain and have achieved particularly noteworthy breakthroughs in peptide sequence and structural modeling. However, these single-omic models are naturally incapable of modeling multi-omic tasks, one of the most biologically critical being nucleotide-peptide interactions.

We present our work training the first multi-omic nucleotide-peptide foundation models. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology, despite being trained only on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on peptide-nucleotide interaction tasks: predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given oligonucleotide and peptide, as well as the effect of mutations in the oligonucleotide sequence on that binding interaction (ΔΔG).
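
For reference, these two targets follow the standard thermodynamic conventions (this is textbook notation, not lifted from the paper):

```latex
\Delta G = -RT \ln K_a, \qquad
\Delta\Delta G = \Delta G_{\text{mutant}} - \Delta G_{\text{wild type}}
```

where K_a is the association constant of the peptide-oligonucleotide complex, so a more negative ΔG means tighter binding and a positive ΔΔG means the mutation weakens it.
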
Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any prior structural training, allowing us to predict which peptide residues are most involved in the peptide-nucleotide binding interaction.
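
The abstract does not say how this residue-level signal is extracted; one common way to read residue involvement out of a trained transformer is to aggregate cross-modality attention. Below is a minimal PyTorch sketch under that assumption, with a HuggingFace-style model/tokenizer interface, a single-token separator, and one token per base/residue all being illustrative simplifications rather than the authors' API:

```python
import torch

def residue_importance(model, tokenizer, peptide: str, nucleotide: str):
    """Score each peptide residue by the attention it receives from
    nucleotide tokens, averaged over all layers and heads.

    Assumes a hypothetical HuggingFace-style interface where
    model(..., output_attentions=True) returns per-layer attention
    tensors of shape (batch, heads, seq_len, seq_len).
    """
    # Hypothetical joint encoding: nucleotide and peptide concatenated
    # with a separator, as a multi-omic model might consume them.
    ids = tokenizer(nucleotide + "<sep>" + peptide, return_tensors="pt")
    # Simplification: assume one token per base and no special tokens.
    n_len = len(tokenizer(nucleotide)["input_ids"])

    with torch.no_grad():
        out = model(**ids, output_attentions=True)

    # (layers, batch, heads, seq, seq) -> average over layers and heads.
    attn = torch.stack(out.attentions).squeeze(1).mean(dim=(0, 1))

    # Rows index query tokens (nucleotides), columns index key tokens;
    # the peptide span starts after the nucleotides and the separator.
    pep = slice(n_len + 1, n_len + 1 + len(peptide))
    scores = attn[:n_len, pep].mean(dim=0)  # one score per residue
    return scores / scores.sum()  # normalize to a distribution
```

Gradient-based attributions on the ΔG head would be a natural alternative; the abstract only establishes that such structural signal is recoverable, not how it is read out.
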
Lastly, we provide evidence that multi-omic biosequence models are non-inferior to foundation models trained on single-omic distributions, suggesting a more generalized or foundational approach to building these models.
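
To make the fine-tuning setup concrete, here is a minimal sketch of how a ΔG regression head could sit on top of a pretrained multi-omic encoder. Every name here (the class, the pooling choice, the HuggingFace-style backbone interface) is illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DeltaGRegressor(nn.Module):
    """Pretrained multi-omic encoder plus a small regression head for ΔG.

    `encoder` is assumed to be any pretrained transformer that maps token
    ids to hidden states of shape (batch, seq_len, hidden) via a
    HuggingFace-style `.last_hidden_state`; this mirrors the usual recipe
    for fine-tuning a foundation model on a scalar target.
    """

    def __init__(self, encoder, hidden_dim: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, 1),  # scalar ΔG prediction
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state
        # Mean-pool over non-padding tokens (one of several pooling choices).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled).squeeze(-1)

# Typical fine-tuning fragment: regress against measured ΔG values.
# model = DeltaGRegressor(pretrained_mom, hidden_dim=768)
# loss = nn.functional.mse_loss(model(ids, mask), delta_g_targets)
```

For ΔΔG, one natural readout is the difference of two forward passes, ΔG(mutant) − ΔG(wild type), rather than a separate head; whether the paper does this or trains a dedicated ΔΔG objective is not stated in the abstract.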