In the twilight zone of protein sequence homology: do protein language models learn protein structure?

IF 2.4 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY
Bioinformatics Advances · Pub Date: 2024-08-17 · eCollection Date: 2024-01-01 · DOI: 10.1093/bioadv/vbae119
Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu
{"title":"蛋白质序列同源性的黄昏地带:蛋白质语言模型能学习蛋白质结构吗?","authors":"Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu","doi":"10.1093/bioadv/vbae119","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.</p><p><strong>Results: </strong>We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the \"twilight zone\" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.</p><p><strong>Availability and implementation: </strong>We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.</p>","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344590/pdf/","citationCount":"0","resultStr":"{\"title\":\"In the twilight zone of protein sequence homology: do protein language models learn protein structure?\",\"authors\":\"Anowarul Kabir, Asher Moldwin, Yana Bromberg, Amarda Shehu\",\"doi\":\"10.1093/bioadv/vbae119\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Motivation: </strong>Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.</p><p><strong>Results: </strong>We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the \\\"twilight zone\\\" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. 
This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.</p><p><strong>Availability and implementation: </strong>We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.</p>\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-08-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11344590/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbae119\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbae119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.
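
To make the setup concrete, the sketch below shows one common way to pull a fixed-size representation of a protein from a transformer-based protein language model. It uses the small Hugging Face ESM-2 checkpoint facebook/esm2_t6_8M_UR50D and mean pooling purely for illustration; the paper evaluates a range of models (millions to billions of parameters), and its released code may extract embeddings differently.

```python
# Sketch: mean-pooled per-protein embedding from a small ESM-2 checkpoint.
# Model choice and pooling strategy are illustrative assumptions, not the
# paper's exact pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "facebook/esm2_t6_8M_UR50D"  # 8M-parameter ESM-2 (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(sequence: str) -> torch.Tensor:
    """Return one vector per protein: the mean over residue positions."""
    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # last_hidden_state: (1, L + 2, d); positions 0 and -1 are the
    # <cls>/<eos> special tokens ESM-2 adds, so we drop them.
    return out.last_hidden_state[0, 1:-1].mean(dim=0)

vec = embed("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)  # torch.Size([320]) for this checkpoint
```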

Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the "twilight zone" of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak.
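
As a hedged illustration of the zero-shot protocol described above: once embeddings are computed, candidate homologs can be ranked by embedding similarity with no task-specific training, and performance can then be profiled within sequence-identity bins (the twilight zone is conventionally placed below roughly 20-35% identity). The helper names and the cosine-similarity scoring below are assumptions for the sketch, not necessarily the paper's exact method.

```python
# Sketch: zero-shot remote-homolog ranking by cosine similarity of
# protein-language-model embeddings (helper names are hypothetical).
import torch
import torch.nn.functional as F

def rank_candidates(query_vec: torch.Tensor,
                    db_vecs: torch.Tensor) -> torch.Tensor:
    """Return candidate indices sorted from most to least similar.

    query_vec: (d,) embedding of the query protein.
    db_vecs:   (n, d) embeddings of the candidate proteins.
    """
    sims = F.cosine_similarity(query_vec.unsqueeze(0), db_vecs, dim=1)
    return torch.argsort(sims, descending=True)

def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Naive identity over two pre-aligned, equal-length sequences
    (gaps as '-'); a real pipeline would use a proper aligner."""
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    cols = sum(a != "-" or b != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / max(cols, 1)
```

A query-candidate pair would then be assigned to an identity bin (e.g. below 25% identity for the twilight zone) so that retrieval accuracy can be reported separately as sequence signal weakens.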

Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
