Reprogramming pretrained language models for protein sequence representation learning

Impact factor: 6.2 · JCR Q1 (Chemistry, Multidisciplinary)

Digital Discovery · Pub Date: 2025-05-23 · DOI: 10.1039/D4DD00195H

Ria Vinod, Pin-Yu Chen and Payel Das

Volume 6, pages 1591-1601 · Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00195h

Citations: 0

Abstract

Machine learning-guided solutions for protein learning tasks have made significant headway in recent years. However, success in scientific discovery tasks is limited by the availability of well-defined, labeled in-domain data. To tackle this low-data constraint, recent adaptations of deep learning models pretrained on millions of protein sequences have shown promise; however, constructing such domain-specific large-scale models is computationally expensive. Herein, we propose representation reprogramming via dictionary learning (R2DL), an end-to-end representation learning framework in which we reprogram deep models for alternate-domain tasks that can perform well on protein property prediction with significantly fewer training samples. R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences by learning a sparse linear mapping between English and protein sequence vocabulary embeddings. Our model attains better accuracy and improves data efficiency by up to 10⁴ times over the baselines set by pretrained and standard supervised methods. To this end, we reprogram several recent state-of-the-art pretrained English language classification models (BERT, TinyBERT, T5, and RoBERTa) and benchmark them on a set of protein physicochemical prediction tasks (secondary structure, stability, homology, and solubility) as well as on a biomedically relevant set of protein function prediction tasks (antimicrobial, toxicity, antibody affinity, and protein–protein interaction).
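The central idea in the abstract, a sparse linear mapping from English vocabulary embeddings to protein vocabulary embeddings, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the abstract does not give the training procedure, so the mapping `theta` here is random and then sparsified by soft-thresholding purely to show the shapes involved. The names `soft_threshold`, `protein_vocab_embeddings`, `embed_sequence`, and the random `english_embeddings` table are all hypothetical stand-ins.

```python
import numpy as np

# Toy protein vocabulary: the 20 standard amino acids (an assumption for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def soft_threshold(theta, lam):
    """Shrink entries toward zero; entries with |value| <= lam become exactly 0."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)


def protein_vocab_embeddings(theta, english_embeddings):
    """Each protein token embedding is a linear combination of English token embeddings."""
    return theta @ english_embeddings


def embed_sequence(seq, theta, english_embeddings):
    """Look up the reprogrammed embedding of every residue in a protein sequence."""
    v_t = protein_vocab_embeddings(theta, english_embeddings)
    idx = [AMINO_ACIDS.index(residue) for residue in seq]
    return v_t[idx]


rng = np.random.default_rng(0)
n_english, dim = 1000, 64

# Stand-in for the frozen embedding table of a pretrained English model.
english_embeddings = rng.normal(size=(n_english, dim))

# Hypothetical mapping: random here, learned from task data in a real setting.
theta_dense = rng.normal(size=(len(AMINO_ACIDS), n_english))
theta = soft_threshold(theta_dense, lam=2.0)  # most entries become exactly zero

seq_emb = embed_sequence("MKTAYIAK", theta, english_embeddings)
print(seq_emb.shape, float(np.mean(theta != 0)))
```

In this sketch only `theta` would be trainable, while the English embedding table stays fixed; the resulting per-residue embeddings could then be passed to the pretrained model in place of its usual token embeddings.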


Source journal

Digital discovery

CiteScore

2.80

Self-citation rate

0.00%

Articles published