Toward De Novo Protein Design from Natural Language

Fengyuan Dai, Yuliang Fan, Jin Su, Chentong Wang, Chenchen Han, Xibin Zhou, Jianming Liu, Hui Qian, Shunzhi Wang, Anping Zeng, Yajie Wang, Fajie Yuan
{"title":"Toward De Novo Protein Design from Natural Language","authors":"Fengyuan Dai, Yuliang Fan, Jin Su, Chentong Wang, Chenchen Han, Xibin Zhou, Jianming Liu, Hui Qian, Shunzhi Wang, Anping Zeng, Yajie Wang, Fajie Yuan","doi":"10.1101/2024.08.01.606258","DOIUrl":null,"url":null,"abstract":"De novo protein design (DNPD) aims to create new protein sequences from scratch, without relying on existing protein templates. However, current deep learning-based DNPD approaches are often limited by their focus on specific or narrowly defined protein designs, restricting broader exploration and the discovery of diverse, functional proteins. To address this issue, we introduce Pinal, a probabilistic sampling method that generates protein sequences using rich natural language as guidance. Unlike end-to-end text-to-sequence generation approaches, we employ a two-stage generative process. Initially, we generate structures based on given language instructions, followed by designing sequences conditioned on both the structure and the language. This approach facilitates searching within the smaller structure space rather than the vast sequence space. Experiments demonstrate that Pinal outperforms existing models, including the concurrent work ESM3, and can generalize to novel protein structures outside the training distribution when provided with appropriate instructions. This work aims to aid the biological community by advancing the design of novel proteins, and our code will be made publicly available soon.","PeriodicalId":501308,"journal":{"name":"bioRxiv - Bioengineering","volume":"189 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv - Bioengineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.08.01.606258","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

De novo protein design (DNPD) aims to create new protein sequences from scratch, without relying on existing protein templates. However, current deep learning-based DNPD approaches are often limited by their focus on specific or narrowly defined protein designs, restricting broader exploration and the discovery of diverse, functional proteins. To address this issue, we introduce Pinal, a probabilistic sampling method that generates protein sequences using rich natural language as guidance. Unlike end-to-end text-to-sequence generation approaches, we employ a two-stage generative process. Initially, we generate structures based on given language instructions, followed by designing sequences conditioned on both the structure and the language. This approach facilitates searching within the smaller structure space rather than the vast sequence space. Experiments demonstrate that Pinal outperforms existing models, including the concurrent work ESM3, and can generalize to novel protein structures outside the training distribution when provided with appropriate instructions. This work aims to aid the biological community by advancing the design of novel proteins, and our code will be made publicly available soon.
通过自然语言实现新蛋白质设计
全新蛋白质设计(DNPD)旨在不依赖现有蛋白质模板,从头开始创建新的蛋白质序列。然而,目前基于深度学习的 DNPD 方法往往局限于特定或狭义的蛋白质设计,限制了更广泛的探索和对多样化功能蛋白质的发现。为了解决这个问题,我们引入了 Pinal,这是一种以丰富的自然语言为指导生成蛋白质序列的概率抽样方法。与端到端文本到序列生成方法不同,我们采用了两阶段生成过程。首先,我们根据给定的语言指令生成结构,然后根据结构和语言设计序列。这种方法有利于在较小的结构空间而非广阔的序列空间内进行搜索。实验证明,Pinal 的性能优于现有模型,包括同时进行的工作 ESM3,而且在提供适当指令的情况下,还能泛化到训练分布之外的新蛋白质结构。这项工作旨在通过推进新型蛋白质的设计来帮助生物界,我们的代码将很快公开发布。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信