Unifying Sequence-Structure Coding for Advanced Protein Engineering via a Multimodal Diffusion Transformer

IF 7.6 | Region 1: Chemistry | JCR Q1: CHEMISTRY, MULTIDISCIPLINARY
Xiaohan Lin, Zhenyu Chen, Yanheng Li, Zicheng Ma, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Jun Zhang, Yi Qin Gao
Chemical Science, published 2025-05-15. DOI: 10.1039/d5sc02055g (https://doi.org/10.1039/d5sc02055g)

Abstract

Modern protein engineering demands integrated sequence-structure representations to tackle key challenges in designing, modifying, and evolving proteins for specific functions. While sequence-based methods are promising for generating novel proteins, incorporating structure-oriented information improves the success rate and helps target the corresponding functions. Therefore, rather than relying solely on sequence-based or structure-based approaches, a consensus strategy is essential. Here, we introduce ProTokens, machine-learned “amino acids” derived from structural databases via self-supervised learning, which provide a compact yet information-rich representation that bridges the sequence and structure modalities. Instead of treating sequences and structures separately, we build PT-DiT, a multimodal diffusion transformer-based model that integrates both into a unified representation. This enables protein engineering in a joint sequence-structure space, streamlines the design process, and facilitates the efficient encoding of 3D folds, contextual protein design, sampling of metastable states, and directed evolution for diverse objectives. As a unified solution for in-silico protein engineering, PT-DiT thus leverages sequence and structure insights to realize functional protein design.
Source journal: Chemical Science (CHEMISTRY, MULTIDISCIPLINARY)
CiteScore: 14.40
Self-citation rate: 4.80%
Articles published: 1352
Review time: 2.1 months
About the journal: Chemical Science is a journal that encompasses various disciplines within the chemical sciences. It publishes ground-breaking research that has significant implications for its own field and also appeals to a wider audience in related areas. To be considered for publication, articles must present innovative and original advances in their field and be written so that scientists from diverse backgrounds can understand them. The journal generally does not publish highly specialized research.