Speaking to nature: a deep learning representational model of proteins ushers in protein linguistics

IF 2.6 Q2 BIOCHEMICAL RESEARCH METHODS
Synthetic biology (Oxford, England) · Pub Date: 2019-05-21 · eCollection Date: 2019-01-01 · DOI: 10.1093/synbio/ysz013
Daniel Bojar
{"title":"与自然对话:蛋白质的深度学习表征模型引入了蛋白质语言学。","authors":"Daniel Bojar","doi":"10.1093/synbio/ysz013","DOIUrl":null,"url":null,"abstract":"Understanding, modifying and designing proteins require an intimate knowledge of their 3D structure. Even structure-agnostic protein engineering approaches, such as directed evolution, are limited in scope because of the vast potential sequence space and the epistatic effects that multiple mutations have on protein function. To overcome these difficulties, a holistic understanding of sequence–structure–function relationships has to be established. In their recent preprint, members of the Church Group at the Wyss Institute and collaborators describe a novel approach to predicting protein stability and functionality from raw sequence (1). Their representational model UniRep (unified representation), for the first time, demonstrates an advanced understanding of protein features by means of language modeling. Using deep learning techniques, which were recently recognized with the prestigious Turing Award, Alley et al. built a language model for proteins with amino acids as characters based on natural language processing (NLP) techniques. NLP has not only revolutionized our computational understanding of language—think for instance voice-to-text software—but has been coopted for exciting applications in synthetic biology. The recurrent neural network (RNN; a type of neural network which can process sequential inputs such as text) used by Alley et al. was trained by iteratively predicting the next amino acid given the preceding amino acids for the 24 million protein sequences contained in the UniRef50 database. The RNN thus gathered implicit knowledge about the context of a given amino acid and higher-level features such as secondary structure. The authors then averaged the protein representation of their RNN at every sequence position to yield a protein language representation they call UniRep. They then extended UniRep by adding representations of the final sequence position of their RNN to generate the more complete representation called ‘UniRep Fusion’, which serves as an overview of the entire protein sequence. UniRep Fusion was then used as an input for a machine learning model to predict protein stability. Notably, this architecture was more accurate than Rosetta, the de facto state-ofthe-art for predicting protein stability. Their protein language representation allowed the authors to predict the relative brightness of 64 800 GFP mutants differing in as few as one amino acids. Remarkably, their predicted relative brightness values correlated strongly with experimental observation (r1⁄4 0.98). UniRep, as the representation of 24 million proteins, captures many phenomena of general importance for protein structure and function. These general features can be complemented by dataset-specific attributes when training on a subset of protein mutants or de novo designed proteins. This approach could for instance be adopted for screening novel proteins generated by deep learning models. Analogous to de novo designed proteins by Rosetta, generating proteins through protein language models might be most advantageous for proteins with radically new functionalities, which are unlikely to be generated by incremental directed evolution. To arrive in this virtual world of protein engineering though, more advances have to be made. It required the authors of UniRep 1 week of GPU usage to train their large model for one epoch (seeing every protein sequence in UniRef50 once). 
Switching from the redundancy-ridden UniRef50 database ( 24 million sequences) to preUEP (2), a redundancy-reduced protein sequence database ( 8 million sequences), might enable faster training. This reductionist approach might allow for the ‘vocabulary’ of the model to be extended from single amino acids to larger protein fragments, capturing more structural properties. In general, there are a plethora of NLP techniques developed for written languages which might be useful in protein linguistics. One particularly promising concept would be attention (3), the selective focus on sequence stretches far away from each other, which dramatically improves language models. Given that protein language may be considered one of the most natural languages by definition, modern NLP techniques could transform protein linguistics into a potent tool for the study as well as engineering of proteins for the purposes of synthetic biology.","PeriodicalId":74902,"journal":{"name":"Synthetic biology (Oxford, England)","volume":"4 1","pages":"ysz013"},"PeriodicalIF":2.6000,"publicationDate":"2019-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1093/synbio/ysz013","citationCount":"0","resultStr":"{\"title\":\"Speaking to nature: a deep learning representational model of proteins ushers in protein linguistics.\",\"authors\":\"Daniel Bojar\",\"doi\":\"10.1093/synbio/ysz013\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Understanding, modifying and designing proteins require an intimate knowledge of their 3D structure. Even structure-agnostic protein engineering approaches, such as directed evolution, are limited in scope because of the vast potential sequence space and the epistatic effects that multiple mutations have on protein function. To overcome these difficulties, a holistic understanding of sequence–structure–function relationships has to be established. In their recent preprint, members of the Church Group at the Wyss Institute and collaborators describe a novel approach to predicting protein stability and functionality from raw sequence (1). Their representational model UniRep (unified representation), for the first time, demonstrates an advanced understanding of protein features by means of language modeling. Using deep learning techniques, which were recently recognized with the prestigious Turing Award, Alley et al. built a language model for proteins with amino acids as characters based on natural language processing (NLP) techniques. NLP has not only revolutionized our computational understanding of language—think for instance voice-to-text software—but has been coopted for exciting applications in synthetic biology. The recurrent neural network (RNN; a type of neural network which can process sequential inputs such as text) used by Alley et al. was trained by iteratively predicting the next amino acid given the preceding amino acids for the 24 million protein sequences contained in the UniRef50 database. The RNN thus gathered implicit knowledge about the context of a given amino acid and higher-level features such as secondary structure. The authors then averaged the protein representation of their RNN at every sequence position to yield a protein language representation they call UniRep. They then extended UniRep by adding representations of the final sequence position of their RNN to generate the more complete representation called ‘UniRep Fusion’, which serves as an overview of the entire protein sequence. 
UniRep Fusion was then used as an input for a machine learning model to predict protein stability. Notably, this architecture was more accurate than Rosetta, the de facto state-ofthe-art for predicting protein stability. Their protein language representation allowed the authors to predict the relative brightness of 64 800 GFP mutants differing in as few as one amino acids. Remarkably, their predicted relative brightness values correlated strongly with experimental observation (r1⁄4 0.98). UniRep, as the representation of 24 million proteins, captures many phenomena of general importance for protein structure and function. These general features can be complemented by dataset-specific attributes when training on a subset of protein mutants or de novo designed proteins. This approach could for instance be adopted for screening novel proteins generated by deep learning models. Analogous to de novo designed proteins by Rosetta, generating proteins through protein language models might be most advantageous for proteins with radically new functionalities, which are unlikely to be generated by incremental directed evolution. To arrive in this virtual world of protein engineering though, more advances have to be made. It required the authors of UniRep 1 week of GPU usage to train their large model for one epoch (seeing every protein sequence in UniRef50 once). Switching from the redundancy-ridden UniRef50 database ( 24 million sequences) to preUEP (2), a redundancy-reduced protein sequence database ( 8 million sequences), might enable faster training. This reductionist approach might allow for the ‘vocabulary’ of the model to be extended from single amino acids to larger protein fragments, capturing more structural properties. In general, there are a plethora of NLP techniques developed for written languages which might be useful in protein linguistics. One particularly promising concept would be attention (3), the selective focus on sequence stretches far away from each other, which dramatically improves language models. 
Given that protein language may be considered one of the most natural languages by definition, modern NLP techniques could transform protein linguistics into a potent tool for the study as well as engineering of proteins for the purposes of synthetic biology.\",\"PeriodicalId\":74902,\"journal\":{\"name\":\"Synthetic biology (Oxford, England)\",\"volume\":\"4 1\",\"pages\":\"ysz013\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2019-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1093/synbio/ysz013\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Synthetic biology (Oxford, England)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/synbio/ysz013\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2019/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"BIOCHEMICAL RESEARCH METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Synthetic biology (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/synbio/ysz013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2019/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
引用次数: 0

Abstract

Understanding, modifying and designing proteins require an intimate knowledge of their 3D structure. Even structure-agnostic protein engineering approaches, such as directed evolution, are limited in scope because of the vast potential sequence space and the epistatic effects that multiple mutations have on protein function. To overcome these difficulties, a holistic understanding of sequence–structure–function relationships has to be established.

In their recent preprint, members of the Church Group at the Wyss Institute and collaborators describe a novel approach to predicting protein stability and functionality from raw sequence (1). Their representational model UniRep (unified representation) demonstrates, for the first time, an advanced understanding of protein features by means of language modeling. Using deep learning, whose pioneers were recently recognized with the prestigious Turing Award, Alley et al. built a language model for proteins, with amino acids as characters, based on natural language processing (NLP) techniques. NLP has not only revolutionized our computational understanding of language (think, for instance, of voice-to-text software) but has also been co-opted for exciting applications in synthetic biology.

The recurrent neural network (RNN; a type of neural network that can process sequential inputs such as text) used by Alley et al. was trained by iteratively predicting the next amino acid, given the preceding amino acids, for the 24 million protein sequences contained in the UniRef50 database. The RNN thus gathered implicit knowledge about the context of a given amino acid and about higher-level features such as secondary structure. The authors then averaged the representation of their RNN over every sequence position to yield a protein language representation they call UniRep. They extended UniRep by adding the representations at the final sequence position of their RNN, generating the more complete representation 'UniRep Fusion', which serves as an overview of the entire protein sequence (a minimal sketch of this recipe follows below).

UniRep Fusion was then used as input for a machine learning model to predict protein stability. Notably, this architecture was more accurate than Rosetta, the de facto state-of-the-art for predicting protein stability. Their protein language representation also allowed the authors to predict the relative brightness of 64,800 GFP mutants differing by as little as a single amino acid. Remarkably, the predicted relative brightness values correlated strongly with experimental observations (r = 0.98).

UniRep, as the representation of 24 million proteins, captures many phenomena of general importance for protein structure and function. These general features can be complemented by dataset-specific attributes when training on a subset of protein mutants or on de novo designed proteins. This approach could, for instance, be adopted for screening novel proteins generated by deep learning models. Analogous to proteins designed de novo with Rosetta, generating proteins through protein language models might be most advantageous for proteins with radically new functionalities, which are unlikely to be reached by incremental directed evolution. To arrive at this virtual world of protein engineering, though, more advances have to be made: training the large model for one epoch (seeing every protein sequence in UniRef50 once) required one week of GPU usage.
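To make that recipe concrete, here is a minimal, hypothetical sketch in PyTorch of the three ingredients described above: next-amino-acid prediction as the training objective, averaging of hidden states over positions to obtain a UniRep-style vector, and concatenation of the final-position state for a 'Fusion'-style representation. The actual model is a much larger multiplicative LSTM trained on all of UniRef50; the names used here (ProteinLM, encode, unirep) are illustrative inventions, not the authors' code.

```python
# Minimal sketch of a UniRep-style protein language model (hypothetical).
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START = len(AMINO_ACIDS)                 # start-of-sequence token
VOCAB = len(AMINO_ACIDS) + 1

class ProteinLM(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))   # (batch, seq, hidden)
        return self.head(hidden), hidden

def encode(seq):
    return torch.tensor([[START] + [AMINO_ACIDS.index(a) for a in seq]])

def train_step(model, optimizer, seq):
    tokens = encode(seq)
    logits, _ = model(tokens[:, :-1])              # predict residue i+1 from residues 0..i
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

@torch.no_grad()
def unirep(model, seq):
    _, hidden = model(encode(seq))
    avg = hidden.mean(dim=1)                          # UniRep: average over all positions
    fusion = torch.cat([avg, hidden[:, -1]], dim=-1)  # 'Fusion': append final-position state
    return avg.squeeze(0), fusion.squeeze(0)

model = ProteinLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
train_step(model, opt, "MKTAYIAKQR")       # one next-residue training step
rep, fused = unirep(model, "MKTAYIAKQR")   # fixed-length feature vectors for a
                                           # downstream stability regressor
```

The fixed-length vectors returned at the end play the role described in the text: inputs to a separate, small machine learning model (a "top model") trained on labeled stability or brightness data.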
Switching from the redundancy-ridden UniRef50 database (∼24 million sequences) to preUEP (2), a redundancy-reduced protein sequence database (∼8 million sequences), might enable faster training. This reductionist approach might also allow the 'vocabulary' of the model to be extended from single amino acids to larger protein fragments, capturing more structural properties. In general, a plethora of NLP techniques developed for written languages might prove useful in protein linguistics. One particularly promising concept is attention (3), the selective focus on sequence stretches far away from each other, which has dramatically improved language models (sketched below). Given that protein language may be considered one of the most natural languages by definition, modern NLP techniques could transform protein linguistics into a potent tool for the study as well as the engineering of proteins for the purposes of synthetic biology.
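As a rough illustration of that last idea, the following hypothetical sketch implements scaled dot-product self-attention over residue embeddings, the core operation behind attention-based language models: every sequence position computes relevance scores against every other position, so residues far apart in the primary sequence can exchange information directly. The dimensions and random inputs are placeholders, not values from the paper.

```python
# Hypothetical sketch of self-attention over residue embeddings.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model) residue embeddings; wq/wk/wv: (d_model, d_k) projections.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / k.shape[-1] ** 0.5   # pairwise relevance, any distance apart
    weights = F.softmax(scores, dim=-1)     # each row: a distribution over all positions
    return weights @ v                      # every output mixes the whole sequence

d_model, d_k, seq_len = 64, 32, 120
x = torch.randn(seq_len, d_model)           # stand-in for learned residue embeddings
wq, wk, wv = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, wq, wk, wv)         # shape: (120, 32)
```

Unlike an RNN, which must carry information about a distant residue through every intermediate hidden state, attention connects any two positions in a single step, which is why it is attractive for proteins whose contacts in 3D are often far apart in sequence.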