CodonTransformer: a multispecies codon optimizer using context-aware neural networks

Adibvafa Fallahpour, Vincent Gureghian, Guillaume J. Filion, Ariel B. Lindner, Amir Pandi
{"title":"CodonTransformer: a multispecies codon optimizer using context-aware neural networks","authors":"Adibvafa Fallahpour, Vincent Gureghian, Guillaume J. Filion, Ariel B. Lindner, Amir Pandi","doi":"10.1101/2024.09.13.612903","DOIUrl":null,"url":null,"abstract":"The genetic code is degenerate allowing a multitude of possible DNA sequences to encode the same protein. This degeneracy impacts the efficiency of heterologous protein production due to the codon usage preferences of each organism. The process of tailoring organism-specific synonymous codons, known as codon optimization, must respect local sequence patterns that go beyond global codon preferences. As a result, the search space faces a combinatorial explosion that makes exhaustive exploration impossible. Nevertheless, throughout the diverse life on Earth, natural selection has already optimized the sequences, thereby providing a rich source of data allowing machine learning algorithms to explore the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life. The model demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers we used, and to a novel sequence representation that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with negative cis-regulatory elements. This work introduces a novel strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a state-of-the-art codon optimization framework with a customizable open-access model and a user-friendly interface.","PeriodicalId":501408,"journal":{"name":"bioRxiv - Synthetic Biology","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"bioRxiv - Synthetic Biology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.09.13.612903","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The genetic code is degenerate, allowing a multitude of possible DNA sequences to encode the same protein. This degeneracy impacts the efficiency of heterologous protein production due to the codon usage preferences of each organism. The process of tailoring organism-specific synonymous codons, known as codon optimization, must respect local sequence patterns that go beyond global codon preferences. As a result, the search space faces a combinatorial explosion that makes exhaustive exploration impossible. Nevertheless, throughout the diverse life on Earth, natural selection has already optimized the sequences, thereby providing a rich source of data allowing machine learning algorithms to explore the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life. The model demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers we used, and to a novel sequence representation that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimal negative cis-regulatory elements. This work introduces a novel strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a state-of-the-art codon optimization framework with a customizable open-access model and a user-friendly interface.
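
The abstract's central technical idea is a shared token representation that jointly encodes organism, amino acid, and codon at each position. The sketch below is a minimal illustration of what such an input encoding could look like: the function name `stream_tokens`, the `<ORG:...>` organism token, and the `AA_CODON` token spellings are illustrative assumptions, not the authors' actual vocabulary or code.

```python
# Illustrative sketch (not the authors' implementation) of a STREAM-style input
# encoding: each position is a joint amino-acid/codon token, preceded by an
# organism token. Token spellings ("M_ATG", "M_UNK", "<ORG:ecoli>") are
# hypothetical placeholders.

from typing import List, Optional


def stream_tokens(protein: str, dna: Optional[str], organism: str) -> List[str]:
    """Build joint amino-acid/codon tokens for one sequence.

    If `dna` is None (inference), the codon half of every token is masked,
    so a masked-language model could predict codons while always seeing the
    amino acid and the organism context.
    """
    tokens = [f"<ORG:{organism}>"]  # organism context token (hypothetical spelling)
    for i, aa in enumerate(protein):
        codon = dna[3 * i:3 * i + 3] if dna else "UNK"
        tokens.append(f"{aa}_{codon}")  # shared amino-acid/codon token
    return tokens


# Training-style input (codons visible) vs. inference input (codons masked):
print(stream_tokens("MKV", "ATGAAAGTT", "ecoli"))
# ['<ORG:ecoli>', 'M_ATG', 'K_AAA', 'V_GTT']
print(stream_tokens("MKV", None, "ecoli"))
# ['<ORG:ecoli>', 'M_UNK', 'K_UNK', 'V_UNK']
```

In this reading, masking only the codon half of each joint token lets the model condition on the full amino acid sequence and the organism token while filling in host-appropriate codons; the actual model, vocabulary, and interface are described in the paper and its open-access release.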