mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design

Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
arXiv - QuanBio - Quantitative Methods, published 2024-08-16. Identifier: arxiv-2408.09048 (https://doi.org/arxiv-2408.09048)

Abstract

Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and the coding sequence (CDS) as the input sequence. We adapt our LM-based approach specifically to mRNA by 1) accounting for the importance of location on the mRNA sequence through probabilistic masking, and 2) using Minimum Free Energy (MFE) prediction and Secondary Structure (SS) classification as additional pretext tasks. mRNA2vec demonstrates significant improvements on translation efficiency (TE) and expression level (EL) prediction tasks in the UTR compared to state-of-the-art (SOTA) methods such as UTR-LM. It also performs competitively on mRNA stability and protein production level tasks in the CDS against methods such as CodonBERT.
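As a rough illustration of two ideas named in the abstract, the following Python sketch shows location-dependent probabilistic masking and a data2vec-style teacher update by exponential moving average (EMA). The abstract does not give the actual weighting scheme, tokenization, or hyperparameters, so everything here is an assumption made for illustration: the function names, the choice to mask more heavily near the UTR-CDS boundary, the linear decay, the per-nucleotide tokens, the toy sequences, and the values of `base`, `peak`, `width`, and `tau`. It should not be read as the authors' implementation.

```python
import random


def masking_probabilities(n_tokens, boundary, base=0.05, peak=0.30, width=10.0):
    """Per-position masking probability (hypothetical scheme).

    Assumption: positions near the UTR-CDS boundary matter most, so the
    probability is `peak` at the boundary and decays linearly to `base`
    once a position is `width` or more tokens away.
    """
    probs = []
    for i in range(n_tokens):
        closeness = max(0.0, 1.0 - abs(i - boundary) / width)
        probs.append(base + (peak - base) * closeness)
    return probs


def sample_mask(probs, rng):
    """Draw an independent Bernoulli mask decision for each position."""
    return [rng.random() < p for p in probs]


def ema_update(teacher, student, tau=0.999):
    """Teacher weights as an exponential moving average of student weights.

    Weights are plain lists of floats here; a real model would apply the
    same rule to every parameter tensor.
    """
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    # Toy sequences, invented for this example only.
    utr = "GGGAAAUAAGAGAGAAAAGAAGAGUAAGAAGAAAUAUAAGAGCCACC"
    cds = "AUGGCUAGCAAAGGAGAAGAACUUUUCACUGGA"
    tokens = list(utr + cds)  # one token per nucleotide, for simplicity

    probs = masking_probabilities(len(tokens), boundary=len(utr))
    mask = sample_mask(probs, random.Random(0))
    # Show masked positions as underscores.
    print("".join("_" if m else t for t, m in zip(tokens, mask)))

    # One teacher update step on toy weights.
    print(ema_update([1.0, 2.0], [0.0, 0.0], tau=0.9))
```

In a teacher-student setup of this kind, the student sees the masked input and is trained to predict the teacher's representations of the unmasked input, while the teacher is never trained directly and only tracks the student through the EMA step.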