mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design
Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai
arXiv - QuanBio - Quantitative Methods, 2024-08-16, arxiv-2408.09048
Abstract
Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new
drugs and revolutionizing the pharmaceutical industry. However, selecting
particular mRNA sequences for vaccines and therapeutics from extensive mRNA
libraries is costly. Effective mRNA therapeutics require carefully designed
sequences with optimized expression levels and stability. This paper proposes a
novel contextual language model (LM)-based embedding method: mRNA2vec. In
contrast to existing mRNA embedding approaches, our method is based on the
self-supervised teacher-student learning framework of data2vec. We jointly use
the 5' untranslated region (UTR) and coding sequence (CDS) region as the input
sequences. We adapt our LM-based approach specifically to mRNA by (1)
considering the importance of location on the mRNA sequence through probabilistic
masking and (2) using Minimum Free Energy (MFE) prediction and Secondary Structure
(SS) classification as additional pretext tasks. mRNA2vec demonstrates
significant improvements on translation efficiency (TE) and expression level
(EL) prediction tasks in the UTR compared to state-of-the-art (SOTA) methods
such as UTR-LM. It also performs competitively on mRNA stability and protein
production level tasks in the CDS compared with methods such as CodonBERT.
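
As a rough illustration of two ideas named in the abstract, the sketch below shows location-dependent probabilistic masking and the exponential-moving-average (EMA) teacher update used by data2vec-style teacher-student training. The abstract does not give the masking distribution, so the Gaussian bump centred on the UTR/CDS junction is an assumption, as are all function names, parameter names, and default values here; this is not the authors' implementation.

```python
import math
import random


def mask_probabilities(seq_len, junction, base=0.1, peak=0.4, width=10.0):
    """Per-position masking probability.

    Assumption (not from the paper): a base rate plus a Gaussian bump
    centred on a hypothetical UTR/CDS junction index, so tokens near the
    junction are masked more often than distant ones.
    """
    return [
        base + (peak - base) * math.exp(-((i - junction) ** 2) / (2 * width ** 2))
        for i in range(seq_len)
    ]


def sample_mask(probs, rng):
    """Draw an independent Bernoulli mask decision for every position."""
    return [rng.random() < p for p in probs]


def ema_update(teacher, student, tau=0.999):
    """Teacher weights track the student as an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same mask
    probs = mask_probabilities(seq_len=60, junction=30)
    mask = sample_mask(probs, rng)
    print(sum(mask), "of", len(mask), "positions masked")
    print(ema_update([1.0, 0.0], [0.0, 1.0], tau=0.9))
```

In a full training loop the student would see the masked sequence and regress onto the teacher's representations of the unmasked sequence, with the MFE and SS objectives added as extra loss terms; those parts are omitted here because the abstract does not describe them in enough detail.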