Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision.

ArXiv Pub Date : 2024-08-22

Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal

{"title":"Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision.","authors":"Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli, Nadia A Lanman, Vaneet Aggarwal","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promotors and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.</p>","PeriodicalId":93888,"journal":{"name":"ArXiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10896356/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promotors and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.

本刊更多论文

利用具有字节级精度的编码器-解码器基础模型理解 DNA 的自然语言。

本文介绍了组合核苷酸字节级编码器-解码器（ENBED）基础模型，利用编码器-解码器变换器架构分析字节级精度的 DNA 序列。ENBED 使用注意力的亚二次方实现，开发出一种能够进行序列到序列转换的高效模型，从而推广了之前仅使用编码器或仅使用解码器架构的基因组模型。我们使用掩码语言建模（Masked Language Modeling）技术，利用参考基因组序列对基础模型进行预训练，并将其应用于以下下游任务：(1)识别增强子、启动子和剪接位点；(2)识别包含碱基调用错配和插入/删除错误的序列，这比涉及多个碱基对的标记化方案更有优势，因为后者失去了以字节级精度进行分析的能力；(3)识别基因组序列的生物功能注释；(4)使用编码器-解码器架构生成流感病毒的突变，并根据现实世界的观察结果对其进行验证。在上述每项任务中，我们都展示了与现有先进成果相比的显著改进。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

ArXiv

自引率

0.00%

发文量