Tokenization and deep learning architectures in genomics: A comprehensive review.

IF 4.1 | Zone 2 (Biology) | JCR Q2, Biochemistry & Molecular Biology

Computational and Structural Biotechnology Journal. Pub Date: 2025-07-28. eCollection Date: 2025-01-01. DOI: 10.1016/j.csbj.2025.07.038

Conrad Testagrose, Christina Boucher

Volume 27, pages 3547-3555. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356405/pdf/


Abstract

The development of modern DNA sequencing technologies has resulted in the rapid growth of genomic data. Alongside the collection of this data, there is an increasing need for the development of modern computational tools leveraging this data for tasks including but not limited to antimicrobial resistance and gene annotation. Current deep learning architectures and tokenization techniques have been explored for the extraction of meaningful underlying information contained within this sequencing data. We aim to survey current and foundational literature surrounding the area of deep learning architectures and tokenization techniques in the field of genomics. Our survey of the literature outlines that significant work remains in developing efficient tokenization techniques that can capture or model underlying motifs within DNA sequences. While deep learning models have become more efficient, many current tokenization methods either reduce scalability through naive sequence representation, incorrectly model motifs or are borrowed directly from NLP tasks for use with biological sequences. Current and future model architectures should seek to implement and support more advanced, and biologically relevant, tokenization techniques to more effectively model the underlying information in biological sequencing data.
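To make the scalability point concrete, here is a minimal sketch of two simple ways a DNA sequence can be split into tokens. The abstract does not name specific schemes, so single-nucleotide and k-mer tokenization are used here only as common illustrations; the function names `char_tokens` and `kmer_tokens` and the example sequence are hypothetical and do not come from the paper.

```python
from typing import List


def char_tokens(seq: str) -> List[str]:
    """Naive representation: one token per nucleotide."""
    return list(seq.upper())


def kmer_tokens(seq: str, k: int, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping
    k-mers. A trailing fragment shorter than k is dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    seq = "ATGCGTAC"  # illustrative sequence, not real data
    print(len(char_tokens(seq)), char_tokens(seq))
    print(len(kmer_tokens(seq, k=3)), kmer_tokens(seq, k=3))
    print(len(kmer_tokens(seq, k=3, stride=3)), kmer_tokens(seq, k=3, stride=3))
```

In this sketch the single-nucleotide scheme yields as many tokens as there are bases, so model input length grows directly with sequence length. Non-overlapping k-mers shorten the input but cut the sequence at fixed offsets, so a motif that straddles a boundary is split across tokens.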




Source journal: Computational and Structural Biotechnology Journal (Biochemistry, Genetics and Molecular Biology-Biophysics)
CiteScore: 9.30
Self-citation rate: 3.30%
Articles published: 540
Review time: 6 weeks

About the journal: Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a prerequisite for publication in the journal. Specific areas of interest include, but are not limited to: Structure and function of proteins, nucleic acids and other macromolecules; Structure and function of multi-component complexes; Protein folding, processing and degradation; Enzymology; Computational and structural studies of plant systems; Microbial Informatics; Genomics; Proteomics; Metabolomics; Algorithms and Hypothesis in Bioinformatics; Mathematical and Theoretical Biology; Computational Chemistry and Drug Discovery; Microscopy and Molecular Imaging; Nanotechnology; Systems and Synthetic Biology.