VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li
arXiv - QuanBio - Genomics, published 2024-05-13 (doi: arxiv-2405.10812)
Abstract
Like natural language models, pre-trained genome language models have been proposed to capture the underlying intricacies within genomes through unsupervised sequence modeling, and they have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging a vector-quantized codebook as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To push its limits further, we propose Hierarchical Residual Quantization (HRQ), in which codebooks of varying scales are arranged in a hierarchy to enrich the genome vocabulary in a coarse-to-fine manner. Extensive experiments on 32 genome datasets demonstrate VQDNA's superiority and favorable parameter efficiency compared with existing genome language models. Notably, an empirical analysis of SARS-CoV-2 mutations reveals the fine-grained pattern awareness and biological significance of the learned HRQ vocabulary, highlighting its untapped potential for broader applications in genomics.
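The core idea of a vector-quantized vocabulary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-position encoder outputs and the 16-entry codebook below are random toy stand-ins, whereas in VQDNA the codebook embeddings are trained end-to-end with the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned vocabulary: 16 code embeddings of dimension 8.
# Random here for illustration; VQDNA learns these end-to-end.
codebook = rng.normal(size=(16, 8))

def vq_tokenize(embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Assign each embedding to its nearest codebook entry (squared L2).

    The returned integer ids play the role of pattern-aware tokens,
    replacing hand-crafted k-mer tokens with learned vocabulary indices.
    """
    # Pairwise squared distances between embeddings (N, D) and codes (K, D) -> (N, K)
    d = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# A toy batch of 5 per-position encoder outputs for a DNA sequence.
seq_embeddings = rng.normal(size=(5, 8))
tokens = vq_tokenize(seq_embeddings, codebook)
print(tokens.shape)  # (5,)
```

Each position is thus tokenized by whichever learned code best matches its embedding, which is what lets the vocabulary adapt to discriminative genomic patterns rather than being fixed in advance.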
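The coarse-to-fine idea behind HRQ can likewise be sketched with a simple two-level residual quantizer. This is an assumption-laden toy: the codebook sizes (8 coarse, 32 fine) and the random codes are illustrative, and the paper's HRQ learns its hierarchical codebooks jointly with the model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-level hierarchy: a small coarse codebook, and a larger
# fine codebook that quantizes whatever residual the coarse level leaves.
coarse = rng.normal(size=(8, 8))   # 8 coarse codes, dim 8
fine = rng.normal(size=(32, 8))    # 32 fine codes, dim 8

def hrq_quantize(x: np.ndarray, codebooks: list[np.ndarray]):
    """Greedy residual quantization: at each level, pick the nearest code,
    then pass the residual on to the next (finer) level."""
    ids, residual = [], x
    for cb in codebooks:
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        ids.append(idx)
        residual = residual - cb[idx]  # what the next level must explain
    recon = x - residual               # equals the sum of the selected codes
    return ids, recon

x = rng.normal(size=(4, 8))
ids, recon = hrq_quantize(x, [coarse, fine])
```

Here the coarse level captures broad structure and the fine level refines the leftover residual, so the reconstruction is the sum of one code per level, which is the coarse-to-fine enrichment of the vocabulary that HRQ builds on.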