Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery
Zhiyuan Peng, Yuanbo Tang, Yang Li
arXiv - QuanBio - Genomics, published 2024-07-06
DOI: arxiv-2407.12051 (https://doi.org/arxiv-2407.12051)
Abstract
DNA sequences encode vital genetic and biological information, yet these
variable-length sequences cannot serve directly as input to common data mining
algorithms. Hence, various representation schemes have been developed to
transform DNA sequences into fixed-length numerical representations. However,
these schemes face difficulties in learning high-quality representations due to
the complexity and sparsity of DNA data. Additionally, DNA sequences are
inherently noisy because of mutations. While several effective schemes have
been proposed, they often lack semantic structure, making it
difficult for biologists to validate and leverage the results. To address these
challenges, we propose Dy-mer, an explainable and robust DNA
representation scheme based on sparse recovery. Leveraging the underlying
semantic structure of DNA, we modify traditional sparse recovery to capture
recurring patterns indicative of biological functions by representing frequent
K-mers as basis vectors and reconstructing each DNA sequence through simple
concatenation. Experimental results demonstrate that Dy-mer achieves
state-of-the-art performance in DNA promoter classification, yielding a
remarkable 13% increase in accuracy. Moreover, its inherent
explainability facilitates DNA clustering and motif detection, enhancing its
utility in biological research.
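
The idea of representing frequent K-mers as basis vectors and rebuilding a sequence by concatenation can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify how the dictionary is learned or how the sparse code is solved, so the code below swaps in a simple frequency-ranked dictionary and a greedy longest-match encoder. The function names (`build_dictionary`, `encode`, `to_vector`, `reconstruct`) and the parameters `k_min`, `k_max`, and `size` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_dictionary(seqs, k_min=3, k_max=6, size=8):
    """Pick the `size` most frequent k-mers as atoms (an assumed selection rule)."""
    counts = Counter()
    for s in seqs:
        for k in range(k_min, k_max + 1):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    # Rank by frequency, then prefer longer k-mers, then alphabetical order
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda m: (-counts[m], -len(m), m))
    atoms = ranked[:size]
    # Single bases act as fallback atoms so every A/C/G/T sequence is encodable.
    atoms += [b for b in "ACGT" if b not in atoms]
    return atoms


def encode(seq, atoms):
    """Greedy left-to-right longest-match segmentation (a stand-in for sparse recovery)."""
    by_len = sorted(range(len(atoms)), key=lambda j: -len(atoms[j]))
    picks, i = [], 0
    while i < len(seq):
        for j in by_len:
            if seq.startswith(atoms[j], i):
                picks.append(j)
                i += len(atoms[j])
                break
        else:
            raise ValueError(f"no atom matches at position {i}: {seq[i]!r}")
    return picks


def to_vector(picks, n_atoms):
    """Fixed-length usage-count vector: one entry per atom, mostly zeros."""
    vec = [0] * n_atoms
    for j in picks:
        vec[j] += 1
    return vec


def reconstruct(picks, atoms):
    """Rebuild the sequence by concatenating the chosen atoms in order."""
    return "".join(atoms[j] for j in picks)


if __name__ == "__main__":
    corpus = ["TATAAAGGC", "GGTATAAAC", "CCTATAAAT"]  # toy data, not from the paper
    atoms = build_dictionary(corpus)
    picks = encode("ACTATAAAGG", atoms)
    print([atoms[j] for j in picks])
    print(to_vector(picks, len(atoms)))
```

Because each nonzero entry of the vector corresponds to a named K-mer, a reader can inspect which recurring patterns a sequence uses, which is the kind of explainability the abstract describes. A greedy encoder does not in general find the sparsest segmentation; it is used here only to keep the sketch short.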