Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery

Zhiyuan Peng, Yuanbo Tang, Yang Li
{"title":"Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery","authors":"Zhiyuan Peng, Yuanbo Tang, Yang Li","doi":"arxiv-2407.12051","DOIUrl":null,"url":null,"abstract":"DNA sequences encode vital genetic and biological information, yet these\nunfixed-length sequences cannot serve as the input of common data mining\nalgorithms. Hence, various representation schemes have been developed to\ntransform DNA sequences into fixed-length numerical representations. However,\nthese schemes face difficulties in learning high-quality representations due to\nthe complexity and sparsity of DNA data. Additionally, DNA sequences are\ninherently noisy because of mutations. While several schemes have been proposed\nfor their effectiveness, they often lack semantic structure, making it\ndifficult for biologists to validate and leverage the results. To address these\nchallenges, we propose \\textbf{Dy-mer}, an explainable and robust DNA\nrepresentation scheme based on sparse recovery. Leveraging the underlying\nsemantic structure of DNA, we modify the traditional sparse recovery to capture\nrecurring patterns indicative of biological functions by representing frequent\nK-mers as basis vectors and reconstructing each DNA sequence through simple\nconcatenation. Experimental results demonstrate that \\textbf{Dy-mer} achieves\nstate-of-the-art performance in DNA promoter classification, yielding a\nremarkable \\textbf{13\\%} increase in accuracy. Moreover, its inherent\nexplainability facilitates DNA clustering and motif detection, enhancing its\nutility in biological research.","PeriodicalId":501070,"journal":{"name":"arXiv - QuanBio - Genomics","volume":"14 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Genomics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.12051","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

DNA sequences encode vital genetic and biological information, yet these unfixed-length sequences cannot serve as the input of common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes face difficulties in learning high-quality representations due to the complexity and sparsity of DNA data. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have been proposed for their effectiveness, they often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose \textbf{Dy-mer}, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify the traditional sparse recovery to capture recurring patterns indicative of biological functions by representing frequent K-mers as basis vectors and reconstructing each DNA sequence through simple concatenation. Experimental results demonstrate that \textbf{Dy-mer} achieves state-of-the-art performance in DNA promoter classification, yielding a remarkable \textbf{13\%} increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
Dy-mer:使用稀疏恢复的可解释 DNA 序列表示方案
DNA 序列编码着重要的遗传和生物信息,但这些长度不固定的序列无法作为普通数据挖掘算法的输入。因此,人们开发了各种表示方案,将 DNA 序列转换为固定长度的数字表示。然而,由于 DNA 数据的复杂性和稀疏性,这些方案在学习高质量表示时面临困难。此外,由于突变,DNA 序列本身就存在噪声。虽然已经提出了几种有效的方案,但它们往往缺乏语义结构,使得生物学家难以验证和利用这些结果。为了应对这些挑战,我们提出了一种基于稀疏恢复的可解释且稳健的 DNA 表示方案--textbf{Dy-mer}。利用DNA的基本语义结构,我们修改了传统的稀疏恢复方法,将频繁出现的K-mers表示为基向量,并通过简单的合并重建每个DNA序列,从而捕捉到表明生物功能的重复出现的模式。实验结果表明,textbf{Dy-mer}在DNA启动子分类中达到了最先进的性能,准确率提高了13%。此外,其固有的可解释性也为DNA聚类和主题检测提供了便利,提高了其在生物学研究中的实用性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信