Dy-mer: An Explainable DNA Sequence Representation Scheme using Sparse Recovery

arXiv - QuanBio - Genomics | Pub Date: 2024-07-06 | DOI: arxiv-2407.12051

Zhiyuan Peng, Yuanbo Tang, Yang Li



Abstract

DNA sequences encode vital genetic and biological information, yet because their lengths vary, they cannot serve directly as input to common data mining algorithms. Hence, various representation schemes have been developed to transform DNA sequences into fixed-length numerical representations. However, these schemes struggle to learn high-quality representations because DNA data is complex and sparse. Additionally, DNA sequences are inherently noisy because of mutations. While several schemes have proven effective, their representations often lack semantic structure, making it difficult for biologists to validate and leverage the results. To address these challenges, we propose Dy-mer, an explainable and robust DNA representation scheme based on sparse recovery. Leveraging the underlying semantic structure of DNA, we modify traditional sparse recovery to capture recurring patterns indicative of biological functions: frequent K-mers serve as basis vectors, and each DNA sequence is reconstructed through simple concatenation. Experimental results demonstrate that Dy-mer achieves state-of-the-art performance in DNA promoter classification, yielding a 13% increase in accuracy. Moreover, its inherent explainability facilitates DNA clustering and motif detection, enhancing its utility in biological research.
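To make the idea of "frequent K-mers as basis vectors" and "reconstruction through concatenation" concrete, here is a toy sketch. It is not the authors' algorithm, which the abstract describes only at a high level. The function names `build_dictionary` and `encode`, the fixed k, the frequency-based dictionary, and the greedy left-to-right matching are all assumptions made for illustration.

```python
from collections import Counter


def build_dictionary(sequences, k=3, top_n=4):
    """Return the top_n most frequent k-mers across all sequences (toy dictionary)."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [kmer for kmer, _ in ranked[:top_n]]


def encode(seq, dictionary, k=3):
    """Greedily segment seq into dictionary k-mers (an assumed, simplified scheme).

    Returns (vector, segments):
      vector[j] counts how often dictionary[j] was used, so its length is fixed
      by the dictionary size, not by the sequence length;
      segments are the pieces whose concatenation gives back seq.
    """
    index = {kmer: j for j, kmer in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    segments = []
    i = 0
    while i < len(seq):
        piece = seq[i:i + k]
        if piece in index:
            vector[index[piece]] += 1
            segments.append(piece)
            i += k
        else:
            # No dictionary k-mer starts here: keep the base as its own segment.
            segments.append(seq[i])
            i += 1
    return vector, segments


if __name__ == "__main__":
    seqs = ["TATAAT", "TATAGC"]
    dictionary = build_dictionary(seqs, k=3, top_n=2)
    vector, segments = encode("TATAAT", dictionary, k=3)
    print(dictionary, vector, segments)
```

Each nonzero entry of the vector names a specific k-mer that was used to rebuild the sequence, which is the sense in which a representation of this kind can be inspected by a biologist. Under these assumptions the vector is also sparse whenever the dictionary is much larger than the number of segments in a sequence.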


