Efficient and low-complexity variable-to-variable length coding for DNA storage.

Impact factor 2.9 · Zone 3 (Biology) · JCR Q2 (Biochemical Research Methods)
Yunfei Gao, Albert No
BMC Bioinformatics, vol. 25, no. 1, article 320. Published 2024-10-01. Journal Article.
DOI: https://doi.org/10.1186/s12859-024-05943-y
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446080/pdf/

Abstract

Background: Efficient DNA-based storage systems offer substantial capacity and longevity at reduced cost, addressing anticipated data growth. However, encoding data into DNA sequences is subject to two key constraints: 1) at most h consecutive identical bases (homopolymer constraint h), and 2) a GC ratio within $[0.5 - c_{GC},\ 0.5 + c_{GC}]$ (GC content constraint $c_{GC}$). Sequencing and synthesis errors tend to increase when these constraints are violated.
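The two constraints can be made concrete with a small checker. This is a minimal sketch, not the authors' code: the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) are hypothetical, and it assumes the sequence is a plain string over the alphabet A, C, G, T.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases (0 if empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def gc_ratio(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """True if seq has no run longer than h and a GC ratio in [0.5 - c_gc, 0.5 + c_gc]."""
    # The small tolerance (an assumption of this sketch) guards against
    # floating-point rounding at the interval boundary.
    within_gc = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return max_homopolymer_run(seq) <= h and within_gc


if __name__ == "__main__":
    print(satisfies_constraints("ACGTACGT", h=4, c_gc=0.05))  # balanced, no long runs
    print(satisfies_constraints("AAAAACGT", h=4, c_gc=0.05))  # run of five A's
```

A checker like this only validates an already-encoded sequence; it says nothing about how an encoder would produce sequences that pass.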

Results: We address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that satisfies these constraints, keeps complexity linear as the block length increases, and achieves near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, with $h = 4$ and $c_{GC} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code is available on GitHub.
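For a sense of scale, the asymptotic capacity under the homopolymer constraint alone can be computed from a standard counting argument: each maximal run has length 1 to h, and each new run has 3 possible bases, so the growth rate $\lambda$ of the number of valid sequences solves $\lambda^h = 3\sum_{j=0}^{h-1}\lambda^j$. The sketch below is an illustration under that assumption only. It ignores the GC content constraint, so it is not the calculation behind the 1.990 figure quoted in the abstract, and the function name `homopolymer_capacity` is hypothetical.

```python
import math


def homopolymer_capacity(h: int) -> float:
    """Bits per base for quaternary sequences whose runs have length at most h.

    Solves x**h = 3 * (x**(h-1) + ... + x + 1) for its single positive root
    by bisection on [3, 4], then returns log2 of that root.
    """
    if h < 1:
        raise ValueError("h must be at least 1")

    def f(x: float) -> float:
        return x**h - 3 * sum(x**j for j in range(h))

    lo, hi = 3.0, 4.0  # f(3) <= 0 and f(4) = 1 > 0 for every h >= 1
    for _ in range(100):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return math.log2((lo + hi) / 2)


if __name__ == "__main__":
    for h in range(1, 6):
        print(h, round(homopolymer_capacity(h), 4))
```

For h = 1 the value is log2(3), since each base only has to differ from its predecessor, and the value approaches 2 bits per base as h grows.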

Conclusion: We propose a variable-to-variable-length encoding method that achieves near-optimal rates without relying on the concatenation of short predefined sequences.

Source journal: BMC Bioinformatics (Biology: Biochemical Research Methods)
CiteScore: 5.70
Self-citation rate: 3.30%
Articles published: 506
Review time: 4.3 months
About the journal: BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology. BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.