Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

Impact Factor: 2.3 · JCR Q2 (Engineering, Electrical & Electronic)

IEEE Transactions on Molecular, Biological, and Multi-Scale Communications · Pub Date: 2024-03-14 · DOI: 10.1109/TMBMC.2024.3400794

Canberk İrimağzı;Yusuf Uslan;Ahmed Hareedy

Volume 10, Issue 2, pp. 317-333 · https://ieeexplore.ieee.org/document/10530402/

Citations: 0

Abstract

From an information-theoretic perspective, DNA strands serve as a storage medium for 4-ary data over the alphabet $\{A,T,G,C\}$. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted because of error-prone data sequences as well as insertion, deletion, and substitution errors. Experiments have revealed that DNA sequences with long homopolymers and/or with low GC-content are notably more subject to errors upon storage. To address this biochemical challenge, constrained codes have been proposed for use in DNA data storage systems, and they are studied in the literature accordingly. This paper investigates the use of the recently introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage to improve performance. LOCO codes offer capacity-achievability, low complexity, and ease of reconfigurability. This paper introduces novel constrained codes, namely DNA LOCO (D-LOCO) codes, over the alphabet $\{A,T,G,C\}$ with limited runs of identical symbols. Due to their ordered structure, these codes come with an encoding-decoding rule we derive, which provides simple and affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are based on small-size adders, and therefore they are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balanced AT- and GC-content over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetected errors over a presumed symmetric DNA storage channel subject to substitution errors only. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates even at moderate lengths.
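To make the idea of an index-to-codeword rule over a lexicographically ordered, run-limited codebook more concrete, the following is a minimal Python sketch of generic enumerative coding. It is not the paper's D-LOCO rule (which the abstract describes as adder-based and which is not reproduced here). The symbol order A < T < G < C is an assumption taken from the order in which the abstract lists the alphabet, and all function names (`max_run`, `completions`, `count_sequences`, `unrank`, `rank`) are hypothetical. Bridging between consecutive codewords, balancing, and error detection are not modeled.

```python
from functools import lru_cache

# Assumed lexicographic symbol order (A < T < G < C); the abstract does not fix it.
ALPHABET = "ATGC"


def max_run(seq):
    """Length of the longest run of identical symbols (homopolymer) in seq."""
    best = run = 0
    prev = None
    for sym in seq:
        run = run + 1 if sym == prev else 1
        best = max(best, run)
        prev = sym
    return best


@lru_cache(maxsize=None)
def completions(remaining, run, max_len):
    """Count ways to append `remaining` symbols after a current run of
    length `run` (run >= 1) so that no run ever exceeds `max_len`."""
    if remaining == 0:
        return 1
    # Switching to any of the 3 other symbols starts a fresh run of length 1.
    total = 3 * completions(remaining - 1, 1, max_len)
    # Repeating the current symbol is allowed only while the run stays legal.
    if run < max_len:
        total += completions(remaining - 1, run + 1, max_len)
    return total


def count_sequences(m, max_len):
    """Number of length-m sequences whose runs are all at most max_len."""
    return 1 if m == 0 else 4 * completions(m - 1, 1, max_len)


def unrank(index, m, max_len):
    """Encoding direction: map an integer index to the index-th valid
    sequence of length m in lexicographic order."""
    seq, run = [], 0
    for pos in range(m):
        for sym in ALPHABET:
            new_run = run + 1 if seq and seq[-1] == sym else 1
            if new_run > max_len:
                continue  # this symbol would create a forbidden run
            block = completions(m - pos - 1, new_run, max_len)
            if index < block:
                seq.append(sym)
                run = new_run
                break
            index -= block  # skip every sequence that continues with sym
        else:
            raise ValueError("index out of range")
    return "".join(seq)


def rank(seq, max_len):
    """Decoding direction: map a valid sequence back to its index."""
    index, run, prev = 0, 0, None
    for pos, sym in enumerate(seq):
        remaining = len(seq) - pos - 1
        # Add the sizes of all blocks that start with a smaller symbol here.
        for smaller in ALPHABET[:ALPHABET.index(sym)]:
            new_run = run + 1 if smaller == prev else 1
            if new_run <= max_len:
                index += completions(remaining, new_run, max_len)
        run = run + 1 if sym == prev else 1
        if run > max_len:
            raise ValueError("sequence violates the run-length limit")
        prev = sym
    return index
```

For example, with length 3 and runs limited to 2, `count_sequences(3, 2)` gives 60 (all 64 sequences minus the four homopolymers AAA, TTT, GGG, CCC), `unrank(0, 3, 2)` gives `"AAT"`, and `rank` inverts `unrank` on every index. The sketch relies on arbitrary-precision integer arithmetic and memoized counts, so it illustrates the ordering idea only and makes no claim about the complexity or overhead of the paper's algorithms.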


Source journal

IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (Mathematics: Modeling and Simulation)

CiteScore: 3.90

Self-citation rate: 13.60%

Journal description: As a result of recent advances in MEMS/NEMS and systems biology, as well as the emergence of synthetic bacteria and lab/process-on-a-chip techniques, it is now possible to design chemical “circuits”, custom organisms, micro/nanoscale swarms of devices, and a host of other new systems. This success opens up a new frontier for interdisciplinary communications techniques using chemistry, biology, and other principles that have not been considered in the communications literature. The IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (T-MBMSC) is devoted to the principles, design, and analysis of communication systems that use physics beyond classical electromagnetism. This includes molecular, quantum, and other physical, chemical and biological techniques; as well as new communication techniques at small scales or across multiple scales (e.g., nano to micro to macro; note that strictly nanoscale systems, 1-100 nm, are outside the scope of this journal). Original research articles on one or more of the following topics are within scope: mathematical modeling, information/communication and network theoretic analysis, standardization and industrial applications, and analytical or experimental studies on communication processes or networks in biology. Contributions on related topics may also be considered for publication. Contributions from researchers outside the IEEE’s typical audience are encouraged.