An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads

IF 2.2

IEEE journal on selected areas in information theory Pub Date : 2026-01-19 DOI:10.1109/JSAIT.2026.3655592

Lorenz Welter;Roman Sokolovskii;Thomas Heinis;Antonia Wachter-Zeh;Eirik Rosnes;Alexandre Graell i Amat

Volume 7, pages 17-32. Publication type: Journal Article. https://ieeexplore.ieee.org/document/11357937/

Citations: 0

Abstract

We consider error-correcting coding for deoxyribonucleic acid (DNA)-based storage using nanopore sequencing. We model the DNA storage channel as a sampling noise channel where the input data is chunked into $M$ short DNA strands, which are copied a random number of times, and the channel outputs a random selection of $N$ noisy DNA strands. The retrieved DNA reads are prone to strand-dependent insertion, deletion, and substitution (IDS) errors. We construct an index-based concatenated coding scheme, i.e., the concatenation of an outer code, an index code, and an inner code. We further propose a low-complexity (linear in $N$) maximum a posteriori probability decoder that takes into account the strand-dependent IDS errors and the randomness of the drawing to infer symbolwise a posteriori probabilities for the outer decoder. We present Monte-Carlo simulations for information-outage probabilities and frame error rates for different channel setups on experimental data. We finally evaluate the overall system performance using the read/write cost trade-off. A powerful combination of tailored channel modeling and soft information processing allows us to achieve excellent performance even with error-prone nanopore-sequenced reads, outperforming state-of-the-art schemes.
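The sampling noise channel described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' channel model: it assumes reads are drawn uniformly with replacement from the $M$ stored strands and that each strand has its own fixed, i.i.d. insertion, deletion, and substitution probabilities. The names `ids_channel`, `sample_reads`, and `error_profiles` are hypothetical and chosen only for illustration.

```python
import random

BASES = "ACGT"

def ids_channel(strand, p_ins, p_del, p_sub, rng):
    """Corrupt one strand with i.i.d. insertions, deletions, and substitutions.

    Assumption: a simplified memoryless error model, not the model
    fitted to nanopore data in the paper.
    """
    out = []
    for base in strand:
        # Insert zero or more random bases before the current base.
        while rng.random() < p_ins:
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:
            continue  # the base is deleted
        if r < p_del + p_sub:
            # Substitute with one of the three other bases.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)  # the base is transmitted correctly
    return "".join(out)

def sample_reads(strands, n_reads, error_profiles, rng):
    """Draw n_reads noisy reads from the M stored strands.

    Assumption: uniform drawing with replacement. error_profiles[i] is
    the (p_ins, p_del, p_sub) triple of strand i, which makes the errors
    strand-dependent. The returned strand index is ground truth kept for
    illustration only; a real decoder does not observe it.
    """
    reads = []
    for _ in range(n_reads):
        i = rng.randrange(len(strands))
        p_ins, p_del, p_sub = error_profiles[i]
        reads.append((i, ids_channel(strands[i], p_ins, p_del, p_sub, rng)))
    return reads

if __name__ == "__main__":
    rng = random.Random(2026)
    strands = ["ACGTACGTAC", "TTGACCGATA", "GGCATTACGA"]  # M = 3
    profiles = [(0.01, 0.02, 0.02), (0.03, 0.05, 0.04), (0.02, 0.03, 0.03)]
    for index, read in sample_reads(strands, 6, profiles, rng):  # N = 6
        print(index, read)
```

Because the draw is random, some strands may appear several times among the $N$ reads and others not at all, which is why the decoder has to account for the randomness of the drawing as well as for the IDS errors.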


Source journal

IEEE journal on selected areas in information theory

CiteScore

8.20

Self-citation rate

0.00%

Publication count