Lorenz Welter;Roman Sokolovskii;Thomas Heinis;Antonia Wachter-Zeh;Eirik Rosnes;Alexandre Graell i Amat
IEEE Journal on Selected Areas in Information Theory, vol. 7, pp. 17-32, published 2026-01-19. DOI: 10.1109/JSAIT.2026.3655592. https://ieeexplore.ieee.org/document/11357937/
An End-to-End Coding Scheme for DNA-Based Data Storage With Nanopore-Sequenced Reads
We consider error-correcting coding for deoxyribonucleic acid (DNA)-based storage using nanopore sequencing. We model the DNA storage channel as a sampling noise channel: the input data is chunked into $M$ short DNA strands, each strand is copied a random number of times, and the channel outputs a random selection of $N$ noisy DNA strands. The retrieved DNA reads are prone to strand-dependent insertion, deletion, and substitution (IDS) errors. We construct an index-based concatenated coding scheme, i.e., the concatenation of an outer code, an index code, and an inner code. We further propose a low-complexity (linear in $N$) maximum a posteriori probability decoder that takes into account the strand-dependent IDS errors and the randomness of the drawing to infer symbolwise a posteriori probabilities for the outer decoder. We present Monte Carlo simulations of information-outage probabilities and frame error rates for different channel setups on experimental data. Finally, we evaluate the overall system performance using the read/write cost trade-off. Combining tailored channel modeling with soft-information processing allows us to outperform state-of-the-art schemes even with error-prone nanopore-sequenced reads.
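The sampling noise channel described in the abstract can be illustrated with a toy simulation. The following is a minimal sketch under simplifying assumptions that are not taken from the paper: reads are drawn uniformly with replacement from the $M$ stored strands, each strand has its own fixed insertion, deletion, and substitution probabilities, and errors are independent across positions. The names `ids_corrupt`, `sample_reads`, and `error_profiles` are hypothetical and chosen only for illustration.

```python
import random

ALPHABET = "ACGT"


def ids_corrupt(strand, p_ins, p_del, p_sub, rng):
    """Pass one strand through a memoryless IDS channel (an assumed toy model)."""
    out = []
    for base in strand:
        # Insertion: emit random bases before the current input base is consumed.
        while rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        u = rng.random()
        if u < p_del:
            # Deletion: the input base is dropped.
            continue
        if u < p_del + p_sub:
            # Substitution: replace the base with a different one.
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            # Correct transmission.
            out.append(base)
    return "".join(out)


def sample_reads(strands, n_reads, error_profiles, rng):
    """Draw n_reads strands uniformly with replacement and corrupt each draw.

    error_profiles[i] = (p_ins, p_del, p_sub) for strand i, which is one
    assumed way to make the IDS errors strand-dependent.
    Returns (strand_index, noisy_read) pairs; a real decoder would not
    observe strand_index and would have to infer it from the index code.
    """
    reads = []
    for _ in range(n_reads):
        idx = rng.randrange(len(strands))
        p_ins, p_del, p_sub = error_profiles[idx]
        reads.append((idx, ids_corrupt(strands[idx], p_ins, p_del, p_sub, rng)))
    return reads


if __name__ == "__main__":
    rng = random.Random(2026)
    M, N, length = 4, 10, 20
    strands = ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(M)]
    # Hypothetical per-strand error rates, not values from the paper.
    profiles = [(0.01 * (i + 1), 0.02 * (i + 1), 0.01 * (i + 1)) for i in range(M)]
    for idx, read in sample_reads(strands, N, profiles, rng):
        print(idx, read)
```

Because the draw is random, some strands may appear several times among the $N$ reads and others not at all, which is why the decoder has to account for the randomness of the drawing as well as for the IDS errors.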