On Codes for the Noisy Substring Channel
Authors: Yonatan Yehezkeally; Nikita Polyanskii
Journal: IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, vol. 10, no. 2, pp. 368-381 (JCR Q2, Engineering, Electrical & Electronic; impact factor 2.4)
Published: 2024-03-27
DOI: 10.1109/TMBMC.2024.3382499
Link: https://ieeexplore.ieee.org/document/10480728/
Citations: 0
Abstract
We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Due to existing DNA sequencing techniques and applications in DNA-based storage systems, interest in this channel has renewed in recent years. In contrast to existing literature, we consider a noisy channel model where information is subject to noise before its substrings are sampled, motivated by in-vivo storage. We study two separate noise models, substitutions or deletions. In both cases, we examine families of codes which may be utilized for error-correction and present combinatorial bounds on their sizes. Through a generalization of the concept of repeat-free strings, we show that the added required redundancy due to this imperfect observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases. Moreover, we develop an efficient encoder for such constrained strings in some cases. Finally, we show how a similar encoder can be used to avoid formation of secondary-structures in coded DNA strands, even when accounting for imperfect structures.
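As a rough illustration of the objects the abstract refers to, the Python sketch below computes the multiset of fixed-length substrings through which a string is observed and checks whether the string is repeat-free at that length. The function names, the example string, and the Hamming-distance helper are assumptions made for illustration; in particular, the helper is only one plausible way to quantify how far a string is from containing a repeat under substitution noise, and it is not taken from the paper's definitions.

```python
from collections import Counter
from itertools import combinations


def substring_multiset(s: str, length: int) -> Counter:
    """Multiset of all contiguous substrings of `s` with the given length."""
    return Counter(s[i:i + length] for i in range(len(s) - length + 1))


def is_repeat_free(s: str, length: int) -> bool:
    """True if no substring of the given length occurs more than once in `s`."""
    return all(count == 1 for count in substring_multiset(s, length).values())


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(s: str, length: int) -> int:
    """Smallest Hamming distance between substrings at two distinct positions.

    Hypothetical helper (an assumption, not the paper's definition): a value
    of 0 means the string has an exact repeat; larger values mean more
    substitutions are needed before two observed substrings could coincide.
    """
    subs = [s[i:i + length] for i in range(len(s) - length + 1)]
    return min((hamming(a, b) for a, b in combinations(subs, 2)), default=length)


if __name__ == "__main__":
    example = "ACGTACGA"  # made-up strand, for illustration only
    print(substring_multiset(example, 3))
    print(is_repeat_free(example, 3), is_repeat_free(example, 5))
    print(min_pairwise_hamming(example, 5))
```

In this toy example the substring ACG occurs twice at length 3, so the string is not repeat-free at that length, whereas all of its length-5 substrings are distinct.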
About the journal:
As a result of recent advances in MEMS/NEMS and systems biology, as well as the emergence of synthetic bacteria and lab/process-on-a-chip techniques, it is now possible to design chemical “circuits”, custom organisms, micro/nanoscale swarms of devices, and a host of other new systems. This success opens up a new frontier for interdisciplinary communications techniques using chemistry, biology, and other principles that have not been considered in the communications literature. The IEEE Transactions on Molecular, Biological, and Multi-Scale Communications (T-MBMSC) is devoted to the principles, design, and analysis of communication systems that use physics beyond classical electromagnetism. This includes molecular, quantum, and other physical, chemical and biological techniques; as well as new communication techniques at small scales or across multiple scales (e.g., nano to micro to macro; note that strictly nanoscale systems, 1-100 nm, are outside the scope of this journal). Original research articles on one or more of the following topics are within scope: mathematical modeling, information/communication and network theoretic analysis, standardization and industrial applications, and analytical or experimental studies on communication processes or networks in biology. Contributions on related topics may also be considered for publication. Contributions from researchers outside the IEEE’s typical audience are encouraged.