An Upper Bound on the Capacity of the DNA Storage Channel
Authors: A. Lenz, P. Siegel, A. Wachter-Zeh, Eitan Yaakobi
Venue: 2019 IEEE Information Theory Workshop (ITW)
Publication date: 2019-08-01
DOI: 10.1109/ITW44776.2019.8989388
URL: https://doi.org/10.1109/ITW44776.2019.8989388
Citations: 25
Abstract
Enabled by recent advances in sequencing and synthesis technologies, DNA has evolved into a competitive medium for long-term data storage. In this paper we conduct an information-theoretic study of the storage channel, the entity that formulates the relation between stored and sequenced strands. In particular, we derive an upper bound on the Shannon capacity of the channel. Our channel model incorporates the main attributes that characterize DNA-based data storage: information is synthesized on many short DNA strands, and each strand is copied many times. Due to the storage and sequencing methods, the receiver draws strands from the original sequences in an uncontrollable manner, so copies of the same sequence may be drawn multiple times. Additionally, due to imperfections, the obtained strands can be perturbed by errors. We show that for a large range of parameters, the channel decomposes into sub-channels from each input sequence to multiple output sequences, so-called clusters. The cluster sizes follow a Poisson distribution. Furthermore, the ordering of the sub-channels is unknown to the receiver. Our results can be used to guide future experiments on DNA-based data storage by giving an upper bound on the achievable rate of any error-correcting code. We further give a detailed discussion and intuitive interpretation of the channel, which provides insight into its nature and can inspire new ideas for error-correcting codes and decoding methods.
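The decomposed channel described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' model or code: each input strand produces a cluster whose size is Poisson distributed, each read in the cluster is perturbed independently, and the reads are shuffled so that the receiver does not know which cluster a read belongs to. The function names, the parameters `mean_cluster_size` and `error_prob`, and the choice of a symmetric substitution error model are all hypothetical; the abstract only states that strands "can be perturbed by errors".

```python
import math
import random

ALPHABET = "ACGT"


def sample_poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (suited to small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def substitute(rng, strand, error_prob):
    """Assumed error model: each base is replaced by a different base with probability error_prob."""
    out = []
    for base in strand:
        if rng.random() < error_prob:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_channel(strands, mean_cluster_size, error_prob, seed=0):
    """Return the shuffled reads and the (hidden) cluster size of each input strand."""
    rng = random.Random(seed)
    reads, cluster_sizes = [], []
    for strand in strands:
        # Cluster size: number of noisy copies of this strand seen by the receiver.
        size = sample_poisson(rng, mean_cluster_size)
        cluster_sizes.append(size)
        reads.extend(substitute(rng, strand, error_prob) for _ in range(size))
    # The receiver does not know the ordering of the sub-channels.
    rng.shuffle(reads)
    return reads, cluster_sizes


if __name__ == "__main__":
    stored = ["ACGTAC", "TTGACA", "GGCATT"]
    reads, sizes = simulate_channel(stored, mean_cluster_size=2.0, error_prob=0.05)
    print("cluster sizes (unknown to receiver):", sizes)
    print("received reads:", reads)
```

A cluster size of zero means the corresponding strand is never read, which is one of the effects any code for this channel has to tolerate.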