DNA palette code for time-series archival data storage
Authors: Zihui Yan, Haoran Zhang, Boyuan Lu, Tong Han, Xiaoguang Tong, Yingjin Yuan
Journal: National Science Review (JCR Q1, Multidisciplinary Sciences; impact factor 16.3)
Published: 2024-09-09
DOI: 10.1093/nsr/nwae321 (https://doi.org/10.1093/nsr/nwae321)
Abstract
Preserving large volumes of infrequently accessed cold data over the long term is a challenge for the storage community. Deoxyribonucleic acid (DNA) is considered a promising medium because of its inherent physical stability and high storage density. Information density and decoding sequence coverage are two key metrics that determine the efficiency of DNA data storage. In this study, we propose a novel coding scheme, the DNA Palette code, which is suited to cold data, especially time-series archival datasets. These datasets are rarely accessed but require reliable long-term storage for retrospective research. The DNA Palette code uses unordered combinations of index-free oligonucleotides (oligos) to represent binary information. It achieves encoding with high net information density and lossless decoding at low sequencing coverage. When sequencing reads are corrupted, it can still recover partial information, which prevents complete failure of file retrieval. In vivo testing of clinical brain magnetic resonance imaging (MRI) data storage, together with simulation validations on large-scale public MRI datasets (10 GB), planetary science datasets, and meteorological datasets, demonstrates the advantages of our coding scheme: high information density, low decoding sequence coverage, and wide applicability.
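The idea of representing binary information with an unordered, index-free combination of oligos can be illustrated with a generic technique, the combinatorial number system, which maps each integer to a unique k-element subset of a fixed palette. The sketch below is only an illustration under stated assumptions and is not the paper's actual construction: the toy palette sequences, the subset size `K`, and the function names `encode` and `decode` are all hypothetical.

```python
from math import comb, floor, log2

# Hypothetical toy palette of 8 index-free oligos (not taken from the paper).
PALETTE = ["ACGTACGT", "TGCATGCA", "GATTACAG", "CCTAGGAT",
           "AGGCTTCA", "TCAGGACT", "GTCCATGA", "CAATGGTC"]
K = 3  # assumed number of oligos per unordered combination

# Each K-subset of the palette can carry floor(log2(C(n, K))) whole bits.
BITS_PER_SET = floor(log2(comb(len(PALETTE), K)))  # C(8, 3) = 56, so 5 bits


def encode(value: int) -> frozenset:
    """Map an integer to a K-subset of PALETTE (combinatorial number system)."""
    if not 0 <= value < comb(len(PALETTE), K):
        raise ValueError("value does not fit in one combination")
    chosen = []
    for i in range(K, 0, -1):
        c = i - 1  # comb(i - 1, i) == 0, so this always satisfies the bound
        while comb(c + 1, i) <= value:
            c += 1  # find the largest c with comb(c, i) <= value
        chosen.append(c)
        value -= comb(c, i)
    return frozenset(PALETTE[c] for c in chosen)


def decode(reads) -> int:
    """Recover the integer from reads in any order, duplicates allowed."""
    index = {oligo: i for i, oligo in enumerate(PALETTE)}
    positions = sorted({index[r] for r in reads})  # order and copy count ignored
    if len(positions) != K:
        raise ValueError("expected exactly K distinct palette oligos")
    return sum(comb(c, i) for i, c in enumerate(positions, start=1))
```

Because decoding depends only on which oligos are present, neither the order of the reads nor the number of copies of each oligo affects the result, so no per-oligo index field is needed in this toy scheme. Error handling for corrupted reads, which the abstract mentions, is outside the scope of this sketch.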
About the journal:
National Science Review (NSR; ISSN abbreviation: Natl. Sci. Rev.) is an English-language, peer-reviewed, multidisciplinary, open-access scientific journal published by Oxford University Press under the auspices of the Chinese Academy of Sciences. According to Journal Citation Reports, its 2021 impact factor was 23.178.
National Science Review publishes both review articles and perspectives as well as original research in the form of brief communications and research articles.