Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage

Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong

arXiv:2408.00779 · arXiv - QuanBio - Biomolecules · Published 2024-07-17
Abstract
In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model that learns representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, RSRL draws on both error-correction coding and structural biology. Specifically, RSRL first learns representations for subsequent storage from binary data transformed by a Reed-Solomon codec. These representations are then masked by an RS-code-informed mask so that training focuses on correcting the burst errors that arise during learning. Using the decoded, error-corrected representations, a novel biologically stabilized loss is formulated to regularize the representations toward stable single-stranded structures. By combining these strategies, RSRL learns highly durable, dense, and lossless representations for subsequent storage in DNA sequences. RSRL has been compared with a number of strong baselines on real-world multi-modal data-storage tasks. The experimental results demonstrate that RSRL stores diverse types of data with much higher information density and durability, and with much lower error rates.
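
The abstract names three ingredients without specifying their implementation: Reed-Solomon-coded input data, an RS-code-informed mask, and a biologically stabilized (structural-stability) objective. The sketch below is a minimal, hypothetical illustration of what each ingredient could look like; it is not the paper's method. It assumes the third-party reedsolo package and PyTorch, and the byte-to-base mapping, parity-focused mask shape, and hairpin-counting penalty are illustrative assumptions.

```python
# Illustrative sketch only (not RSRL's actual code): RS-coded data preparation,
# a parity-focused "RS-informed" mask, and a toy single-strand stability penalty.
import torch
from reedsolo import RSCodec  # assumed dependency: pip install reedsolo

NUCLEOTIDES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def rs_encode_bytes(payload: bytes, nsym: int = 16) -> bytes:
    """Append nsym Reed-Solomon parity symbols; bursts of up to nsym//2 symbols are correctable."""
    return bytes(RSCodec(nsym).encode(payload))

def bytes_to_dna(codeword: bytes) -> str:
    """Map each byte to four nucleotides (2 bits per base) -- a simple fixed transcoding."""
    bases = []
    for byte in codeword:
        for shift in (6, 4, 2, 0):
            bases.append(NUCLEOTIDES[(byte >> shift) & 0b11])
    return "".join(bases)

def rs_informed_mask(num_symbols: int, nsym: int = 16) -> torch.Tensor:
    """Binary mask over codeword symbols that emphasizes the parity tail,
    a crude stand-in for steering training toward burst-error correction."""
    mask = torch.zeros(num_symbols)
    mask[-nsym:] = 1.0
    return mask

def hairpin_penalty(seq: str, stem: int = 6) -> float:
    """Toy stability term: count windows whose reverse complement reappears downstream,
    a rough proxy for hairpin-prone (unstable) single strands."""
    rc = lambda s: "".join(COMPLEMENT[b] for b in reversed(s))
    hits = sum(rc(seq[i:i + stem]) in seq[i + stem:] for i in range(len(seq) - stem))
    return hits / max(len(seq) - stem, 1)

if __name__ == "__main__":
    codeword = rs_encode_bytes(b"multi-modal payload chunk")
    dna = bytes_to_dna(codeword)
    mask = rs_informed_mask(len(codeword))
    print(len(codeword), len(dna), int(mask.sum()), round(hairpin_penalty(dna), 3))
```

In the paper the mask and stability term act on learned representations inside an end-to-end model rather than on raw byte strings as above; the sketch only fixes the intuition that parity symbols absorb burst errors and that self-complementary subsequences signal structurally unstable single strands.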