Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage

Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong
{"title":"学习结构稳定的表征,实现多模态无损 DNA 存储","authors":"Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong","doi":"arxiv-2408.00779","DOIUrl":null,"url":null,"abstract":"In this paper, we present Reed-Solomon coded single-stranded representation\nlearning (RSRL), a novel end-to-end model for learning representations for\nmulti-modal lossless DNA storage. In contrast to existing learning-based\nmethods, the proposed RSRL is inspired by both error-correction codec and\nstructural biology. Specifically, RSRL first learns the representations for the\nsubsequent storage from the binary data transformed by the Reed-Solomon codec.\nThen, the representations are masked by an RS-code-informed mask to focus on\ncorrecting the burst errors occurring in the learning process. With the decoded\nrepresentations with error corrections, a novel biologically stabilized loss is\nformulated to regularize the data representations to possess stable\nsingle-stranded structures. By incorporating these novel strategies, the\nproposed RSRL can learn highly durable, dense, and lossless representations for\nthe subsequent storage tasks into DNA sequences. The proposed RSRL has been\ncompared with a number of strong baselines in real-world tasks of multi-modal\ndata storage. The experimental results obtained demonstrate that RSRL can store\ndiverse types of data with much higher information density and durability but\nmuch lower error rates.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"104 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage\",\"authors\":\"Ben Cao, Tiantian He, Xue Li, Bin Wang, Xiaohu Wu, Qiang Zhang, Yew-Soon Ong\",\"doi\":\"arxiv-2408.00779\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we present Reed-Solomon coded single-stranded representation\\nlearning (RSRL), a novel end-to-end model for learning representations for\\nmulti-modal lossless DNA storage. In contrast to existing learning-based\\nmethods, the proposed RSRL is inspired by both error-correction codec and\\nstructural biology. Specifically, RSRL first learns the representations for the\\nsubsequent storage from the binary data transformed by the Reed-Solomon codec.\\nThen, the representations are masked by an RS-code-informed mask to focus on\\ncorrecting the burst errors occurring in the learning process. With the decoded\\nrepresentations with error corrections, a novel biologically stabilized loss is\\nformulated to regularize the data representations to possess stable\\nsingle-stranded structures. By incorporating these novel strategies, the\\nproposed RSRL can learn highly durable, dense, and lossless representations for\\nthe subsequent storage tasks into DNA sequences. The proposed RSRL has been\\ncompared with a number of strong baselines in real-world tasks of multi-modal\\ndata storage. 
The experimental results obtained demonstrate that RSRL can store\\ndiverse types of data with much higher information density and durability but\\nmuch lower error rates.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"104 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00779\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00779","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, the proposed RSRL is inspired by both error-correction coding and structural biology. Specifically, RSRL first learns the representations for subsequent storage from binary data that has been transformed by a Reed-Solomon (RS) codec.
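
As a rough illustration of this kind of pre-processing (a sketch, not code from the paper, with arbitrarily chosen parameters), binary data can be wrapped with Reed-Solomon parity using the third-party Python package reedsolo:

    # Illustrative only: attach Reed-Solomon parity to raw bytes before they
    # are fed to a representation learner. The block size is an assumption.
    from reedsolo import RSCodec

    rsc = RSCodec(nsym=32)                         # 32 parity bytes per block (arbitrary choice)
    payload = b"bytes of a multi-modal data chunk"
    codeword = rsc.encode(payload)                 # payload followed by parity symbols
    # In RSRL's setting, a stream like `codeword` would be the RS-transformed
    # binary input from which the storage representations are learned.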
Then, the representations are masked by an RS-code-informed mask so that learning focuses on correcting the burst errors that occur during the learning process.
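
The abstract does not describe the mask's exact form; a minimal sketch, assuming the RS decoder reports which positions it corrected and that the representations are per-position tensors, might restrict the training objective to those positions:

    # Hypothetical illustration (not the paper's interface): restrict a
    # reconstruction loss to positions flagged as erroneous, so training
    # concentrates on burst-error spans.
    import torch

    def masked_reconstruction_loss(pred, target, error_positions):
        """pred, target: (batch, seq_len, dim); error_positions: per-sample lists of indices."""
        batch, seq_len, _ = pred.shape
        mask = torch.zeros(batch, seq_len, dtype=torch.bool)
        for b, positions in enumerate(error_positions):
            mask[b, positions] = True                   # 1 where the RS decoder corrected symbols
        per_pos = (pred - target).pow(2).mean(dim=-1)   # per-position squared error
        return (per_pos * mask).sum() / mask.sum().clamp(min=1)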
Using the decoded, error-corrected representations, a novel biologically stabilized loss is formulated to regularize the data representations so that they possess stable single-stranded structures.
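
The concrete loss is likewise not given in the abstract; as a hypothetical stand-in, constraints commonly associated with stable, synthesizable single strands (balanced GC content, no long homopolymer runs) can be expressed as a differentiable penalty over per-position nucleotide probabilities:

    # Hypothetical stability regularizer (not the paper's loss): probs has shape
    # (batch, seq_len, 4) with per-position probabilities over A, C, G, T.
    import torch

    def dna_stability_penalty(probs, gc_target=0.5):
        gc = probs[..., 1] + probs[..., 2]                    # P(C) + P(G) at each position
        gc_loss = (gc.mean(dim=1) - gc_target).pow(2).mean()  # keep GC content near the target
        same_base = (probs[:, 1:, :] * probs[:, :-1, :]).sum(dim=-1)
        run_loss = same_base.mean()                           # discourage adjacent repeats (homopolymers)
        return gc_loss + run_loss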
By incorporating these strategies, the proposed RSRL can learn highly durable, dense, and lossless representations for the subsequent task of storing data in DNA sequences. RSRL has been compared with a number of strong baselines on real-world multi-modal data storage tasks, and the experimental results demonstrate that it can store diverse types of data with much higher information density and durability and much lower error rates.
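
For context on the information-density comparison (background only, not the paper's encoder): the densest lossless mapping from bits to bases stores 2 bits per nucleotide, the ceiling against which constraint-aware schemes are usually measured:

    # Background baseline, not RSRL's method: a direct 2-bits-per-nucleotide
    # mapping, the density ceiling that biological constraints erode.
    _BASES = "ACGT"

    def bits_to_dna(data: bytes) -> str:
        out = []
        for byte in data:
            for shift in (6, 4, 2, 0):
                out.append(_BASES[(byte >> shift) & 0b11])   # 2 bits -> 1 base
        return "".join(out)

    def dna_to_bits(seq: str) -> bytes:
        vals = [_BASES.index(b) for b in seq]
        return bytes(
            (vals[i] << 6) | (vals[i + 1] << 4) | (vals[i + 2] << 2) | vals[i + 3]
            for i in range(0, len(vals), 4)
        )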