Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage

IF 10.1 · CAS Zone 1 (Engineering Technology) · JCR Q1 (Engineering, Multidisciplinary)

Engineering · Pub Date: 2025-05-01 · DOI: 10.1016/j.eng.2023.10.021

Shu-Fang Zhang, Yu-Hui Li, Rui-Xian Zhang, Bing-Zhi Li, Qing Wang

Engineering, Volume 48, Pages 151-162 · Journal Article · https://www.sciencedirect.com/science/article/pii/S2095809924006404

Citations: 0


Multi-Mode Data Organization and File Retrieval Based on a Primer Library in Large-Scale Digital DNA Storage

At present, the polymerase chain reaction (PCR) amplification-based file retrieval method is the most commonly used and effective means of DNA file retrieval. The number of orthogonal primers limits the number of files that can be accurately accessed, which in turn affects the density in a single oligo pool of digital DNA storage. In this paper, a multi-mode DNA sequence design method based on PCR file retrieval in a single oligonucleotide pool is proposed for high-capacity DNA data storage. Firstly, by analyzing the maximum number of orthogonal primers at each predicted primer length, it was found that the relationship between primer length and the maximum available primer number does not increase linearly, and the maximum number of orthogonal primers is on the order of $10^{4}$. Next, this paper analyzes the maximum address space capacity of DNA sequences with different types of primer binding sites for file mapping. In the case where the capacity of the primer library is $R$ (where $R$ is even), the number of address spaces that can be mapped by the single-primer DNA sequence design scheme proposed in this paper is four times that of the previous one, and the two-level primer DNA sequence design scheme can reach $\left(\frac{R}{2} \cdot \left(\frac{R}{2} - 1\right)\right)^{2}$ times. Finally, a multi-mode DNA sequence generation method is designed based on the number of files to be stored in the oligonucleotide pool, in order to meet the requirements of the random retrieval of target files in an oligonucleotide pool with large-scale file numbers. The performance of the primers generated by the orthogonal primer library generator proposed in this paper is verified, and the average Gibbs free energy of the most stable heterodimer formed between the orthogonal primers produced is −1 kcal∙(mol∙L⁻¹)⁻¹ (1 kcal = 4.184 kJ). At the same time, by selectively PCR-amplifying the DNA sequences of the two-level primer binding sites for random access, the target sequence can be accurately read with a minimum of $10^{3}$ reads, when the primer binding site sequences at different positions are mutually different. This paper provides a pipeline for orthogonal primer library generation and multi-mode mapping schemes between files and primers, which can help achieve precise random access to files in large-scale DNA oligo pools.
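To give a feel for the scale of the two-level figure quoted in the abstract, the short sketch below simply evaluates the expression (R/2 · (R/2 − 1))² for a few even library sizes. It is an illustration of the arithmetic only, not the authors' pipeline: the function name `two_level_gain` is hypothetical, and the abstract does not spell out the baseline that the multiplier is measured against, so the result should be read as a relative factor rather than an absolute file count.

```python
def two_level_gain(r: int) -> int:
    """Evaluate (R/2 * (R/2 - 1))^2, the multiplier the abstract quotes
    for the two-level primer design, for a primer library of even size R.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if r <= 0 or r % 2 != 0:
        raise ValueError("R must be a positive even integer")
    half = r // 2
    return (half * (half - 1)) ** 2


if __name__ == "__main__":
    # 10_000 matches the abstract's "order of 10^4" upper bound on
    # orthogonal primers; the smaller values are only for comparison.
    for r in (4, 100, 10_000):
        print(f"R = {r:>6}: multiplier = {two_level_gain(r):,}")
```

Because the expression grows roughly as R⁴/16, even a modest increase in the number of usable orthogonal primers changes the addressable space by orders of magnitude, which is consistent with the abstract's emphasis on first bounding the size of the primer library.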


Source journal: Engineering (Environmental Science-Environmental Engineering)

Self-citation rate: 1.60%

Articles published: 335

Review time: 35 days

About the journal: Engineering, an international open-access journal initiated by the Chinese Academy of Engineering (CAE) in 2015, serves as a distinguished platform for disseminating cutting-edge advancements in engineering R&D, sharing major research outputs, and highlighting key achievements worldwide. The journal's objectives encompass reporting progress in engineering science, fostering discussions on hot topics, addressing areas of interest, challenges, and prospects in engineering development, while considering human and environmental well-being and ethics in engineering. It aims to inspire breakthroughs and innovations with profound economic and social significance, propelling them to advanced international standards and transforming them into a new productive force. Ultimately, this endeavor seeks to bring about positive changes globally, benefit humanity, and shape a new future.