Subhasiny Sankar, Yixin Wang, Zhang Jiayu, Nur Sabrina, E. Gunawan, Y. L. Guan, Noor-A-Rahim Md., C. Poh
2022 26th International Computer Science and Engineering Conference (ICSEC). Published 2022-12-21. DOI: 10.1109/ICSEC56337.2022.10049327 (https://doi.org/10.1109/ICSEC56337.2022.10049327)
Comparative Analysis of Clustering Methodologies in DNA Storage
DNA storage technology is a promising answer to exponentially growing storage demand and the need for long-term data retention, so the bio-molecular errors that arise while reading (sequencing) data from DNA molecules must be addressed. Data can be reconstructed by reading redundant copies, but at the cost of additional sequencing and higher decoding complexity. Solutions that deal with both the errors and the complexity are therefore sought after. The main objective of this work is to study data reconstruction methods for processing sequence readouts at the downstream stage of DNA data storage. We investigated the applicability of three clustering tools (Starcode, Slidesort, MeShClust) and two algorithms (Majority Nucleotide Selection (MNS) and Cooperative Sequence Clustering (CSC)) by adapting them into tools suitable for storage applications. At a fixed redundancy of 6.3x to 8.6x, depending on the nature of the dataset, Starcode outperforms the other tools with a 1% to 40% higher recovery rate. However, it incurs the highest decoding complexity, whereas MNS and CSC provide the lowest. The cluster distribution and clustering speed of each tool/method are also compared. This is the first comparative analysis of tools/methods for data reconstruction in DNA data storage.
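To make the idea of reconstructing data from redundant reads concrete, here is a minimal sketch of a per-position majority vote over a group of noisy reads. It illustrates the general principle only: the abstract does not specify how MNS or CSC work, so this should not be read as the authors' method. The function name `majority_consensus` and the sample reads are hypothetical, and the sketch assumes the reads are already grouped into one cluster, have equal length, and contain substitution errors only (no insertions or deletions).

```python
from collections import Counter


def majority_consensus(reads):
    """Return the per-position majority base over equal-length reads.

    Assumption (not from the paper): all reads belong to the same cluster
    and differ only by substitutions, so positions line up directly.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes equal-length reads")

    consensus = []
    # zip(*reads) yields one tuple per position, holding that position's
    # base from every read.
    for column in zip(*reads):
        # most_common(1) returns the most frequent base in the column;
        # ties go to the base seen first.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical example: five noisy copies of the same stored strand.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTTC", "ACGTAC"]
print(majority_consensus(reads))  # prints ACGTAC
```

With five copies, the two single-base substitutions are outvoted at their positions. This also shows why higher redundancy improves recovery while increasing sequencing cost.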