未标记DNA序列双对比聚类的混沌博弈表示。

IF 3.5 2区生物学 Q2 BIOTECHNOLOGY & APPLIED MICROBIOLOGY

BMC Genomics Pub Date : 2024-12-18 DOI:10.1186/s12864-024-11135-y

Fatemeh Alipour, Kathleen A Hill, Lila Kari

{"title":"未标记DNA序列双对比聚类的混沌博弈表示。","authors":"Fatemeh Alipour, Kathleen A Hill, Lila Kari","doi":"10.1186/s12864-024-11135-y","DOIUrl":null,"url":null,"abstract":"Background: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.Results: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0.), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.Conclusion: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.","PeriodicalId":9030,"journal":{"name":"BMC Genomics","volume":"25 1","pages":"1214"},"PeriodicalIF":3.5000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657719/pdf/","citationCount":"0","resultStr":"{\"title\":\"CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences.\",\"authors\":\"Fatemeh Alipour, Kathleen A Hill, Lila Kari\",\"doi\":\"10.1186/s12864-024-11135-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.Results: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0.), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.Conclusion: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.\",\"PeriodicalId\":9030,\"journal\":{\"name\":\"BMC Genomics\",\"volume\":\"25 1\",\"pages\":\"1214\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-12-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11657719/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Genomics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12864-024-11135-y\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOTECHNOLOGY & APPLIED MICROBIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12864-024-11135-y","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

背景：用于DNA序列分类分类的传统监督学习方法依赖于标记初级DNA序列这一劳动密集型且耗时的步骤。此外，标准的DNA分类/聚类方法涉及时间密集的多序列比对，这影响了它们对大型基因组数据集或远亲生物的适用性。这些限制表明需要健壮、高效和可扩展的不依赖于序列标记或比对的无监督DNA序列聚类方法。结果：本研究提出了一种将DNA序列混沌博弈表示（CGR）的无监督双胞胎对比聚类与卷积神经网络（cnn）相结合的新方法CGRclust。据我们所知，CGRclust是第一个将无监督学习用于图像分类（本文应用于二维CGR图像）的DNA序列数据集聚类的方法。CGRclust克服了传统序列分类方法的局限性，利用无监督的双胞胎对比学习来检测独特的序列模式，而不需要DNA序列比对或生物/分类标签。CGRclust准确聚类了25个不同的数据集，序列长度从664 bp到100 kbp不等，包括鱼类、真菌和原生生物的线粒体基因组，以及病毒全基因组组装和合成DNA序列。与最近的三种DNA序列聚类方法（DeLUCS、iducus和MeShClust v3.0.）相比，CGRclust是唯一一种在鱼类线粒体DNA基因组的所有四个分类水平上准确率超过81.70%的方法。此外，CGRclust在所有病毒基因组数据集上也始终表现出卓越的性能。在序列长度、基因组数量、聚类数量和分类水平等方面存在显著差异的25个数据集上，cgrcluster具有较高的聚类精度，证明了其鲁棒性、可扩展性和通用性。结论：CGRclust是一种新颖的、可扩展的、无比对的DNA序列聚类方法，它使用DNA序列的CGR图像和cnn对未标记的初级DNA序列进行双对比聚类，比目前的方法具有更高的精度和性能。CGRclust展示了更高的可靠性，在超过90%的分析数据集中始终达到80%以上的准确率。特别是，CGRclust在聚类病毒DNA数据集方面表现特别好，在这方面它始终优于所有竞争方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

CGRclust: Chaos Game Representation for twin contrastive clustering of unlabelled DNA sequences.

Background: Traditional supervised learning methods applied to DNA sequence taxonomic classification rely on the labor-intensive and time-consuming step of labelling the primary DNA sequences. Additionally, standard DNA classification/clustering methods involve time-intensive multiple sequence alignments, which impacts their applicability to large genomic datasets or distantly related organisms. These limitations indicate a need for robust, efficient, and scalable unsupervised DNA sequence clustering methods that do not depend on sequence labels or alignment.

Results: This study proposes CGRclust, a novel combination of unsupervised twin contrastive clustering of Chaos Game Representations (CGR) of DNA sequences, with convolutional neural networks (CNNs). To the best of our knowledge, CGRclust is the first method to use unsupervised learning for image classification (herein applied to two-dimensional CGR images) for clustering datasets of DNA sequences. CGRclust overcomes the limitations of traditional sequence classification methods by leveraging unsupervised twin contrastive learning to detect distinctive sequence patterns, without requiring DNA sequence alignment or biological/taxonomic labels. CGRclust accurately clustered twenty-five diverse datasets, with sequence lengths ranging from 664 bp to 100 kbp, including mitochondrial genomes of fish, fungi, and protists, as well as viral whole genome assemblies and synthetic DNA sequences. Compared with three recent clustering methods for DNA sequences (DeLUCS, iDeLUCS, and MeShClust v3.0.), CGRclust is the only method that surpasses 81.70% accuracy across all four taxonomic levels tested for mitochondrial DNA genomes of fish. Moreover, CGRclust also consistently demonstrates superior performance across all the viral genomic datasets. The high clustering accuracy of CGRclust on these twenty-five datasets, which vary significantly in terms of sequence length, number of genomes, number of clusters, and level of taxonomy, demonstrates its robustness, scalability, and versatility.

Conclusion: CGRclust is a novel, scalable, alignment-free DNA sequence clustering method that uses CGR images of DNA sequences and CNNs for twin contrastive clustering of unlabelled primary DNA sequences, achieving superior or comparable accuracy and performance over current approaches. CGRclust demonstrated enhanced reliability, by consistently achieving over 80% accuracy in more than 90% of the datasets analyzed. In particular, CGRclust performed especially well in clustering viral DNA datasets, where it consistently outperformed all competing methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Genomics 生物-生物工程与应用微生物

CiteScore

7.40

自引率

4.50%

发文量

769

审稿时长

6.4 months

期刊介绍： BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.