MetaConClust - Unsupervised Binning of Metagenomics Data using Consensus Clustering.

IF 1.4; Zone 4 (Biology); Q4 Biochemistry & Molecular Biology

Current Genomics Pub Date : 2022-06-10 DOI:10.2174/1389202923666220413114659

Dipro Sinha, Anu Sharma, Dwijesh Chandra Mishra, Anil Rai, Shashi Bhushan Lal, Sanjeev Kumar, Moh Samir Farooqi, Krishna Kumar Chaturvedi

Current Genomics, vol. 23(2), pp. 137-146. Journal article. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/8c/3c/CG-23-137.PMC9878838.pdf

Citations: 0

Abstract

Background: Binning of metagenomic reads is an active area of research, and many unsupervised machine-learning techniques have been used for taxonomy-independent binning of metagenomic reads. Objective: It is important to find the optimal number of clusters and to develop an efficient pipeline for deciphering the complexity of microbial genomes. Methods: Unsupervised clustering techniques require the number of clusters to be chosen beforehand, which is a difficult task in binning. This paper describes a novel method, MetaConClust, that uses coverage information to group contigs and automatically finds the optimal number of clusters for binning metagenomic data using a consensus-based clustering approach. The coverage of contigs in a metagenomic sample has been observed to be directly proportional to the abundance of the species in the sample, and MetaConClust uses it to group the data in the first phase. In the second phase, the Partitioning Around Medoids (PAM) method generates the bins, with the initial number of clusters determined automatically through a consensus-based method. Results: The quality of the resulting bins is assessed using the silhouette index, Rand index, recall, precision, and accuracy. MetaConClust is compared with recent methods and tools on benchmark low-complexity simulated and real metagenomic datasets; it performs better than unsupervised methods and comparably to hybrid methods. Conclusion: These results suggest that consensus-based clustering is a promising approach for automatically finding the number of bins in metagenomic data.
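The abstract does not spell out how the consensus on the number of clusters is reached, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes a simple voting scheme: PAM is run for several candidate values of k on random subsamples, each subsample votes for the k with the highest mean silhouette, and the most frequent winner is used for the final PAM run. The function names (`pam`, `consensus_k`, `rand_index`), the subsampling fraction, and the two-feature toy data are all hypothetical; the coverage-based grouping of the first phase is not reproduced.

```python
from collections import Counter
import numpy as np


def pairwise_dist(X):
    """Euclidean distance matrix between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def pam(D, k, max_iter=50):
    """Partitioning Around Medoids (BUILD + SWAP) on a distance matrix D."""
    n = D.shape[0]
    # BUILD: start at the most central point, then greedily add the point
    # that most reduces the total distance to the nearest medoid.
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < k:
        nearest = D[:, medoids].min(axis=1)
        gains = np.maximum(nearest[:, None] - D, 0.0).sum(axis=0)
        gains[medoids] = -1.0  # never pick an existing medoid again
        medoids.append(int(gains.argmax()))
    # SWAP: apply the best medoid/non-medoid exchange until none lowers the cost.
    cost = D[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best = (cost, None, None)
        for pos in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = h
                c = D[:, trial].min(axis=1).sum()
                if c < best[0] - 1e-12:
                    best = (c, pos, h)
        if best[1] is None:
            break
        cost, medoids[best[1]] = best[0], best[2]
    return D[:, medoids].argmin(axis=1), np.array(medoids)


def mean_silhouette(D, labels):
    """Mean silhouette width; -1.0 if fewer than two clusters are present."""
    clusters = np.unique(labels)
    if clusters.size < 2:
        return -1.0
    scores = np.zeros(D.shape[0])
    for i in range(D.shape[0]):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores.mean()


def consensus_k(X, k_range, n_runs=10, frac=0.8, seed=0):
    """Assumed voting scheme: each subsample votes for its best-silhouette k."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_runs):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        D = pairwise_dist(X[idx])
        votes.append(max(k_range, key=lambda k: mean_silhouette(D, pam(D, k)[0])))
    return Counter(votes).most_common(1)[0][0]


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    return (same_a == same_b).mean()


# Toy data: 60 synthetic "contigs" from three well-separated groups
# (the two features are hypothetical stand-ins, not real metagenomic data).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
truth = np.repeat(np.arange(3), 20)
X = centers[truth] + rng.normal(scale=0.5, size=(60, 2))

k_hat = consensus_k(X, k_range=range(2, 7))
labels, medoids = pam(pairwise_dist(X), k_hat)
print("chosen k:", k_hat)
print("Rand index vs. truth:", rand_index(truth, labels))
```

The naive SWAP loop here recomputes the full cost for every candidate exchange, which is acceptable for a few dozen points but would need a faster variant for real contig sets. Precision, recall, and accuracy, which the abstract also reports, would additionally require a mapping from bins to reference genomes and are left out of this sketch.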


Source journal

Current Genomics (Biology: Biochemistry & Molecular Biology)

CiteScore: 5.20

Self-citation rate: 0.00%

About the journal: Current Genomics is a peer-reviewed journal that provides essential reading about the latest and most important developments in genome science and related fields of research. Topics of particular interest include systems biology, systems modeling, machine learning, network inference, bioinformatics, computational biology, epigenetics, single-cell genomics, extracellular vesicles, quantitative biology, and synthetic biology for the study of evolution, development, maintenance, and aging, as well as human health, human diseases, clinical genomics, and precision medicine. The journal covers plant genomics but will not consider articles dealing with breeding and livestock. Current Genomics publishes three types of articles: i) Research papers from internationally recognized experts reporting new and original data generated at the genome scale. Position papers dealing with new or challenging methodological approaches, whether experimental or mathematical, are very welcome in this section. ii) Authoritative and comprehensive full-length or mini reviews from widely recognized experts, covering the latest developments in genome science and related fields of research such as systems biology, statistics and machine learning, quantitative biology, and precision medicine. Proposals for mini hot topics (2-3 review papers) and full hot topics (6-8 review papers), guest edited by internationally recognized experts, are welcome in this section. Hot topic proposals should not contain original data and should contain articles originating from at least two different countries. iii) Opinion papers from internationally recognized experts addressing contemporary questions and issues in genome science, systems biology, and basic and clinical research practices.