A Deep Clustering-based Novel Approach for Binning of Metagenomics Data.

Impact factor 1.4, CAS Zone 4 (Biology), JCR Q4 (BIOCHEMISTRY & MOLECULAR BIOLOGY)

Current Genomics. Pub Date: 2022-11-18. DOI: 10.2174/1389202923666220928150100

Sharanbasappa D Madival, Dwijesh Chandra Mishra, Anu Sharma, Sanjeev Kumar, Arpan Kumar Maji, Neeraj Budhlakoti, Dipro Sinha, Anil Rai

Current Genomics, volume 23, issue 5, pages 353-368. Journal Article. PDF: https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/72/5e/CG-23-353.PMC9878855.pdf


Abstract

Background: One major challenge in binning metagenomics data is the limited availability of reference datasets, as only 1% of the total microbial population has been cultured so far. This has motivated the use of unsupervised methods, which can bin sequences in the absence of any reference dataset.

Objective: To develop a deep clustering-based binning approach for metagenomics data and to evaluate the results with suitable measures.

Methods: In this study, a deep learning-based approach is used to bin the metagenomics data. The results are validated on different datasets using features such as tetra-nucleotide frequency (TNF), hexa-nucleotide frequency (HNF), and GC content. A convolutional autoencoder is used for feature extraction, and the K-means clustering method is used for binning.
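The feature and clustering steps named in the Methods can be illustrated with a minimal, standard-library-only sketch. It computes normalized k-mer frequencies (k=4 for TNF, k=6 for HNF) and GC content per sequence, then clusters the raw feature vectors with a small K-means. This is not the authors' implementation: the convolutional autoencoder stage is omitted entirely, and the function names, the farthest-first initialization, and the toy sequences are all assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"

def kmer_frequencies(seq, k=4):
    """Normalized k-mer frequencies in lexicographic k-mer order.
    k=4 gives a 256-dim TNF vector, k=6 a 4096-dim HNF vector."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other symbols
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]

def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of the sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in BASES)
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def featurize(seq):
    """Hypothetical feature layout: TNF vector followed by GC content."""
    return kmer_frequencies(seq, 4) + [gc_content(seq)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means with deterministic farthest-first seeding
    (an assumption; the paper does not state its initialization)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so centers are too
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy usage with made-up sequences (not data from the paper).
contigs = ["ATATATATATATATAT", "AATTAATTAATTAATT",
           "GCGCGCGCGCGCGCGC", "GGCCGGCCGGCCGGCC"]
bins = kmeans([featurize(s) for s in contigs], k=2)
```

In the approach the abstract describes, the clustering would be applied to the representation learned by the autoencoder rather than to these raw frequency vectors.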

Results: In most cases, the Silhouette index exceeds 0.5 and the Rand index exceeds 0.8, which indicates that the proposed approach gives satisfactory results. The performance of the developed approach is compared with current methods and tools using benchmarked low-complexity simulated and real metagenomic datasets. It performs better than unsupervised methods and on par with semi-supervised methods.
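For readers unfamiliar with the two evaluation measures, the sketch below computes them from their textbook definitions using only the standard library. The Rand index is the fraction of sequence pairs on which the predicted bins and the true labels agree (both together or both apart); the Silhouette index averages (b - a) / max(a, b) over all points, where a is the mean distance to the point's own cluster and b is the mean distance to the nearest other cluster. The function names and the Euclidean distance are assumptions; the abstract does not say which distance or which variant of either index the authors used.

```python
from itertools import combinations
from math import sqrt

def rand_index(true_labels, pred_labels):
    """Fraction of pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:  # convention: singleton clusters score 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy usage on made-up 1-D points (not data from the paper).
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_bins = [0, 0, 1, 1]
toy_silhouette = silhouette_index(toy_points, toy_bins)
toy_rand = rand_index([0, 0, 1, 1], toy_bins)
```

Note that the Rand index needs ground-truth labels, so it applies only to simulated or otherwise labeled datasets, whereas the Silhouette index can be computed from the clustering alone.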

Conclusion: An unsupervised deep learning-based approach for binning has been proposed, and the developed method shows promising results on various datasets. It is a novel approach to the lack of reference data in metagenomic binning.



Source journal

Current Genomics (Biology: Biochemistry & Molecular Biology)

CiteScore: 5.20

Self-citation rate: 0.00%

Articles published: not listed

Review time: >0 weeks

About the journal: Current Genomics is a peer-reviewed journal that provides essential reading about the latest and most important developments in genome science and related fields of research. Topics of particular interest include systems biology, systems modeling, machine learning, network inference, bioinformatics, computational biology, epigenetics, single cell genomics, extracellular vesicles, quantitative biology, and synthetic biology, applied to the study of evolution, development, maintenance, and aging, and to human health, human diseases, clinical genomics, and precision medicine. The journal covers plant genomics. The journal will not consider articles dealing with breeding and livestock. Current Genomics publishes three types of articles:

i) Research papers from internationally recognized experts reporting new and original data generated at the genome-scale level. Position papers dealing with new or challenging methodological approaches, whether experimental or mathematical, are greatly welcome in this section.

ii) Authoritative and comprehensive full-length or mini reviews from widely recognized experts, covering the latest developments in genome science and related fields of research such as systems biology, statistics and machine learning, quantitative biology, and precision medicine. Proposals for mini hot topics (2-3 review papers) and full hot topics (6-8 review papers) guest edited by internationally recognized experts are welcome in this section. Hot topic proposals should not contain original data, and they should contain articles originating from at least 2 different countries.

iii) Opinion papers from internationally recognized experts addressing contemporary questions and issues in the field of genome science and systems biology, and in basic and clinical research practices.