Binning long reads in metagenomics datasets using composition and coverage information.

Impact factor 1.5; Zone 4 (Biology); JCR Q4 (Biochemical Research Methods)

Algorithms for Molecular Biology. Pub Date: 2022-07-11. DOI: 10.1186/s13015-022-00221-z

Anuradha Wickramarachchi, Yu Lin

Citations: 8

Abstract

Background: Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs before binning, mainly because short reads carry limited information. Third-generation sequencing provides much longer reads, with lengths similar to those of contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied to long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. Such tools may miss bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverage. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters of varying sizes.
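To make "composition" and "coverage" concrete, the following is a minimal sketch of how the two kinds of per-read features could be computed and joined into one vector. It is an illustration under stated assumptions, not LRBinner's implementation: it assumes composition is a normalised canonical k-mer frequency vector and coverage is a histogram of how often each of a read's k-mers occurs in the whole dataset. All function names, the k-mer sizes, and the bin settings are hypothetical choices made for the example.

```python
from collections import Counter
from itertools import product

# Translation table used to complement a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmers(seq, k):
    """Yield canonical k-mers of seq, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield canonical(kmer)


def composition_vector(read, k=3):
    """Normalised canonical k-mer frequencies of one read (assumed composition feature)."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = Counter(kmers(read, k))
    total = sum(counts.values())
    return [counts[key] / total if total else 0.0 for key in keys]


def coverage_histogram(read, dataset_counts, k, bins=8, bin_width=2):
    """Normalised histogram of dataset-wide counts of the read's k-mers
    (assumed coverage feature; bins and bin_width are illustrative)."""
    hist = [0] * bins
    n = 0
    for kmer in kmers(read, k):
        count = dataset_counts.get(kmer, 0)
        # Counts beyond the last bin are clamped into it.
        hist[min(count // bin_width, bins - 1)] += 1
        n += 1
    return [h / n if n else 0.0 for h in hist]


def read_features(read, dataset_counts, comp_k=3, cov_k=5):
    """Concatenate composition and coverage features into one vector per read."""
    return composition_vector(read, comp_k) + coverage_histogram(read, dataset_counts, cov_k)


# Toy usage: count k-mers over the whole (tiny, made-up) dataset, then featurise a read.
reads = ["ACGTACGTAC", "ACGTACGTTT", "GGGGCCCCAA"]
dataset_counts = Counter(km for r in reads for km in kmers(r, 5))
features = read_features(reads[0], dataset_counts)
print(len(features))
```

In this sketch the two feature groups are simply concatenated. The abstract states that LRBinner combines them using deep-learning-based feature aggregation, which the example does not attempt to reproduce.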

Results: Experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling complete datasets without any sampling. Moreover, we show that binning reads with LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality.

Conclusion: LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner.

Source journal

Algorithms for Molecular Biology (Biology: Biochemical Research Methods)

CiteScore: 2.40

Self-citation rate: 10.00%

Review time: >12 weeks

About the journal: Algorithms for Molecular Biology publishes articles on novel algorithms for biological sequence and structure analysis, phylogeny reconstruction, and combinatorial algorithms and machine learning. Areas of interest include but are not limited to: algorithms for RNA and protein structure analysis, gene prediction and genome analysis, comparative sequence analysis and alignment, phylogeny, gene expression, machine learning, and combinatorial algorithms. Where appropriate, manuscripts should describe applications to real-world data. However, pure algorithm papers are also welcome if future applications to biological data are to be expected, or if they address complexity or approximation issues of novel computational problems in molecular biology. Articles about novel software tools will be considered for publication if they contain some algorithmically interesting aspects.