metagWGS, a comprehensive workflow to analyze metagenomic data using Illumina or PacBio HiFi reads

Jean Mainguy, Mäina Vienne, Joanna Fourquet, Vincent Darbot, Céline Noirot, Adrien Castinel, Sylvie Combes, Christine Gaspin, Denis Milan, Cecile Donnadieu, Carole Iampietro, Olivier Bouchez, Géraldine Pascal, Claire Hoede
Journal: bioRxiv - Bioinformatics
Publication date: 2024-09-18
DOI: 10.1101/2024.09.13.612854 (https://doi.org/10.1101/2024.09.13.612854)

Abstract

Background: Metagenomic analyses are now widely used to study communities of micro-organisms taxonomically and functionally. When no reference gene catalogue is available, a de novo approach is required. Because genomes are easier to interpret than contigs, recovering metagenome-assembled genomes (MAGs) by binning contigs from metagenomic data has recently become a common task in microbial studies. However, a significant amount of information is lost between the assembly and the binning of contigs. It is therefore important to produce taxonomic and functional matrices for all contigs, not just those included in correct bins. In addition, PacBio HiFi reads (long and of good quality) are now a possible, albeit more expensive, alternative to short Illumina reads. We therefore developed a workflow that is easy to install, with dependencies fixed using Singularity images; easy to use on a computing cluster; capable of analyzing either short or long reads; and designed to allow analysis at the contig and/or bin level, depending on the user's choice. We present metagWGS, a fully automated workflow for metagenomic data analysis. It uses a new bin refinement tool, Binette, which we demonstrate is more efficient than competing tools.

Methods: metagWGS is a Nextflow workflow distributed with two Singularity images and complete documentation to facilitate its installation and use. Because the main original features of metagWGS concern binning (short and long reads) and the analysis of HiFi reads, we compared metagWGS with the MAG construction workflow proposed by PacBio, using a public dataset that PacBio uses to promote its workflow.

Results: metagWGS differs from existing workflows by (i) offering flexible approaches for the assembly; (ii) supporting short reads (Illumina) or PacBio HiFi reads; (iii) combining multiple binning algorithms with a new bin refinement tool, Binette, to achieve high-quality genome bins; and (iv) providing taxonomic and functional annotation for all genes, all assembled contigs, and all bins. On 11 public human gut metagenomic samples, metagWGS produces more medium-quality (708) and high-quality (255) bins than the PacBio HiFi-dedicated workflow, referred to as the HiFi-MAGS-pipeline (659 medium-quality and 231 high-quality bins), primarily owing to the better performance of Binette.
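
To put the reported bin counts in perspective, the relative gain of metagWGS over the HiFi-MAGS-pipeline can be worked out directly from the four numbers given in the abstract. The short Python sketch below is only an illustration of that arithmetic; the dictionary layout and the helper name `relative_gain` are assumptions made for this example and are not part of metagWGS or Binette.

```python
# Bin counts as reported in the abstract (11 public human gut samples).
# The data layout and function name are illustrative assumptions.
counts = {
    "metagWGS": {"medium": 708, "high": 255},
    "HiFi-MAGS-pipeline": {"medium": 659, "high": 231},
}


def relative_gain(new: int, baseline: int) -> float:
    """Return the percentage increase of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Percentage gain of metagWGS over the PacBio workflow, per quality tier.
gains = {
    quality: relative_gain(
        counts["metagWGS"][quality],
        counts["HiFi-MAGS-pipeline"][quality],
    )
    for quality in ("medium", "high")
}

for quality, gain in gains.items():
    print(f"{quality}-quality bins: +{gain:.1f}%")
# prints:
# medium-quality bins: +7.4%
# high-quality bins: +10.4%
```

In other words, the reported counts correspond to roughly 7% more medium-quality bins and roughly 10% more high-quality bins on this dataset. These percentages are derived here from the abstract's figures and are not stated by the authors.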