GCompip: a pipeline for estimating the gene abundance in microbial communities.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances Pub Date : 2025-08-29 eCollection Date: 2025-01-01 DOI:10.1093/bioadv/vbaf207

Xiang Zhou, Qiushuang Li, Shizhe Zhang, Wenxing Wang, Rong Wang, Xiumin Zhang, Zhiliang Tan, Min Wang



Abstract

Motivation: Gene abundance in metagenomic datasets is commonly expressed as counts or copies per million. However, these measures do not account for the size of the microbial community. To reflect gene abundance in microbial communities (GAM), GCompip, a comprehensive pipeline for estimating GAM, was developed based on a specialized universal single-copy gene (USCG) database, stringent alignment parameters, and rigorous filtering criteria.
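The contrast between a per-million measure and a community-size-aware measure can be illustrated with a minimal Python sketch. This is not GCompip's implementation: the abstract does not give its formula, so the sketch assumes that copies per million means length-normalized read counts rescaled to sum to one million, and that a USCG-based abundance is a gene's length-normalized count divided by the median length-normalized count of the USCGs. The function names, gene identifiers, and numbers are all hypothetical.

```python
from statistics import median


def reads_per_kilobase(counts, lengths_bp):
    """Length-normalize raw read counts (reads per kilobase of gene)."""
    return {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}


def copies_per_million(counts, lengths_bp):
    """Rescale length-normalized counts so that they sum to one million.

    The result is compositional: it says nothing about how many
    genomes (cells) the sample contains.
    """
    rpk = reads_per_kilobase(counts, lengths_bp)
    total = sum(rpk.values())
    return {gene: value / total * 1e6 for gene, value in rpk.items()}


def uscg_normalized(counts, lengths_bp, uscg_ids):
    """Express each non-USCG gene relative to the median USCG signal.

    ASSUMPTION (not taken from the paper): because a USCG occurs about
    once per genome, the median USCG value approximates the number of
    genomes sampled, so the ratio approximates gene copies per genome.
    """
    rpk = reads_per_kilobase(counts, lengths_bp)
    baseline = median(rpk[gene] for gene in uscg_ids)
    if baseline == 0:
        raise ValueError("no USCG signal; cannot normalize")
    return {gene: value / baseline
            for gene, value in rpk.items() if gene not in uscg_ids}


# Hypothetical toy sample: three USCGs and one gene of interest.
counts = {"uscg_a": 100, "uscg_b": 120, "uscg_c": 80, "gene_x": 50}
lengths_bp = {"uscg_a": 1000, "uscg_b": 1000, "uscg_c": 1000, "gene_x": 1000}
uscgs = {"uscg_a", "uscg_b", "uscg_c"}

cpm = copies_per_million(counts, lengths_bp)
per_genome = uscg_normalized(counts, lengths_bp, uscgs)
```

In this toy sample, `per_genome["gene_x"]` is 0.5, which under the stated assumption reads as roughly one copy of the gene per two genomes. The per-million value depends instead on the total signal from every other gene in the sample.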

Results: GCompip showed high specificity without compromising computational efficiency, and improved the precision of downstream GAM estimates across six diverse ecological environments (human gut, rumen, freshwater, marine, hydrothermal sediment, and glacier). In contrast, the annotation tools used for comparison (KofamScan, eggNOG-mapper, and HUMAnN3) showed larger error intervals, higher susceptibility to false positives, or overestimation of USCG abundance, primarily because of more relaxed thresholds, multifamily matches, or less stringent alignment settings. To make GCompip more widely applicable, we provide both a Linux command-line version and an R package. Overall, GCompip is an accurate, robust, user-friendly, and efficient computational pipeline for calculating GAM from metagenomic sequencing data. It is accessible to researchers seeking to evaluate the metabolic capabilities of microbial communities, and it improves their capacity to interpret metagenomic data from those communities.

Availability and implementation: GCompip package source code and documentation are freely available for download at https://github.com/XiangZhouCAS/GCompip. A separate Linux command line version is available at https://github.com/XiangZhouCAS/GCompip_onlinux.

