Stephanie N Majernik, Larry Beaver, Patrick H Bradley
{"title":"少量的错误组装会对基于泛基因组的宏基因组分析产生不成比例的影响。","authors":"Stephanie N Majernik, Larry Beaver, Patrick H Bradley","doi":"10.1128/msphere.00857-24","DOIUrl":null,"url":null,"abstract":"<p><p>Individual genes from microbiomes can drive host-level phenotypes. To help identify such candidate genes, several recent tools estimate microbial gene copy numbers directly from metagenomes. These tools rely on alignments to pangenomes, which, in turn, are derived from the set of all individual genomes from one species. While large-scale metagenomic assembly efforts have made pangenome estimates more complete, mixed communities can also introduce contamination into assemblies, and it is unknown how robust pangenome-based metagenomic analyses are to these errors. To gain insight into this problem, we re-analyzed a case-control study of the gut microbiome in cirrhosis, focusing on commensal Clostridia previously implicated in this disease. We tested for differentially prevalent genes in the <i>Lachnospiraceae</i> and then investigated which were likely to be contaminants using sequence similarity searches. Out of 86 differentially prevalent genes, we found that 33 (38%) were probably contaminants originating in taxa such as <i>Veillonella</i> and <i>Haemophilus</i>, unrelated genera that were independently correlated with disease status. Our results demonstrate that even small amounts of contamination in metagenome assemblies, below typical quality thresholds, can threaten to overwhelm gene-level metagenomic analyses. However, we also show that such contaminants can be accurately identified using a method based on gene-to-species correlation. After removing these contaminants, we observe that several flagellar motility gene clusters in the <i>Lachnospira eligens</i> pangenome are associated with cirrhosis status. We have integrated our analyses into an analysis and visualization pipeline, PanSweep, that can automatically identify cases where pangenome contamination may bias the results of gene-resolved analyses.IMPORTANCEMetagenome-assembled genomes, or MAGs, can be constructed without pure cultures of microbes. Large-scale efforts to build MAGs have yielded more complete pangenomes (i.e., sets of all genes found in one species). Pangenomes allow us to measure strain variation in gene content, which can strongly affect phenotype. However, because MAGs come from mixed communities, they can contaminate pangenomes with unrelated DNA; how much this impacts downstream analyses has not been studied. Using a metagenomic study of gut microbes in cirrhosis as our test case, we investigate how contamination affects analyses of microbial gene content. Surprisingly, even small, typical amounts of MAG contamination (<5%) result in disproportionately high levels of false positive associations (38%). Fortunately, we show that most contaminants can be automatically flagged and provide a simple method for doing so. Furthermore, applying this method reveals a new association between cirrhosis and gut microbial motility.</p>","PeriodicalId":19052,"journal":{"name":"mSphere","volume":" ","pages":"e0085724"},"PeriodicalIF":3.7000,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12108083/pdf/","citationCount":"0","resultStr":"{\"title\":\"Small amounts of misassembly can have disproportionate effects on pangenome-based metagenomic analyses.\",\"authors\":\"Stephanie N Majernik, Larry Beaver, Patrick H Bradley\",\"doi\":\"10.1128/msphere.00857-24\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Individual genes from microbiomes can drive host-level phenotypes. To help identify such candidate genes, several recent tools estimate microbial gene copy numbers directly from metagenomes. These tools rely on alignments to pangenomes, which, in turn, are derived from the set of all individual genomes from one species. While large-scale metagenomic assembly efforts have made pangenome estimates more complete, mixed communities can also introduce contamination into assemblies, and it is unknown how robust pangenome-based metagenomic analyses are to these errors. To gain insight into this problem, we re-analyzed a case-control study of the gut microbiome in cirrhosis, focusing on commensal Clostridia previously implicated in this disease. We tested for differentially prevalent genes in the <i>Lachnospiraceae</i> and then investigated which were likely to be contaminants using sequence similarity searches. Out of 86 differentially prevalent genes, we found that 33 (38%) were probably contaminants originating in taxa such as <i>Veillonella</i> and <i>Haemophilus</i>, unrelated genera that were independently correlated with disease status. Our results demonstrate that even small amounts of contamination in metagenome assemblies, below typical quality thresholds, can threaten to overwhelm gene-level metagenomic analyses. However, we also show that such contaminants can be accurately identified using a method based on gene-to-species correlation. After removing these contaminants, we observe that several flagellar motility gene clusters in the <i>Lachnospira eligens</i> pangenome are associated with cirrhosis status. We have integrated our analyses into an analysis and visualization pipeline, PanSweep, that can automatically identify cases where pangenome contamination may bias the results of gene-resolved analyses.IMPORTANCEMetagenome-assembled genomes, or MAGs, can be constructed without pure cultures of microbes. Large-scale efforts to build MAGs have yielded more complete pangenomes (i.e., sets of all genes found in one species). Pangenomes allow us to measure strain variation in gene content, which can strongly affect phenotype. However, because MAGs come from mixed communities, they can contaminate pangenomes with unrelated DNA; how much this impacts downstream analyses has not been studied. Using a metagenomic study of gut microbes in cirrhosis as our test case, we investigate how contamination affects analyses of microbial gene content. Surprisingly, even small, typical amounts of MAG contamination (<5%) result in disproportionately high levels of false positive associations (38%). Fortunately, we show that most contaminants can be automatically flagged and provide a simple method for doing so. Furthermore, applying this method reveals a new association between cirrhosis and gut microbial motility.</p>\",\"PeriodicalId\":19052,\"journal\":{\"name\":\"mSphere\",\"volume\":\" \",\"pages\":\"e0085724\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12108083/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"mSphere\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1128/msphere.00857-24\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/4/29 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"MICROBIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"mSphere","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1128/msphere.00857-24","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/4/29 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"MICROBIOLOGY","Score":null,"Total":0}
Small amounts of misassembly can have disproportionate effects on pangenome-based metagenomic analyses.
Individual genes from microbiomes can drive host-level phenotypes. To help identify such candidate genes, several recent tools estimate microbial gene copy numbers directly from metagenomes. These tools rely on alignments to pangenomes, which, in turn, are derived from the set of all individual genomes from one species. While large-scale metagenomic assembly efforts have made pangenome estimates more complete, mixed communities can also introduce contamination into assemblies, and it is unknown how robust pangenome-based metagenomic analyses are to these errors. To gain insight into this problem, we re-analyzed a case-control study of the gut microbiome in cirrhosis, focusing on commensal Clostridia previously implicated in this disease. We tested for differentially prevalent genes in the Lachnospiraceae and then investigated which were likely to be contaminants using sequence similarity searches. Out of 86 differentially prevalent genes, we found that 33 (38%) were probably contaminants originating in taxa such as Veillonella and Haemophilus, unrelated genera that were independently correlated with disease status. Our results demonstrate that even small amounts of contamination in metagenome assemblies, below typical quality thresholds, can threaten to overwhelm gene-level metagenomic analyses. However, we also show that such contaminants can be accurately identified using a method based on gene-to-species correlation. After removing these contaminants, we observe that several flagellar motility gene clusters in the Lachnospira eligens pangenome are associated with cirrhosis status. We have integrated our analyses into an analysis and visualization pipeline, PanSweep, that can automatically identify cases where pangenome contamination may bias the results of gene-resolved analyses.IMPORTANCEMetagenome-assembled genomes, or MAGs, can be constructed without pure cultures of microbes. Large-scale efforts to build MAGs have yielded more complete pangenomes (i.e., sets of all genes found in one species). Pangenomes allow us to measure strain variation in gene content, which can strongly affect phenotype. However, because MAGs come from mixed communities, they can contaminate pangenomes with unrelated DNA; how much this impacts downstream analyses has not been studied. Using a metagenomic study of gut microbes in cirrhosis as our test case, we investigate how contamination affects analyses of microbial gene content. Surprisingly, even small, typical amounts of MAG contamination (<5%) result in disproportionately high levels of false positive associations (38%). Fortunately, we show that most contaminants can be automatically flagged and provide a simple method for doing so. Furthermore, applying this method reveals a new association between cirrhosis and gut microbial motility.
期刊介绍:
mSphere™ is a multi-disciplinary open-access journal that will focus on rapid publication of fundamental contributions to our understanding of microbiology. Its scope will reflect the immense range of fields within the microbial sciences, creating new opportunities for researchers to share findings that are transforming our understanding of human health and disease, ecosystems, neuroscience, agriculture, energy production, climate change, evolution, biogeochemical cycling, and food and drug production. Submissions will be encouraged of all high-quality work that makes fundamental contributions to our understanding of microbiology. mSphere™ will provide streamlined decisions, while carrying on ASM''s tradition for rigorous peer review.