Machine learning models for delineating marine microbial taxa.

Impact factor: 2.8 | JCR: Q1 (Genetics & Heredity)
NAR Genomics and Bioinformatics | Published: 2025-06-19 | eCollection: 2025-06-01 | DOI: 10.1093/nargab/lqaf090
Stilianos Louca
Volume 7, issue 2, article lqaf090 | Publication type: Journal Article | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12204397/pdf/

Abstract

The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above the genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and the discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, from the genus up to the phylum level. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14 390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (the average of the true-positive and true-negative rates) of the classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories, e.g. those involved in metabolism of cofactors and vitamins, that are particularly correlated with taxon divergence. Predicted taxon delineations were further used to enumerate marine prokaryotic taxa de novo. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.
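
The balanced accuracy quoted above is the mean of the true-positive rate and the true-negative rate. A minimal Python sketch of that definition follows. The function name `balanced_accuracy` and the toy labels are illustrative assumptions, not the paper's code or data. Here a label of 1 is taken to mean "the two MAGs belong to the same taxon" and 0 to mean "different taxa".

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate for binary labels.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # same taxon, predicted same
    fn = sum(1 for t, p in pairs if t and not p)      # same taxon, predicted different
    tn = sum(1 for t, p in pairs if not t and not p)  # different taxa, predicted different
    fp = sum(1 for t, p in pairs if not t and p)      # different taxa, predicted same
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


# Toy, made-up example: 2 same-taxon pairs and 8 different-taxon pairs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, balanced_accuracy(y_true, y_pred))
```

In this toy case the plain accuracy is 0.8 while the balanced accuracy is 0.6875. Most genome pairs are likely to fall in different taxa, so a metric that weights both classes equally is harder to inflate by always predicting "different".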
