EPheClass: ensemble-based phenotype classifier from 16S rRNA gene sequences.

Impact factor: 3.9 · JCR: Q2 (Mathematical & Computational Biology)

Frontiers in Bioinformatics · Pub date: 2025-09-30 · eCollection date: 2025-01-01 · DOI: 10.3389/fbinf.2025.1514880

Lara Vázquez-González, Carlos Peña-Reyes, Alba Regueira-Iglesias, Carlos Balsa-Castro, Inmaculada Tomás, María J Carreira

Volume 5, article 1514880 · Publication type: Journal Article · Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12518240/pdf/

Citations: 0

Abstract

The classification of polymicrobial diseases using machine learning (ML), with data obtained from high-throughput amplicon sequencing of the 16S rRNA gene in human microbiome samples, is an area of bioinformatics currently attracting particular interest. The microbial dysbiosis underlying these diseases is particularly challenging to classify because the data are high-dimensional, with potentially hundreds or even thousands of predictive features. In addition, the imbalance in the composition of the microbial community is highly heterogeneous across samples. In this paper, we propose a curated pipeline for binary phenotype classification based on a count table of 16S rRNA gene amplicons, which can be applied to any microbiome. To evaluate our proposal, raw 16S rRNA gene sequences from samples of healthy and periodontally affected oral microbiomes that met certain quality criteria were downloaded from public repositories; in total, 2,581 samples were analysed. In our approach, we first reduced the dimensionality of the data using feature selection methods. After tuning and evaluating different ML models and ensembles created using Dynamic Ensemble Selection (DES) techniques, we found that all DES models performed similarly and were more robust than individual models. Although its margin over the other methods was minimal, DES-P achieved the highest AUC and was therefore selected as the representative technique in our analysis. When diagnosing periodontal disease from saliva samples, it achieved, with only 13 features, an F1 score of 0.913, a precision of 0.881, a recall (sensitivity) of 0.947, an accuracy of 0.929, and an AUC of 0.973. In addition, we used EPheClass to diagnose inflammatory bowel disease (IBD) and obtained better results than other works in the literature using the same dataset. We also evaluated its effectiveness in detecting antibiotic exposure, where it again demonstrated competitive results. These results highlight the importance and generalisability of our classification approach, which is applicable to different phenotypes, study niches, and sample types. The code is available at https://gitlab.citius.usc.es/lara.vazquez/epheclass.
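The abstract names DES-P and reports precision, recall, and F1 without spelling out how they work. The following is a minimal sketch, not the authors' implementation (which lives in the repository linked above). It assumes DES-P as it is commonly described: for each query sample, keep only the base classifiers whose accuracy on the query's nearest validation samples exceeds that of a random classifier (0.5 for two classes), then take a majority vote. The threshold classifiers built by `make_stump`, the toy data, and the neighbourhood size `k=7` are all hypothetical choices made for illustration; feature selection and model tuning are omitted.

```python
import random


def make_stump(feature, threshold):
    """Hypothetical base classifier: predict 1 when one feature exceeds a threshold."""
    return lambda x: 1 if x[feature] > threshold else 0


def squared_distance(p, q):
    """Squared Euclidean distance (enough for ranking neighbours)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def region_of_competence(query, dsel_X, k):
    """Indices of the k validation (DSEL) samples nearest to the query."""
    order = sorted(range(len(dsel_X)),
                   key=lambda i: squared_distance(query, dsel_X[i]))
    return order[:k]


def des_p_predict(query, pool, dsel_X, dsel_y, k=7, n_classes=2):
    """DES-P style prediction: keep classifiers that beat chance locally, then vote."""
    region = region_of_competence(query, dsel_X, k)
    chance = 1.0 / n_classes
    selected = [
        clf for clf in pool
        if sum(clf(dsel_X[i]) == dsel_y[i] for i in region) / len(region) > chance
    ]
    if not selected:  # assumption: fall back to the whole pool if none beats chance
        selected = pool
    votes = sum(clf(query) for clf in selected)
    return 1 if 2 * votes > len(selected) else 0  # ties go to class 0


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(y_true, y_pred):
    """Return (precision, recall, F1) for the positive class, labelled 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    X = [(rng.random(), rng.random()) for _ in range(300)]
    y = [1 if a + b > 1.0 else 0 for a, b in X]  # synthetic labels, not real data
    dsel_X, dsel_y = X[:200], y[:200]            # validation set used for selection
    test_X, test_y = X[200:], y[200:]
    pool = [make_stump(f, t) for f in (0, 1) for t in (0.25, 0.5, 0.75)]
    preds = [des_p_predict(x, pool, dsel_X, dsel_y) for x in test_X]
    precision, recall, f1 = binary_metrics(test_y, preds)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

As a consistency check on the reported figures, `f1_score(0.881, 0.947)` rounds to 0.913, which matches the F1 score stated in the abstract. For real data, a maintained DES implementation would be preferable to this hand-rolled version.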
