Sebastian Staab, Anny Cárdenas, Raquel S Peixoto, Falk Schreiber, Christian R Voolstra
Coracle—A Machine Learning Framework to Identify Bacteria Associated with Continuous Variables
Bioinformatics, published 2023-12-19
DOI: 10.1093/bioinformatics/btad749 (https://doi.org/10.1093/bioinformatics/btad749)
Abstract
Summary: We present Coracle, an Artificial Intelligence (AI) framework that can identify associations between bacterial communities and continuous variables. Coracle uses an ensemble approach of prominent feature selection methods and machine learning (ML) models to identify features, i.e., bacteria, associated with a continuous variable, e.g., host thermal tolerance. The results are aggregated into a score that incorporates the performances of the different ML models and the respective feature importance, while also considering the robustness of feature selection. Additionally, regression coefficients provide first insights into the direction of the association. We show the utility of Coracle by analyzing associations between bacterial composition data (i.e., 16S rRNA Amplicon Sequence Variants, ASVs) and coral thermal tolerance (i.e., standardized short-term heat stress-derived diagnostics). This analysis identified high-scoring bacterial taxa that were previously found associated with coral thermal tolerance. Coracle scales with feature number and performs well with hundreds to thousands of features, corresponding to the typical size of current datasets. Coracle performs best if run at a higher taxonomic level first (e.g., order or family) to identify groups of interest that can subsequently be run at the ASV level.

Availability and Implementation: Coracle can be accessed via a dedicated web server that allows free and simple access: http://www.micportal.org/coracle/index. The underlying code is open-source and available via GitHub: https://github.com/SebastianStaab/coracle.git.

Supplementary information: Example datasets and a tutorial are available on the web server webpage. Supplementary data are available at Bioinformatics online.
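
The score aggregation described in the Summary can be illustrated with a minimal sketch. The abstract does not give Coracle's actual formula, so everything below is an assumption made for illustration: the `ModelResult` container, the `aggregate_scores` function, the use of a clipped performance value as the model weight, and the use of selection frequency as the robustness term are all hypothetical, and the taxon names are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """Output of one hypothetical feature-selection + regression run."""
    name: str
    performance: float             # assumed metric, e.g. a cross-validated R^2
    importances: dict[str, float]  # feature -> importance; absent = not selected


def aggregate_scores(results: list[ModelResult]) -> dict[str, float]:
    """Combine per-model feature importances into one score per feature.

    Illustrative formula (an assumption, not the published Coracle score):
    score = selection frequency * performance-weighted normalized importance.
    """
    # Models with non-positive performance contribute nothing to the score.
    weights = {r.name: max(r.performance, 0.0) for r in results}
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("no model has positive performance")

    features = {f for r in results for f in r.importances}
    scores: dict[str, float] = {}
    for feature in features:
        weighted = 0.0
        selected = 0
        for r in results:
            total_imp = sum(abs(v) for v in r.importances.values())
            if feature in r.importances and total_imp > 0:
                selected += 1
                # Normalize within the model so importances are comparable.
                share = abs(r.importances[feature]) / total_imp
                weighted += weights[r.name] * share
        robustness = selected / len(results)  # fraction of runs selecting it
        scores[feature] = robustness * weighted / total_weight
    return scores


# Placeholder inputs: three hypothetical model runs over three placeholder taxa.
results = [
    ModelResult("model_1", 0.6, {"taxon_A": 0.7, "taxon_B": 0.3}),
    ModelResult("model_2", 0.4, {"taxon_A": 1.0}),
    ModelResult("model_3", -0.2, {"taxon_C": 1.0}),
]
scores = aggregate_scores(results)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # taxon_A ranks first: selected by both well-performing models
```

In this sketch a feature scores highly only if well-performing models assign it a large share of importance and several runs select it, which mirrors the three ingredients named in the Summary (model performance, feature importance, robustness of selection). The sign of a regression coefficient could be reported alongside the score to indicate direction, but that is not shown here.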