Modelling soil prokaryotic traits across environments with the trait sequence database ampliconTraits and the R package MicEnvMod

IF 5.8 · Zone 2, Environmental Science and Ecology · JCR Q1, Ecology
Jonathan Donhauser, Anna Doménech-Pascual, Xingguo Han, Karen Jordaan, Jean-Baptiste Ramond, Aline Frossard, Anna M. Romaní, Anders Priemé
Ecological Informatics, volume 83, Article 102817. Published 2024-09-10. DOI: 10.1016/j.ecoinf.2024.102817
https://www.sciencedirect.com/science/article/pii/S1574954124003595

Abstract

We present a comprehensive, customizable workflow for inferring prokaryotic phenotypic traits from marker gene sequences and modelling the relationships between these traits and environmental factors, thus overcoming the limited ecological interpretability of marker gene sequencing data. We created the trait sequence database ampliconTraits by cross-mapping species from a phenotypic trait database to the SILVA sequence database and formatting the result so that environmental sequences can be classified directly with the SINAPS algorithm. The R package MicEnvMod enables modelling of trait-environment relationships, combining the strengths of different model types and integrating an approach to evaluate the models' predictive performance in a single framework. Traits could be accurately predicted even for sequences with low sequence identity (80%) to the reference sequences, indicating that our approach is suitable for classifying a wide range of environmental sequences. Validating our approach on a large trans-continental soil dataset, we showed that trait distributions were robust to classification settings such as the bootstrap cutoff for classification and the number of discrete intervals for continuous traits. Using functions from MicEnvMod, we identified precipitation seasonality and land cover as the most important predictors of genome size. Pearson correlation coefficients between observed and predicted values reached up to 0.70 under repeated split-sampling cross-validation, corroborating the predictive ability of our models beyond the training data. Predictions of genome size across the Iberian Peninsula placed the largest genomes in the northern part. Potential limitations of our trait inference approach include dependence on the phylogenetic conservation of traits and limited database coverage of environmental prokaryotes. Overall, our approach enables robust inference of ecologically interpretable traits combined with environmental modelling, allowing traits to be harnessed as bioindicators of soil ecosystem functioning.
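
The evaluation step mentioned in the abstract, repeated split-sampling cross-validation scored by the Pearson correlation between observed and predicted values, can be illustrated with a minimal sketch. This is not MicEnvMod's API (the package is written in R and its function names are not given here). The sketch assumes a single hypothetical environmental predictor, a simple least-squares line as a stand-in for the package's model types, synthetic data, and made-up function names (`pearson`, `fit_line`, `repeated_split_cv`).

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    return mean_y - slope * mean_x, slope


def repeated_split_cv(x, y, n_repeats=50, train_fraction=0.8, seed=0):
    """Return the held-out Pearson r for each random train/test split."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    n_train = int(train_fraction * len(indices))
    scores = []
    for _ in range(n_repeats):
        # Draw a fresh random split on every repeat.
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        # Fit on the training part only, then predict the held-out part.
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        predicted = [intercept + slope * x[i] for i in test]
        observed = [y[i] for i in test]
        scores.append(pearson(observed, predicted))
    return scores


# Synthetic data, NOT from the paper: a hypothetical environmental predictor
# and a noisy trait value that depends on it linearly.
data_rng = random.Random(42)
predictor = [data_rng.uniform(0.0, 100.0) for _ in range(200)]
trait = [4.0 + 0.02 * p + data_rng.gauss(0.0, 0.5) for p in predictor]

scores = repeated_split_cv(predictor, trait)
print(f"mean held-out Pearson r over {len(scores)} splits: {sum(scores) / len(scores):.2f}")
```

Because each score is computed only on samples the model never saw during fitting, the spread of values across repeats gives a rough sense of how stable the predictive performance is, which is the property the abstract appeals to when it speaks of predictive ability beyond the training data.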

Source journal: Ecological Informatics (Environmental Sciences, Ecology)
CiteScore: 8.30
Self-citation rate: 11.80%
Articles published: 346
Review time: 46 days
About the journal: The journal Ecological Informatics is devoted to the publication of high-quality, peer-reviewed articles on all aspects of computational ecology, data science and biogeography. The scope of the journal takes into account the data-intensive nature of ecology, the growing capacity of information technology to access, harness and leverage complex data, and the critical need for informing sustainable management in view of global environmental and climate change. The journal is interdisciplinary, at the crossover between ecology and informatics. It focuses on novel concepts and techniques for image- and genome-based monitoring and interpretation, sensor- and multimedia-based data acquisition, internet-based data archiving and sharing, data assimilation, modelling and prediction of ecological data.