Environmental adaptations in metagenomes revealed by deep learning.

IF 4.5 · Zone 1 (Biology) · JCR Q1 (BIOLOGY)
Johanna C Winder, Simon Poulton, Taoyang Wu, Thomas Mock, Cock van Oosterhout
BMC Biology, vol. 23, no. 1, article 252. Journal Article. Published 2025-08-11. DOI: 10.1186/s12915-025-02361-1 (https://doi.org/10.1186/s12915-025-02361-1). PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12337378/pdf/

Abstract

Background: Deep learning has emerged as a powerful tool in the analysis of biological data, including the analysis of large metagenome data. However, its application remains limited due to high computational costs, model complexity, and difficulty extracting biological insights from these artificial neural networks (ANNs). In this study, we applied a transfer learning approach using the ESM-2 protein structure prediction model and our own smaller ANN to classify proteins containing the domain of unknown function 3494 (DUF3494) by their source environments. DUF3494 is found in a diverse group of putative ice-binding and substrate-binding proteins across a range of environments in prokaryotic and eukaryotic microorganisms. They present a compelling test case for exploring the balance between prediction accuracy and interpretability in sequence classification.
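The transfer learning idea described above (a frozen pretrained encoder feeding a smaller trainable classifier) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `frozen_embed` is a hypothetical stand-in that uses amino-acid composition in place of ESM-2 embeddings, the head is a single softmax layer rather than the paper's ANN, and the sequences and labels are invented toy data.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def frozen_embed(seq):
    """Hypothetical stand-in for a frozen pretrained encoder (not ESM-2):
    returns the amino-acid composition of the sequence as a 20-d vector."""
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = max(1, sum(counts))
    return [c / total for c in counts]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(W[k], x)) + b[k] for k in range(len(W))]

def train_head(features, labels, n_classes, epochs=300, lr=1.0, seed=0):
    """Train only the small softmax head with SGD; the encoder is never updated."""
    rng = random.Random(seed)
    dim = len(features[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = softmax(logits(W, b, x))
            for k in range(n_classes):
                g = p[k] - (1.0 if k == y else 0.0)  # cross-entropy gradient
                b[k] -= lr * g
                for j in range(dim):
                    W[k][j] -= lr * g * x[j]
    return W, b

def predict(W, b, seq):
    scores = logits(W, b, frozen_embed(seq))
    return max(range(len(scores)), key=scores.__getitem__)

# Invented toy sequences; class 0 and class 1 are placeholder environments.
train_seqs = ["TTSTTSGTTS", "STTSTTGSTT", "TSTTSSTTGT",
              "LLALLAKLLA", "ALLALLKALL", "LALLAALLKL"]
train_labels = [0, 0, 0, 1, 1, 1]

features = [frozen_embed(s) for s in train_seqs]  # computed once, then reused
W, b = train_head(features, train_labels, n_classes=2)
train_acc = sum(predict(W, b, s) == y
                for s, y in zip(train_seqs, train_labels)) / len(train_seqs)
```

Because the encoder is frozen, the embeddings need to be computed only once, which is one reason this kind of setup can reduce the computational cost mentioned above.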

Results: Our ANN analysed 50,669 DUF3494 sequences from publicly available metagenomes, and successfully classified a large proportion of sequences by source environment (polar marine, glacier ice, frozen sediment, rock, subsurface). We identified environment-specific features that appear to drive classification. Our best-performing ANN was able to classify between 75.9 and 97.8% of sequences correctly. To enhance biological interpretability of these predictions, we compared this model with a genetic algorithm (GA), which, although it had lower predictive ability, provided transparent classification rules and predictors. Further in silico mutagenesis of key residues uncovered a vertically aligned column of amino acids on the b-face of the protein which was important for environmental differentiation, suggesting that both methods captured distinct evolutionary and ecological aspects of the sequences. Feature importance analysis identified that steric and electronic properties of the protein were associated with predictive ability.
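As a rough illustration of the in silico mutagenesis mentioned above, the sketch below substitutes every residue with each of the other 19 amino acids and records how much a classifier score changes. The abstract does not describe the authors' exact procedure, so this is an assumption about the general technique; `toy_score` is a hypothetical stand-in for a trained model's class probability, and the sequence is invented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutagenesis_scan(seq, score):
    """For each position, return the largest absolute change in score
    over all 19 single-residue substitutions."""
    base = score(seq)
    effects = []
    for i, wild_type in enumerate(seq):
        deltas = []
        for aa in AMINO_ACIDS:
            if aa == wild_type:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            deltas.append(abs(score(mutant) - base))
        effects.append(max(deltas))
    return effects

def toy_score(seq):
    """Hypothetical scorer (not a trained model): the fraction of
    positions 0, 3, 6, ... that are threonine."""
    sites = seq[::3]
    return sum(1 for r in sites if r == "T") / len(sites)

seq = "TAGTLKTVDTGS"  # invented example sequence
effects = mutagenesis_scan(seq, toy_score)
important = [i for i, e in enumerate(effects) if e > 0]
print(important)  # positions whose substitution changes the score
```

With a real classifier, positions with large effects could then be mapped onto a structure to see whether they cluster spatially, as in the column of residues reported above.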

Conclusions: Our findings highlight the utility of deep learning for classification of diverse biological sequences and provide a framework for combining methods to improve model interpretability and ecological insights.

Source journal: BMC Biology (Biology)
CiteScore: 7.80 · Self-citation rate: 1.90% · Articles published: 260 · Review time: 3 months
About the journal: BMC Biology is a broad scope journal covering all areas of biology. Our content includes research articles, new methods and tools. BMC Biology also publishes reviews, Q&A, and commentaries.