NFEmbed: modeling nitrogenase activity via classification and regression with pretrained protein embeddings.

Impact factor: 2.8 | JCR Q2, Mathematical & Computational Biology
Bioinformatics Advances | Pub Date: 2025-08-23 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf204
Md Muhaiminul Islam Nafi, Abdullah Al Mohaimin
Citations: 0

Abstract

Motivation: Heavy use of synthetic nitrogen fertilizers to meet the rising demand for food has caused severe environmental harm, including declining crop yields and eutrophication. One promising alternative is to use nitrogen-fixing microorganisms, which rely on the nitrogenase enzyme, as biofertilizers. Another is to express a functional nitrogenase enzyme directly in the cells of cereal crops.

Results: In this study, we used machine learning to predict microbial strains with a high potential for nitrogenase activity. The objective was to enable the screening and ranking of candidate strains based on genomic information. We explored several protein language model embeddings for this prediction task and built two stacking ensemble models. The first, NFEmbed-C, used k-Nearest Neighbors as the base learner and Random Forest as the meta learner. The second, NFEmbed-R, combined a Decision Tree Regressor and an eXtreme Gradient Boosting Regressor as base learners, with a Support Vector Regressor as the meta learner. On the test set, both NFEmbed-C and NFEmbed-R outperformed the state-of-the-art methods, with improvements ranging from 0% to 11.2% and from 30% to 51%, respectively. NFEmbed-R achieved an R² score of 0.783, an MSE of 0.158, and an RMSE of 0.398, while NFEmbed-C achieved a sensitivity of 0.949, an F1 score of 0.892, and a Matthews Correlation Coefficient of 0.784 on the test set.
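
The two stacking layouts described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature matrix is random synthetic data standing in for protein language model embeddings, the binary label is derived by thresholding a made-up activity value, all hyperparameters are placeholders, and scikit-learn's GradientBoostingRegressor is swapped in for the eXtreme Gradient Boosting Regressor so that the example needs only one library.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestClassifier,
    StackingClassifier,
    StackingRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: 200 "strains", each with a 32-dimensional vector
# standing in for a pretrained protein embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
activity = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
label = (activity > 0).astype(int)  # assumed binary high/low activity label

# One split shared by both tasks: 150 training rows, 50 test rows.
X_tr, X_te, a_tr, a_te, y_tr, y_te = train_test_split(
    X, activity, label, test_size=0.25, random_state=0
)

# Classifier layout: k-Nearest Neighbors base learner, Random Forest meta learner.
clf = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier(n_neighbors=5))],
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    cv=5,  # meta learner is trained on out-of-fold base predictions
)
clf.fit(X_tr, y_tr)
class_pred = clf.predict(X_te)

# Regressor layout: decision tree and gradient boosting base learners
# (GradientBoostingRegressor is a stand-in for XGBoost), SVR meta learner.
reg = StackingRegressor(
    estimators=[
        ("dt", DecisionTreeRegressor(random_state=0)),
        ("gbr", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=SVR(),
    cv=5,
)
reg.fit(X_tr, a_tr)
reg_pred = reg.predict(X_te)

print("classifier accuracy on held-out rows:", clf.score(X_te, y_te))
print("regressor R^2 on held-out rows:", reg.score(X_te, a_te))
```

In both cases the meta learner sees only cross-validated predictions from the base learners, which is what distinguishes stacking from simply averaging models. The scores printed here describe the synthetic data only and say nothing about the reported NFEmbed results.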

Availability and implementation: We performed our analysis in Python; code is available at https://github.com/nafcoder/NFEmbed.
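
For readers less familiar with the evaluation metrics quoted in the Results, the following standard-library Python sketch shows how sensitivity, F1 score, Matthews Correlation Coefficient, MSE, RMSE, and R² are conventionally computed. The function names and the toy inputs are hypothetical and chosen only for illustration; they are not taken from the NFEmbed repository.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, F1, and MCC from binary confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall on the positive class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "f1": f1, "mcc": mcc}


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Toy inputs (made up for illustration).
cls = classification_metrics(tp=8, fp=2, tn=7, fn=3)
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(cls)
print(reg)
```

Because RMSE is the square root of MSE, the two reported regression figures can be cross-checked: the square root of 0.158 is about 0.397, which agrees with the reported RMSE of 0.398 once rounding of the MSE is taken into account.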
