NFEmbed: modeling nitrogenase activity via classification and regression with pretrained protein embeddings.

Impact factor: 2.8 | JCR quartile: Q2 | Category: Mathematical & Computational Biology

Bioinformatics Advances | Pub Date: 2025-08-23 | eCollection Date: 2025-01-01 | DOI: 10.1093/bioadv/vbaf204

Md Muhaiminul Islam Nafi, Abdullah Al Mohaimin



Abstract

Motivation: Heavy use of synthetic nitrogen fertilizers to meet the increasing demand for food has led to severe environmental impacts such as decreasing crop yields and eutrophication. One promising alternative is to use nitrogen-fixing microorganisms, which rely on the nitrogenase enzyme, as biofertilizers. Another is to express a functional nitrogenase enzyme in the cells of cereal crops.

Results: In this study, we used machine learning techniques to predict microbial strains with a high potential for nitrogenase activity. The objective was to enable the screening and ranking of candidate strains based on genomic information. We explored several protein language model embeddings for this prediction task and built two stacking ensemble models. The first, NFEmbed-C, used k-Nearest Neighbors as the base learner and Random Forest as the meta learner. The second, NFEmbed-R, combined a Decision Tree Regressor and an eXtreme Gradient Boosting Regressor as base learners, with a Support Vector Regressor as the meta learner. On the test set, both NFEmbed-C and NFEmbed-R performed better than the state-of-the-art methods, with improvements ranging from 0% to 11.2% and from 30% to 51%, respectively. NFEmbed-R achieved an R² score of 0.783, an MSE of 0.158, and an RMSE of 0.398, while NFEmbed-C achieved a sensitivity of 0.949, an F1 score of 0.892, and a Matthews Correlation Coefficient of 0.784 on the test set.
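
The two stacking layouts described above can be sketched with scikit-learn as follows. This is an illustrative sketch only, not the authors' code (their implementation is in the repository linked under Availability). Several things here are assumptions: random Gaussian features stand in for protein language model embeddings, scikit-learn's GradientBoostingRegressor stands in for the eXtreme Gradient Boosting Regressor so the example needs no extra dependency, and all hyperparameters, the 5-fold stacking, and the helper names (make_synthetic_embeddings, build_classifier, build_regressor, run_demo) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestClassifier,
    StackingClassifier,
    StackingRegressor,
)
from sklearn.metrics import (
    f1_score,
    matthews_corrcoef,
    mean_squared_error,
    r2_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

SEED = 0  # arbitrary seed, used only to make the sketch repeatable


def make_synthetic_embeddings(n_samples=300, n_features=32):
    """Assumption: random vectors stand in for real protein embeddings."""
    rng = np.random.default_rng(SEED)
    X = rng.normal(size=(n_samples, n_features))
    signal = X[:, :5].sum(axis=1)
    y_reg = signal + rng.normal(scale=0.3, size=n_samples)  # continuous target
    y_clf = (signal > 0).astype(int)  # binary high/low activity label
    return X, y_clf, y_reg


def build_classifier():
    """NFEmbed-C-style layout: kNN base learner, Random Forest meta learner."""
    return StackingClassifier(
        estimators=[("knn", KNeighborsClassifier(n_neighbors=5))],
        final_estimator=RandomForestClassifier(n_estimators=100, random_state=SEED),
        cv=5,  # assumed; the meta learner is trained on out-of-fold predictions
    )


def build_regressor():
    """NFEmbed-R-style layout: tree and boosting base learners, SVR meta learner."""
    return StackingRegressor(
        estimators=[
            ("tree", DecisionTreeRegressor(random_state=SEED)),
            # Stand-in for XGBoost's regressor (assumption, see lead-in).
            ("gbr", GradientBoostingRegressor(random_state=SEED)),
        ],
        final_estimator=SVR(),
        cv=5,
    )


def classification_metrics(y_true, y_pred):
    """Sensitivity (recall on the positive class), F1, and MCC."""
    return {
        "sensitivity": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }


def regression_metrics(y_true, y_pred):
    """R², MSE, and RMSE; RMSE is the square root of MSE."""
    mse = mean_squared_error(y_true, y_pred)
    return {"r2": r2_score(y_true, y_pred), "mse": mse, "rmse": float(np.sqrt(mse))}


def run_demo():
    """Fit both stacks on synthetic data and score them on a held-out split."""
    X, y_clf, y_reg = make_synthetic_embeddings()
    X_tr, X_te, yc_tr, yc_te, yr_tr, yr_te = train_test_split(
        X, y_clf, y_reg, test_size=0.25, random_state=SEED, stratify=y_clf
    )
    clf = build_classifier().fit(X_tr, yc_tr)
    reg = build_regressor().fit(X_tr, yr_tr)
    return (
        classification_metrics(yc_te, clf.predict(X_te)),
        regression_metrics(yr_te, reg.predict(X_te)),
    )


if __name__ == "__main__":
    clf_scores, reg_scores = run_demo()
    print("classification:", clf_scores)
    print("regression:", reg_scores)
```

The scores this sketch prints describe the synthetic data only and say nothing about the published models. As a consistency check on the reported figures, the square root of the reported MSE of 0.158 is about 0.397, which agrees with the reported RMSE of 0.398 to within rounding.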

Availability and implementation: We performed our analysis in Python; code is available at https://github.com/nafcoder/NFEmbed.

