NNKcat: deep neural network to predict catalytic constants (Kcat) by integrating protein sequence and substrate structure with enhanced data imbalance handling.

IF 6.8 2区生物学 Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics Pub Date : 2025-05-01 DOI:10.1093/bib/bbaf212

Jingchen Zhai, Xiguang Qi, Lianjin Cai, Yue Liu, Haocheng Tang, Lei Xie, Junmei Wang

{"title":"NNKcat: deep neural network to predict catalytic constants (Kcat) by integrating protein sequence and substrate structure with enhanced data imbalance handling.","authors":"Jingchen Zhai, Xiguang Qi, Lianjin Cai, Yue Liu, Haocheng Tang, Lei Xie, Junmei Wang","doi":"10.1093/bib/bbaf212","DOIUrl":null,"url":null,"abstract":"<p><p>Catalytic constant (Kcat) is to describe the efficiency of catalyzing reactions. The Kcat value of an enzyme-substrate pair indicates the rate an enzyme converts saturated substrates into product during the catalytic process. However, it is challenging to construct robust prediction models for this important property. Most of the existing models, including the one recently published by Nature Catalysis (Li et al.), are suffering from the overfitting issue. In this study, we proposed a novel protocol to construct Kcat prediction models, introducing an intermedia step to separately develop substrate and protein processors. The substrate processor leverages analyzing Simplified Molecular Input Line Entry System (SMILES) strings using a graph neural network model, attentive FP, while the protein processor abstracts protein sequence information utilizing long short-term memory architecture. This protocol not only mitigates the impact of data imbalance in the original dataset but also provides greater flexibility in customizing the general-purpose Kcat prediction model to enhance the prediction accuracy for specific enzyme classes. Our general-purpose Kcat prediction model demonstrates significantly enhanced stability and slightly better accuracy (R2 value of 0.54 versus 0.50) in comparison with Li et al.'s model using the same dataset. Additionally, our modeling protocol enables personalization of fine-tuning the general-purpose Kcat model for specific enzyme categories through focused learning. Using Cytochrome P450 (CYP450) enzymes as a case study, we achieved the best R2 value of 0.64 for the focused model. The high-quality performance and expandability of the model guarantee its broad applications in enzyme engineering and drug research & development.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 3","pages":""},"PeriodicalIF":6.8000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12078937/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Briefings in bioinformatics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/bib/bbaf212","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

Catalytic constant (Kcat) is to describe the efficiency of catalyzing reactions. The Kcat value of an enzyme-substrate pair indicates the rate an enzyme converts saturated substrates into product during the catalytic process. However, it is challenging to construct robust prediction models for this important property. Most of the existing models, including the one recently published by Nature Catalysis (Li et al.), are suffering from the overfitting issue. In this study, we proposed a novel protocol to construct Kcat prediction models, introducing an intermedia step to separately develop substrate and protein processors. The substrate processor leverages analyzing Simplified Molecular Input Line Entry System (SMILES) strings using a graph neural network model, attentive FP, while the protein processor abstracts protein sequence information utilizing long short-term memory architecture. This protocol not only mitigates the impact of data imbalance in the original dataset but also provides greater flexibility in customizing the general-purpose Kcat prediction model to enhance the prediction accuracy for specific enzyme classes. Our general-purpose Kcat prediction model demonstrates significantly enhanced stability and slightly better accuracy (R2 value of 0.54 versus 0.50) in comparison with Li et al.'s model using the same dataset. Additionally, our modeling protocol enables personalization of fine-tuning the general-purpose Kcat model for specific enzyme categories through focused learning. Using Cytochrome P450 (CYP450) enzymes as a case study, we achieved the best R2 value of 0.64 for the focused model. The high-quality performance and expandability of the model guarantee its broad applications in enzyme engineering and drug research & development.

查看原文本刊更多论文

NNKcat：通过整合蛋白质序列和底物结构并增强数据不平衡处理来预测催化常数（Kcat）的深度神经网络。

催化常数（Kcat）用来描述催化反应的效率。酶-底物对的Kcat值表示酶在催化过程中将饱和底物转化为产物的速率。然而，为这一重要特性构建稳健的预测模型是一项挑战。大多数现有模型，包括Nature Catalysis最近发表的模型（Li et al.），都存在过拟合问题。在这项研究中，我们提出了一种新的协议来构建Kcat预测模型，引入了一个中间步骤来分别开发底物和蛋白质处理程序。底物处理器利用图形神经网络模型（attention FP）分析简化分子输入行输入系统（SMILES）字符串，而蛋白质处理器利用长短期记忆架构抽象蛋白质序列信息。该协议不仅减轻了原始数据不平衡的影响，而且为定制通用Kcat预测模型提供了更大的灵活性，以提高对特定酶类的预测精度。与Li等人使用相同数据集的模型相比，我们的通用Kcat预测模型显示出显著增强的稳定性和略好的准确性（R2值为0.54对0.50）。此外，我们的建模协议允许通过集中学习对特定酶类别的通用Kcat模型进行个性化微调。以细胞色素P450 （CYP450）酶为例，我们获得了聚焦模型的最佳R2值为0.64。该模型的高质量性能和可扩展性保证了其在酶工程和药物研发中的广泛应用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Briefings in bioinformatics 生物-生化研究方法

CiteScore

13.20

自引率

13.70%

发文量

549

审稿时长

6 months

期刊介绍： Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.