A Three-Module Machine Learning Framework for Protein Sequence- and Temperature-Dependent kcat/Km Prediction in β-Glucosidases.

Impact factor 3.9 · Zone 2 (Biology) · JCR Q1 (Biochemical Research Methods)
Mehmet Emre Erkanli, Yunseok Jang, Ali Malli, Khalid El-Halabi, Chaehyun Ryu, Jin Ryoun Kim
ACS Synthetic Biology, published 2025-10-02. DOI: 10.1021/acssynbio.5c00257 (https://doi.org/10.1021/acssynbio.5c00257)
Citations: 0

Abstract

The catalytic activity of enzymes is intricately determined by their amino acid sequences and assay conditions, particularly temperature. Navigating the complex interplay among sequence, temperature, and catalytic function is crucial for unlocking a multitude of enzyme applications. Machine learning has recently emerged as a tool for quantitative prediction of enzyme activity from protein sequences. Unfortunately, ML models designed to predict the comprehensive enzyme activity parameter, kcat/Km, from protein sequences are rare compared to those predicting kcat or Km alone. Combining both protein sequence and temperature as input features further challenges predictions; no current ML models capture the nonlinear relationship between kcat/Km and temperature for a protein sequence of interest. In this study, we developed a unique three-module ML framework that predicts β-glucosidase kcat/Km values based on protein sequence and temperature. Each module was designed to capture a distinct aspect of the interplay among protein sequence, temperature, and kcat/Km for β-glucosidase activity; when integrated, they formed an ML framework that maps the sequence and temperature spaces associated with β-glucosidase kcat/Km. This modular approach allowed for optimizations of ML models within each module, collectively achieving notable generalization performance when predicting temperature-dependent kcat/Km values for protein sequences not encountered during training. Our findings underscore the advantages of the three-module framework over traditional single-module methods, particularly by reducing prediction variability due to data splitting and mitigating overfitting. We anticipate that our multimodule ML framework will be directly applicable to other complex systems, enabling quantitative exploration of their property domains.
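
The abstract evaluates generalization on protein sequences not encountered during training and notes that results can vary with how the data are split. The sketch below shows one common way to build such a split: all temperature points measured for a sequence are kept together, so no sequence appears in both sets. It is a minimal illustration, not the authors' procedure. The function name `split_by_sequence`, the record layout `(sequence_id, temperature, value)`, and the toy data are all assumptions made for this example.

```python
import random
from collections import defaultdict


def split_by_sequence(records, test_fraction=0.2, seed=0):
    """Split records so that each sequence lands entirely in train or in test.

    records: iterable of (sequence_id, temperature, value) tuples
             (a hypothetical layout chosen for this sketch).
    """
    # Group every measurement under the sequence it belongs to.
    by_seq = defaultdict(list)
    for rec in records:
        by_seq[rec[0]].append(rec)

    # Shuffle sequence identifiers, not individual measurements.
    seq_ids = sorted(by_seq)
    random.Random(seed).shuffle(seq_ids)

    # Hold out a fraction of the sequences (at least one).
    n_test = max(1, round(len(seq_ids) * test_fraction))
    test_ids = set(seq_ids[:n_test])

    train = [r for s in seq_ids if s not in test_ids for r in by_seq[s]]
    test = [r for s in seq_ids if s in test_ids for r in by_seq[s]]
    return train, test


# Hypothetical toy data: 5 sequences, each listed at 3 temperatures (deg C).
# The 0.0 values are placeholders, not measurements.
records = [(f"seq{i}", t, 0.0) for i in range(5) for t in (30, 40, 50)]
train, test = split_by_sequence(records, test_fraction=0.2, seed=0)
```

Repeating the split with different `seed` values and comparing test scores is one simple way to see how much a model's reported performance depends on the particular split.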

Source journal
CiteScore: 8.00
Self-citation rate: 10.60%
Articles published: 380
Review time: 6-12 weeks
Journal description: The journal is particularly interested in studies on the design and synthesis of new genetic circuits and gene products; computational methods in the design of systems; and integrative applied approaches to understanding disease and metabolism. Topics may include, but are not limited to: Design and optimization of genetic systems; Genetic circuit design and their principles for their organization into programs; Computational methods to aid the design of genetic systems; Experimental methods to quantify genetic parts, circuits, and metabolic fluxes; Genetic parts libraries: their creation, analysis, and ontological representation; Protein engineering including computational design; Metabolic engineering and cellular manufacturing, including biomass conversion; Natural product access, engineering, and production; Creative and innovative applications of cellular programming; Medical applications, tissue engineering, and the programming of therapeutic cells; Minimal cell design and construction; Genomics and genome replacement strategies; Viral engineering; Automated and robotic assembly platforms for synthetic biology; DNA synthesis methodologies; Metagenomics and synthetic metagenomic analysis; Bioinformatics applied to gene discovery, chemoinformatics, and pathway construction; Gene optimization; Methods for genome-scale measurements of transcription and metabolomics; Systems biology and methods to integrate multiple data sources; In vitro and cell-free synthetic biology and molecular programming; Nucleic acid engineering.