A Three-Module Machine Learning Framework for Protein Sequence- and Temperature-Dependent k_cat/K_m Prediction in β-Glucosidases.

IF 3.9; Zone 2, Biology; JCR Q1, Biochemical Research Methods

ACS Synthetic Biology Pub Date : 2025-10-02 DOI:10.1021/acssynbio.5c00257

Mehmet Emre Erkanli, Yunseok Jang, Ali Malli, Khalid El-Halabi, Chaehyun Ryu, Jin Ryoun Kim


Citation count: 0

Abstract

The catalytic activity of enzymes is intricately determined by their amino acid sequences and assay conditions, particularly temperature. Navigating the complex interplay among sequence, temperature, and catalytic function is crucial for unlocking a multitude of enzyme applications. Machine learning has recently emerged as a tool for quantitative prediction of enzyme activity from protein sequences. Unfortunately, ML models designed to predict the comprehensive enzyme activity parameter, k_cat/K_m, from protein sequences are rare compared to those predicting k_cat or K_m alone. Combining both protein sequence and temperature as input features further challenges predictions; no current ML models capture the nonlinear relationship between k_cat/K_m and temperature for a protein sequence of interest. In this study, we developed a unique three-module ML framework that predicts β-glucosidase k_cat/K_m values based on protein sequence and temperature. Each module was designed to capture a distinct aspect of the interplay among protein sequence, temperature, and k_cat/K_m for β-glucosidase activity; when integrated, they formed an ML framework that maps the sequence and temperature spaces associated with β-glucosidase k_cat/K_m. This modular approach allowed for optimizations of ML models within each module, collectively achieving notable generalization performance when predicting temperature-dependent k_cat/K_m values for protein sequences not encountered during training. Our findings underscore the advantages of the three-module framework over traditional single-module methods, particularly by reducing prediction variability due to data splitting and mitigating overfitting. We anticipate that our multimodule ML framework will be directly applicable to other complex systems, enabling quantitative exploration of their property domains.
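The abstract's two evaluation ideas, testing on protein sequences not seen during training and measuring how much performance varies with the data split, can be illustrated with a minimal sketch. This is not the authors' pipeline: the record fields (`seq_id`, `temp_c`, `log_kcat_km`), the mean-value baseline model, and the toy data are all assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def split_by_sequence(records, test_fraction=0.2, seed=0):
    """Hold out whole sequences so that no test sequence appears in training."""
    seq_ids = sorted({r["seq_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(seq_ids)
    n_test = max(1, round(len(seq_ids) * test_fraction))
    test_ids = set(seq_ids[:n_test])
    train = [r for r in records if r["seq_id"] not in test_ids]
    test = [r for r in records if r["seq_id"] in test_ids]
    return train, test


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred)) ** 0.5


def baseline_fit(train):
    """Hypothetical placeholder model: always predict the training-set mean."""
    mu = mean(r["log_kcat_km"] for r in train)
    return lambda record: mu


def split_variability(records, n_splits=10):
    """Mean and standard deviation of test RMSE over repeated random splits."""
    scores = []
    for seed in range(n_splits):
        train, test = split_by_sequence(records, seed=seed)
        model = baseline_fit(train)
        y_true = [r["log_kcat_km"] for r in test]
        y_pred = [model(r) for r in test]
        scores.append(rmse(y_true, y_pred))
    return mean(scores), pstdev(scores)


# Toy, made-up data: 10 hypothetical sequences, each measured at 5 temperatures.
records = [
    {
        "seq_id": f"bgl{i}",
        "temp_c": t,
        "log_kcat_km": 2.0 + 0.1 * i - 0.001 * (t - 50) ** 2,
    }
    for i in range(10)
    for t in (30, 40, 50, 60, 70)
]

if __name__ == "__main__":
    avg, spread = split_variability(records)
    print(f"test RMSE over splits: mean={avg:.3f}, std={spread:.3f}")
```

Splitting by sequence identifier rather than by individual measurement keeps all temperature points of one enzyme on the same side of the split, which is what "sequences not encountered during training" requires. The spread of the score across seeds is one simple way to quantify variability due to data splitting.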


Source journal: ACS Synthetic Biology
CiteScore: 8.00
Self-citation rate: 10.60%
Articles published: 380
Review time: 6-12 weeks

Journal description: The journal is particularly interested in studies on the design and synthesis of new genetic circuits and gene products; computational methods in the design of systems; and integrative applied approaches to understanding disease and metabolism. Topics may include, but are not limited to: design and optimization of genetic systems; genetic circuit design and the principles for their organization into programs; computational methods to aid the design of genetic systems; experimental methods to quantify genetic parts, circuits, and metabolic fluxes; genetic parts libraries: their creation, analysis, and ontological representation; protein engineering, including computational design; metabolic engineering and cellular manufacturing, including biomass conversion; natural product access, engineering, and production; creative and innovative applications of cellular programming; medical applications, tissue engineering, and the programming of therapeutic cells; minimal cell design and construction; genomics and genome replacement strategies; viral engineering; automated and robotic assembly platforms for synthetic biology; DNA synthesis methodologies; metagenomics and synthetic metagenomic analysis; bioinformatics applied to gene discovery, chemoinformatics, and pathway construction; gene optimization; methods for genome-scale measurements of transcription and metabolomics; systems biology and methods to integrate multiple data sources; in vitro and cell-free synthetic biology and molecular programming; nucleic acid engineering.