A Three-Module Machine Learning Framework for Protein Sequence- and Temperature-Dependent kcat/Km Prediction in β-Glucosidases
Mehmet Emre Erkanli, Yunseok Jang, Ali Malli, Khalid El-Halabi, Chaehyun Ryu, Jin Ryoun Kim
ACS Synthetic Biology (IF 3.9, JCR Q1), published 2025-10-02. DOI: 10.1021/acssynbio.5c00257
Abstract
The catalytic activity of enzymes is intricately determined by their amino acid sequences and assay conditions, particularly temperature. Navigating the complex interplay among sequence, temperature, and catalytic function is crucial for unlocking a multitude of enzyme applications. Machine learning has recently emerged as a tool for quantitative prediction of enzyme activity from protein sequences. Unfortunately, ML models designed to predict the comprehensive enzyme activity parameter, kcat/Km, from protein sequences are rare compared to those predicting kcat or Km alone. Combining both protein sequence and temperature as input features further challenges predictions; no current ML models capture the nonlinear relationship between kcat/Km and temperature for a protein sequence of interest. In this study, we developed a unique three-module ML framework that predicts β-glucosidase kcat/Km values based on protein sequence and temperature. Each module was designed to capture a distinct aspect of the interplay among protein sequence, temperature, and kcat/Km for β-glucosidase activity; when integrated, they formed an ML framework that maps the sequence and temperature spaces associated with β-glucosidase kcat/Km. This modular approach allowed for optimizations of ML models within each module, collectively achieving notable generalization performance when predicting temperature-dependent kcat/Km values for protein sequences not encountered during training. Our findings underscore the advantages of the three-module framework over traditional single-module methods, particularly by reducing prediction variability due to data splitting and mitigating overfitting. We anticipate that our multimodule ML framework will be directly applicable to other complex systems, enabling quantitative exploration of their property domains.
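The abstract refers to two evaluation ideas: testing on protein sequences not encountered during training, and prediction variability caused by data splitting. The sketch below illustrates both in generic form. It is not the authors' framework or any of its three modules; the data are synthetic, the baseline model is a deliberately trivial per-temperature mean, and every name (`grouped_split`, `split_variability`, `log_kcat_km`, and so on) is a hypothetical chosen for illustration.

```python
import random
import statistics


def grouped_split(records, test_fraction, seed):
    """Split records so each sequence ID lands wholly in train or in test."""
    seq_ids = sorted({r["seq_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(seq_ids)
    n_test = max(1, round(len(seq_ids) * test_fraction))
    test_ids = set(seq_ids[:n_test])
    train = [r for r in records if r["seq_id"] not in test_ids]
    test = [r for r in records if r["seq_id"] in test_ids]
    return train, test


def fit_temperature_means(train):
    """Toy baseline (an assumption, not the paper's model):
    mean target value per temperature, ignoring the sequence."""
    by_temp = {}
    for r in train:
        by_temp.setdefault(r["temp_c"], []).append(r["log_kcat_km"])
    return {t: statistics.fmean(v) for t, v in by_temp.items()}


def rmse(model, test):
    """Root-mean-square error of the per-temperature baseline on test records."""
    errors = [(model[r["temp_c"]] - r["log_kcat_km"]) ** 2 for r in test]
    return statistics.fmean(errors) ** 0.5


def make_synthetic_records(n_sequences=20, temps=(30, 40, 50, 60), seed=0):
    """Synthetic stand-in data: each hypothetical sequence gets an activity
    offset and a temperature optimum, giving a nonlinear temperature profile."""
    rng = random.Random(seed)
    records = []
    for i in range(n_sequences):
        offset = rng.gauss(0.0, 0.5)
        optimum = rng.choice(temps)
        for t in temps:
            value = 3.0 + offset - 0.002 * (t - optimum) ** 2 + rng.gauss(0.0, 0.05)
            records.append({"seq_id": f"seq{i}", "temp_c": t, "log_kcat_km": value})
    return records


def split_variability(records, n_repeats=10, test_fraction=0.2):
    """Repeat the sequence-grouped split and report mean and spread of RMSE."""
    scores = []
    for seed in range(n_repeats):
        train, test = grouped_split(records, test_fraction, seed)
        scores.append(rmse(fit_temperature_means(train), test))
    return statistics.fmean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean_rmse, sd_rmse = split_variability(make_synthetic_records())
    print(f"RMSE over splits: {mean_rmse:.3f} +/- {sd_rmse:.3f}")
```

Grouping by sequence ID keeps every temperature measurement of a held-out sequence out of the training set, so the score reflects generalization to unseen sequences. The standard deviation across repeated splits is one simple way to quantify how much a reported score depends on the particular split.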
About the journal:
The journal is particularly interested in studies on the design and synthesis of new genetic circuits and gene products; computational methods in the design of systems; and integrative applied approaches to understanding disease and metabolism.
Topics may include, but are not limited to:
Design and optimization of genetic systems
Genetic circuit design and the principles for their organization into programs
Computational methods to aid the design of genetic systems
Experimental methods to quantify genetic parts, circuits, and metabolic fluxes
Genetic parts libraries: their creation, analysis, and ontological representation
Protein engineering including computational design
Metabolic engineering and cellular manufacturing, including biomass conversion
Natural product access, engineering, and production
Creative and innovative applications of cellular programming
Medical applications, tissue engineering, and the programming of therapeutic cells
Minimal cell design and construction
Genomics and genome replacement strategies
Viral engineering
Automated and robotic assembly platforms for synthetic biology
DNA synthesis methodologies
Metagenomics and synthetic metagenomic analysis
Bioinformatics applied to gene discovery, chemoinformatics, and pathway construction
Gene optimization
Methods for genome-scale measurements of transcription and metabolomics
Systems biology and methods to integrate multiple data sources
In vitro and cell-free synthetic biology and molecular programming
Nucleic acid engineering.