A protein fitness predictive framework based on feature combination and intelligent searching.

IF 4.5 · Zone 3 (Biology) · JCR Q1 (Biochemistry & Molecular Biology)
Protein Science Pub Date : 2024-12-01 DOI:10.1002/pro.5211
Zhihui Zhang, Zhixuan Li, Qianyue Wang, Hanlin Wu, Manli Yang, Fengguang Zhao, Mingkui Tan, Shuangyan Han
Protein Science, vol. 33, no. 12, e5211. Journal Article. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11567853/pdf/

Abstract

Machine learning (ML) builds predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of sequences with high fitness values without becoming trapped in local optima, as directed evolution can. However, extracting the most pertinent functional feature information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination provides comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence-to-function mapping. Compared with similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models (ECNet, EVmutation, and UniRep). In addition, scut_ProFP generalizes from low-order mutants to high-order mutants. Finally, we used scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV and obtained high enrichment of highly fluorescent mutants starting from only a small number of low-fluorescence mutants. Overall, the method benefits ML in protein engineering and provides an effective approach to data-driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
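
The two ideas named in the abstract, feature combination and feature selection, can be illustrated with a minimal, standard-library Python sketch. It is not the scut_ProFP implementation: the two encodings (one-hot and Kyte-Doolittle hydropathy), the correlation-based ranking used in place of the framework's search procedure, the function names, and the toy sequences and fitness values are all assumptions made for illustration only.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used here as a second, physicochemical encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def one_hot(seq):
    """Encode each residue as a 20-dimensional indicator vector."""
    vec = []
    for aa in seq:
        vec.extend(1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS)
    return vec


def hydropathy(seq):
    """Encode each residue by its hydropathy value."""
    return [HYDROPATHY[aa] for aa in seq]


def combine_features(seq):
    """Feature combination: concatenate the two encodings into one vector."""
    return one_hot(seq) + hydropathy(seq)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_top_k(X, y, k):
    """Feature selection (simple filter): keep the k columns whose absolute
    correlation with fitness is largest. A stand-in for a real search."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: fitness is driven mainly by the third residue (D vs. L).
seqs = ["ACDE", "ACLE", "GCLE", "GCDE", "ACLK", "GCDK"]
fitness = [0.1, 0.9, 1.0, 0.2, 0.8, 0.1]

X = [combine_features(s) for s in seqs]
selected = select_top_k(X, fitness, k=3)
X_selected = [[row[j] for j in selected] for row in X]
```

In this toy case the three retained columns all describe the third residue (its D and L indicators and its hydropathy value), which is the kind of pruning a selection step is meant to achieve before a regressor is fitted on `X_selected`. A filter ranking like this ignores redundancy between features, so a wrapper or search-based selector would be the more realistic choice in practice.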

Source journal: Protein Science (Biology: Biochemistry & Molecular Biology)
CiteScore: 12.40
Self-citation rate: 1.20%
Articles published: 246
Review time: 1 month
About the journal: Protein Science, the flagship journal of The Protein Society, is a publication that focuses on advancing fundamental knowledge in the field of protein molecules. The journal welcomes original reports and review articles that contribute to our understanding of protein function, structure, folding, design, and evolution. Additionally, Protein Science encourages papers that explore the applications of protein science in various areas such as therapeutics, protein-based biomaterials, bionanotechnology, synthetic biology, and bioelectronics. The journal accepts manuscript submissions in any suitable format for review, with the requirement of converting the manuscript to journal-style format only upon acceptance for publication. Protein Science is indexed and abstracted in numerous databases, including the Agricultural & Environmental Science Database (ProQuest), Biological Science Database (ProQuest), CAS: Chemical Abstracts Service (ACS), Embase (Elsevier), Health & Medical Collection (ProQuest), Health Research Premium Collection (ProQuest), Materials Science & Engineering Database (ProQuest), MEDLINE/PubMed (NLM), Natural Science Collection (ProQuest), and SciTech Premium Collection (ProQuest).