A protein fitness predictive framework based on feature combination and intelligent searching.

IF 4.5; Zone 3, Biology; JCR Q1, Biochemistry & Molecular Biology

Protein Science, vol. 33, no. 12, e5211. Pub Date: 2024-12-01. DOI: 10.1002/pro.5211

Zhihui Zhang, Zhixuan Li, Qianyue Wang, Hanlin Wu, Manli Yang, Fengguang Zhao, Mingkui Tan, Shuangyan Han


Citations: 0


A protein fitness predictive framework based on feature combination and intelligent searching.

Machine learning (ML) builds predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of protein sequences with high fitness values without becoming trapped in local optima, as directed evolution can. However, extracting the most pertinent functional feature information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination offers comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence-to-function mapping. Compared to similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models: ECNet, EVmutation, and UniRep. In addition, scut_ProFP generalizes from low-order mutants to high-order mutants. Finally, we used scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV and strongly enriched high-fluorescence mutants starting from only a small number of low-fluorescence mutants. Overall, the developed method benefits ML in protein engineering, providing an effective approach to data-driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
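To make the two ideas in the abstract concrete, the following is a minimal sketch of feature combination followed by feature selection. It is not the scut_ProFP implementation: the two encodings (one-hot and Kyte-Doolittle hydropathy), the correlation-based filter, the 1-nearest-neighbour predictor, all function names, and the toy sequences and fitness values are assumptions chosen only for illustration.

```python
import math

# Kyte-Doolittle hydropathy scale (a standard published scale). Using it as the
# second descriptor is an assumption of this sketch, not taken from scut_ProFP.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(HYDROPATHY)


def one_hot(seq):
    """Encoding 1: a 20-dimensional indicator vector per position."""
    return [1.0 if aa == ref else 0.0 for aa in seq for ref in AMINO_ACIDS]


def hydropathy(seq):
    """Encoding 2: one physicochemical value per position."""
    return [HYDROPATHY[aa] for aa in seq]


def combine(seq):
    """Feature combination: concatenate both encodings into one vector."""
    return one_hot(seq) + hydropathy(seq)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is (near) constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def select_features(X, y, k):
    """Feature selection: keep the k columns most correlated with fitness."""
    scores = [abs(pearson(col, y)) for col in zip(*X)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def predict(seq, train_seqs, train_y, keep):
    """1-nearest-neighbour prediction in the selected feature space."""
    def project(s):
        v = combine(s)
        return [v[j] for j in keep]

    q = project(seq)
    dists = [math.dist(q, project(s)) for s in train_seqs]
    return train_y[dists.index(min(dists))]


# Toy data invented for illustration only; not from the paper's datasets.
train_seqs = ["ACD", "ACE", "GCD", "GCE"]
train_y = [1.0, 0.9, 0.2, 0.1]

X = [combine(s) for s in train_seqs]
keep = select_features(X, train_y, k=3)
print("selected feature indices:", keep)
print("predicted fitness for AYD:", predict("AYD", train_seqs, train_y, keep))
```

In this toy set the first residue drives fitness, so the filter retains only features describing position 1, and an unseen variant is scored from its nearest training neighbour in that reduced space. A real pipeline would likely use richer encodings, a search-based selector, and a regression model rather than these stand-ins.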


Source journal

Protein Science (Biology: Biochemistry and Molecular Biology)

CiteScore: 12.40

Self-citation rate: 1.20%

Articles published: 246

Review time: 1 month

About the journal: Protein Science, the flagship journal of The Protein Society, is a publication that focuses on advancing fundamental knowledge in the field of protein molecules. The journal welcomes original reports and review articles that contribute to our understanding of protein function, structure, folding, design, and evolution. Additionally, Protein Science encourages papers that explore the applications of protein science in various areas such as therapeutics, protein-based biomaterials, bionanotechnology, synthetic biology, and bioelectronics. The journal accepts manuscript submissions in any suitable format for review, with the requirement of converting the manuscript to journal-style format only upon acceptance for publication. Protein Science is indexed and abstracted in numerous databases, including the Agricultural & Environmental Science Database (ProQuest), Biological Science Database (ProQuest), CAS: Chemical Abstracts Service (ACS), Embase (Elsevier), Health & Medical Collection (ProQuest), Health Research Premium Collection (ProQuest), Materials Science & Engineering Database (ProQuest), MEDLINE/PubMed (NLM), Natural Science Collection (ProQuest), and SciTech Premium Collection (ProQuest).