Prediction of Solubility of Proteins in Escherichia coli Based on Functional and Structural Features Using Machine Learning Methods

IF 1.9 | Zone 4 (Biology) | JCR Q4 (Biochemistry & Molecular Biology)
Feiming Huang, Qian Gao, XianChao Zhou, Wei Guo, KaiYan Feng, Lin Zhu, Tao Huang, Yu-Dong Cai
The Protein Journal, vol. 43, no. 5, pp. 983-996. Published 2024-09-07. DOI: 10.1007/s10930-024-10230-z
https://link.springer.com/article/10.1007/s10930-024-10230-z
Citations: 0

Abstract

Protein solubility is a critical parameter that determines the stability, activity, and functionality of proteins, with broad implications in biotechnology and biochemistry. Accurate prediction and control of protein solubility are essential for successful protein expression and purification in research and industrial settings. This study gathered data on soluble and insoluble proteins. The proteins were mapped to STRING and characterized by functional and structural features, which were integrated into a 5768-dimensional binary vector encoding each protein. Seven feature-ranking algorithms were applied to these features, yielding seven ranked feature lists. Each list was then subjected to incremental feature selection with four classification algorithms to build effective classification models and to identify the features most important for classification. Several essential functional/structural features that differentiate soluble from insoluble proteins were identified, including GO:0009987 (intercellular communication) and GO:0022613 (ribonucleoprotein complex biogenesis). The best classification model, which used a support vector machine with 295 optimized functional/structural features, achieved an F1 score of 0.825 and can serve as a tool for differentiating soluble proteins from insoluble proteins.
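
The pipeline summarized in the abstract (binary feature encoding, a ranked feature list, incremental feature selection scored by F1) can be illustrated with a minimal sketch. This is not the authors' implementation. The feature ranking is assumed to be given rather than computed by any of the seven ranking algorithms, a leave-one-out 1-nearest-neighbour evaluator stands in for the four classification algorithms (including the support vector machine), and the feature names and toy data are invented purely for illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple

Vector = List[int]


def encode(annotations: Set[str], vocabulary: Sequence[str]) -> Vector:
    """Binary vector: 1 if the protein carries the feature, else 0."""
    return [1 if term in annotations else 0 for term in vocabulary]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (label 1); 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def loo_1nn_f1(X: Sequence[Vector], y: Sequence[int]) -> float:
    """Stand-in evaluator (NOT the paper's SVM): leave-one-out
    1-nearest-neighbour with Hamming distance, scored by F1."""
    preds = []
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum(a != b for a, b in zip(xi, X[j])),
        )
        preds.append(y[nearest])
    return f1_score(y, preds)


def incremental_feature_selection(
    ranked: Sequence[int],
    X: Sequence[Vector],
    y: Sequence[int],
    evaluate: Callable[[Sequence[Vector], Sequence[int]], float],
    step: int = 1,
) -> Tuple[int, float]:
    """Evaluate the top-k ranked features for k = step, 2*step, ...
    and return the (k, score) pair with the highest score."""
    best_k, best_score = 0, -1.0
    for k in range(step, len(ranked) + 1, step):
        cols = ranked[:k]
        X_k = [[row[c] for c in cols] for row in X]
        score = evaluate(X_k, y)
        if score > best_score:  # ties keep the smaller feature set
            best_k, best_score = k, score
    return best_k, best_score


# Invented toy data: feature "f0" separates the classes, the rest is noise.
vocabulary = ["f0", "f1", "f2"]
proteins = [{"f0", "f2"}, {"f0", "f1"}, {"f0"}, {"f2"}, {"f1"}, {"f1", "f2"}]
y = [1, 1, 1, 0, 0, 0]  # 1 = soluble, 0 = insoluble (hypothetical labels)
X = [encode(p, vocabulary) for p in proteins]
ranked = [0, 1, 2]  # assumed output of some feature-ranking algorithm

best_k, best_f1 = incremental_feature_selection(ranked, X, y, loo_1nn_f1)
print(best_k, best_f1)
```

On this toy input the search stops improving after the first ranked feature, so the smallest subset is kept. In a realistic setting the evaluator would be a cross-validated classifier and the step size would be larger than 1 to keep the number of trained models manageable.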

Source journal
The Protein Journal (Biology: Biochemistry and Molecular Biology)
CiteScore: 5.20
Self-citation rate: 0.00%
Articles published: 57
Review time: 12 months
Journal description: The Protein Journal (formerly the Journal of Protein Chemistry) publishes original research work on all aspects of proteins and peptides. These include studies concerned with covalent or three-dimensional structure determination (X-ray, NMR, cryoEM, EPR/ESR, optical methods, etc.), computational aspects of protein structure and function, protein folding and misfolding, assembly, genetics, evolution, proteomics, molecular biology, protein engineering, protein nanotechnology, protein purification and analysis and peptide synthesis, as well as the elucidation and interpretation of the molecular bases of biological activities of proteins and peptides. We accept original research papers, reviews, mini-reviews, hypotheses, opinion papers, and letters to the editor.