Prediction of Protein Solubility using Primary Structure Compositional Features: A Machine Learning Perspective

N. Rasool, Waqar Hussain, S. Mahmood
Journal of Proteomics & Bioinformatics, vol. 10, pp. 324-328. Published 2017-12-29.
DOI: 10.4172/JPB.1000458 (https://doi.org/10.4172/JPB.1000458)
Citations: 9

Abstract

Obtaining sufficient concentrations of soluble protein with in vitro methods is a recurring limiting factor. Solubility is an independent characteristic of a protein that can be determined from its amino acid composition under specific experimental conditions. The present study predicts protein solubility by applying machine learning approaches to primary structure information. The features comprise amino acid compositional features and physicochemical properties of the amino acids, i.e. canonical value, hydrophobicity, solubility index and solubility score. All four features were calculated for a dataset of 6372 protein sequences (4850 soluble and 1522 insoluble). From the calculated values, four prediction models were developed, based on Multilayer Perceptron (MLP), Random Forest (RF), Decision Tree (DT) and Naive Bayes Classifier (NBC). Performance was evaluated using MCC, F-measure, accuracy, precision and recall. Of the four models, MLP was the most accurate for predicting protein solubility, with an accuracy of 95.92%, followed by RF and NBC. The proposed MLP-based model can be used to predict protein solubility as a preprocessing step ahead of experimental work. The method is resource- and time-efficient and can help predict protein solubility in place of laborious experimental work.
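The abstract names amino acid composition as an input feature and MCC, F-measure, accuracy, precision and recall as evaluation measures. The sketch below illustrates both with standard textbook definitions. It is not the authors' code: the function names `amino_acid_composition` and `binary_metrics` are hypothetical, the example sequence is made up, and the other features mentioned (canonical value, hydrophobicity, solubility index, solubility score) are not defined in the abstract and are therefore left out.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in a protein sequence.

    Hypothetical helper: non-standard symbols are ignored (an assumption).
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure and MCC from confusion counts.

    Ratios with a zero denominator are reported as 0.0 (a common convention,
    assumed here).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion counts are all zero")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Composition of a made-up 8-residue sequence (illustration only).
comp = amino_acid_composition("MKTAYIAK")

# Majority-class baseline on the class sizes given in the abstract:
# label every one of the 6372 sequences "soluble".
baseline = binary_metrics(tp=4850, fp=1522, tn=0, fn=0)
```

With the class sizes from the abstract, the always-soluble baseline already reaches about 76% accuracy (4850/6372) while its MCC is 0. That gap is one reason to report MCC next to accuracy on an imbalanced dataset like this one.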