Prediction of Protein Solubility using Primary Structure Compositional Features: A Machine Learning Perspective

Journal of proteomics & bioinformatics | Pub Date: 2017-12-29 | DOI: 10.4172/JPB.1000458

N. Rasool, Waqar Hussain, S. Mahmood

Volume: 10 1 | Pages: 324-328

Citations: 9

Abstract

Obtaining sufficient concentrations of soluble protein is a recurring limiting factor in in vitro methodologies. Solubility is an independent characteristic of a protein that, under specific experimental conditions, can be determined from its amino acid composition. This study predicts protein solubility with machine learning approaches that use primary structure information. The features comprise amino acid compositional features and physicochemical properties of the amino acids, namely canonical value, hydrophobicity, solubility index and solubility score. All four features were calculated for a dataset of 6372 protein sequences (4850 soluble and 1522 insoluble). From the calculated values, four prediction models were developed, based on Multilayer Perceptron (MLP), Random Forest (RF), Decision Tree (DT) and Naive Bayes Classifier (NBC). Performance was evaluated using the Matthews correlation coefficient (MCC), F-measure, accuracy, precision and recall. Of the four models, MLP was the most accurate, with an accuracy of 95.92%, followed by RF and NBC. The proposed MLP-based model can be used to predict protein solubility as a preliminary step before experimental work. The method is resource- and time-efficient and can reduce the need for laborious experimental solubility determination.
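To make the feature and evaluation terms in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `amino_acid_composition` and `binary_metrics` are hypothetical, the composition feature is assumed to be the plain fraction of each of the 20 standard residues, and the metrics use their standard confusion-matrix definitions. The example counts describe a hypothetical baseline that labels every sequence in the 6372-sequence dataset as soluble; they are not results from the paper.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    # Non-standard symbols (e.g. X) are ignored in the total.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, F-measure and MCC from counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # MCC is conventionally set to 0 when any marginal count is zero.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical baseline: call all 6372 sequences "soluble"
    # (4850 true positives, 1522 false positives).
    baseline = binary_metrics(tp=4850, fp=1522, tn=0, fn=0)
    print(baseline)
```

Because the dataset is imbalanced, this all-soluble baseline already reaches roughly 76% accuracy while its MCC is 0, which is one reason to read MCC alongside accuracy when comparing classifiers.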
