Prediction of protein solubility in E. coli

2012 IEEE 8th International Conference on E-Science Pub Date : 2012-10-08 DOI:10.1109/eScience.2012.6404416

T. Samak, D. Gunter, Zhong Wang


Citations: 12

Abstract

Gene synthesis is a key step in converting digitally predicted proteins into functional proteins. However, it is a relatively expensive and labor-intensive process. About 30-50% of synthesized proteins are not soluble, which further reduces the efficacy of gene synthesis as a method for protein function characterization. Solubility prediction from primary protein sequences promises to dramatically reduce the cost of gene synthesis. This work presents a framework that creates models of solubility from sequence information. From the primary protein sequences of the genes to be synthesized, sequence features can be used to build computational models of solubility. This way, biologists can focus their effort on synthesizing genes that are highly likely to yield soluble proteins. We have developed a framework that employs several machine learning algorithms to model protein solubility. The framework is used to predict protein solubility in the Escherichia coli expression system. The analysis is performed on over 1,600 quantified proteins. The approach predicted solubility with more than 80% accuracy and enabled in-depth analysis of the most important features affecting solubility. The analysis pipeline is general and can be applied to any set of sequence features to predict any binary measure. The framework also provides the biologist with a comprehensive comparison of different learning algorithms and an insightful feature analysis.
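The pipeline described in the abstract (sequence features in, a binary solubility label out, several learning algorithms compared side by side) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not name the features or algorithms used, so amino-acid composition, a hand-written logistic regression, and a nearest-centroid classifier are stand-ins, and the sequences and labels are synthetic toy data rather than measurements from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(seq):
    """Fraction of each standard amino acid in a protein sequence."""
    counts = Counter(seq.upper())
    n = max(len(seq), 1)
    return [counts[a] / n for a in AMINO_ACIDS]


def train_logistic(X, y, epochs=500, lr=0.5):
    """Logistic regression fitted by batch gradient descent; returns a predictor."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


def train_centroid(X, y):
    """Nearest-centroid classifier; returns a predictor."""
    centroids = {}
    for label in set(y):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        # Pick the class whose mean feature vector is closest to x.
        return min(
            centroids,
            key=lambda c: sum((a - v) ** 2 for a, v in zip(centroids[c], x)),
        )

    return predict


def accuracy(predict, X, y):
    """Share of examples for which the predictor matches the label."""
    return sum(predict(xi) == yi for xi, yi in zip(X, y)) / len(y)


# Synthetic toy data (NOT from the paper): 1 = soluble, 0 = insoluble.
TRAIN = [("DEKDEKDEKR", 1), ("KKEEDDRRKE", 1), ("LLVVIIFFLL", 0), ("VILFVILFWW", 0)]
TEST = [("EEDKRKDEEK", 1), ("FLIVVLWFIL", 0)]

X_train = [composition_features(s) for s, _ in TRAIN]
y_train = [label for _, label in TRAIN]
X_test = [composition_features(s) for s, _ in TEST]
y_test = [label for _, label in TEST]

# Train every algorithm on the same features and compare held-out accuracy.
TRAINERS = {"logistic": train_logistic, "centroid": train_centroid}
results = {
    name: accuracy(trainer(X_train, y_train), X_test, y_test)
    for name, trainer in TRAINERS.items()
}
print(results)
```

Because each trainer takes the same feature matrix and returns a plain predictor function, swapping in a different feature set or another binary label only changes the inputs, which mirrors the generality the abstract claims for the pipeline. A real study would use cross-validation over a far larger labeled set than this toy example.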

