Deep and Self-Taught Learning for Protein Accessible Surface Area Prediction

2017 International Conference on Frontiers of Information Technology (FIT). Pub Date: 2017-01-18. DOI: 10.1109/FIT.2017.00054

F. Hassan, F. Minhas



Abstract

Accessible surface area (ASA) captures the degree of burial or surface accessibility of a protein residue. It is also an important indicator of the behavior of amino acids within a protein and can be used to identify protein interactions, interfaces, and folding states. Calculating ASA requires the structure of the protein, but structure determination is expensive and requires significant technical effort. Consequently, the prediction of ASA is a fundamental problem in bioinformatics and proteomics. In this work, we investigate self-taught machine learning methods together with deep neural networks for predicting the residue-level ASA of a protein. We find that deep neural networks can predict the ASA of the residues in a protein accurately. Furthermore, the proposed deep learning method does not require computationally demanding features such as the position-specific scoring matrix (PSSM) used in previous work: a simple BLOSUM62-based, position-dependent representation of the amino acids in a sequence window gives comparable performance. This is particularly attractive for proteome-wide prediction of ASA. We use several self-taught learning schemes to obtain an optimal feature representation from unlabeled data, including a sparse, regularized autoencoder neural network and a dictionary-based learning scheme, with unlabeled data drawn from the protein universe in an attempt to improve the feature representation. We also evaluate the performance of a stochastic gradient based predictor of ASA for different feature representations.
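The position-dependent window representation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `window_features`, the zero-padding at chain ends, the handling of unknown residues, and the window size are all assumptions made for illustration. The placeholder matrix below uses identity (one-hot) rows so the example stays self-contained; in practice each row would be the residue's BLOSUM62 row, which can be obtained from a substitution-matrix source of your choice.

```python
from typing import Dict, List, Sequence

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def window_features(
    sequence: str,
    row_for: Dict[str, Sequence[float]],
    half_window: int,
) -> List[List[float]]:
    """Build one feature vector per residue by concatenating the
    substitution-matrix rows of all residues in a centered window.

    Positions that fall outside the chain, and residues missing from
    `row_for`, are encoded as all-zero rows (an assumption of this sketch).
    """
    row_len = len(next(iter(row_for.values())))
    padding = [0.0] * row_len
    features: List[List[float]] = []
    for center in range(len(sequence)):
        vector: List[float] = []
        for pos in range(center - half_window, center + half_window + 1):
            if 0 <= pos < len(sequence):
                vector.extend(row_for.get(sequence[pos], padding))
            else:
                vector.extend(padding)
        features.append(vector)
    return features


# Placeholder rows (identity matrix, i.e. one-hot). These are NOT BLOSUM62
# values; substitute real BLOSUM62 rows to reproduce the representation
# described in the abstract.
toy_rows = {
    aa: [1.0 if i == j else 0.0 for j in range(len(AMINO_ACIDS))]
    for i, aa in enumerate(AMINO_ACIDS)
}

feats = window_features("ACDA", toy_rows, half_window=1)
print(len(feats), len(feats[0]))  # one vector per residue, 3 * 20 values each
```

Each residue's vector keeps the window positions in a fixed order, so the same amino acid contributes to different feature slots depending on its offset from the central residue. The resulting vectors could then be passed to any regressor that predicts a per-residue ASA value.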
