Feature Selection Based on Physicochemical Properties of Redefined N-term Region and C-term Regions for Predicting Disorder

2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. DOI: 10.1109/CIBCB.2005.1594927

Kana Shimizu, Y. Muraoka, S. Hirose, T. Noguchi


Citations: 5

Abstract

The prediction of intrinsic disorder from amino acid sequence has been gaining increasing attention because disordered regions have come to be known as important for protein function. The most common approach to predicting disorder is binary classification with machine learning. Because amino acid composition has different propensities in the N-term, C-term, and internal regions, prediction accuracy improves when the training data are divided into these three regions and each region is predicted separately. However, previous work has not discussed a concrete definition of the N-term and C-term regions and has used only a heuristic length from each terminus. Other previous work has shown that general physicochemical properties, rather than specific amino acids, are the important factors contributing to disorder, and that a reduced amino acid alphabet can maintain excellent precision in predicting disorder. In this paper, we redefine a suitable length and position for the N-term and C-term regions for predicting disorder. Moreover, we show that each region has different physicochemical properties that are important factors contributing to disorder. We also suggest a region-specific reduced set of amino acids, and a modified PSSM (position-specific scoring matrix) based on that set, for predicting disorder. We implemented our method and compared (1) our data division with the conventional division method and (2) our feature selection with the use of all physicochemical features, on the CASP6 benchmark, a PDB dataset, and DisProt. The results support the effectiveness of the new data separation method and indicate that each region has different physicochemical properties that are important factors for predicting protein disorder.
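The region-wise idea described in the abstract can be illustrated with a minimal sketch: split a sequence into N-term, internal, and C-term segments, then summarize each segment with a reduced amino acid alphabet. This is not the authors' implementation. The region lengths (`N_TERM_LEN`, `C_TERM_LEN`), the four-group alphabet in `REDUCED_GROUPS`, and the function names are all assumptions chosen for illustration; the abstract does not state the redefined lengths or the region-specific reduced sets.

```python
from collections import Counter

# Hypothetical region lengths; the abstract does not give the redefined values.
N_TERM_LEN = 10
C_TERM_LEN = 10

# Illustrative reduced alphabet based on coarse physicochemical classes.
# This grouping is an assumption, not the paper's region-specific reduced set.
REDUCED_GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGPH",
    "positive": "KR",
    "negative": "DE",
}
RESIDUE_TO_GROUP = {aa: g for g, aas in REDUCED_GROUPS.items() for aa in aas}


def split_regions(seq, n_len=N_TERM_LEN, c_len=C_TERM_LEN):
    """Split a sequence into (N-term, internal, C-term) segments."""
    if len(seq) < n_len + c_len:
        # Sequence too short for both terminal windows: split it in half
        # so the two terminal segments never overlap.
        mid = len(seq) // 2
        return seq[:mid], "", seq[mid:]
    c_start = len(seq) - c_len
    return seq[:n_len], seq[n_len:c_start], seq[c_start:]


def reduced_composition(segment):
    """Return the fraction of residues falling in each reduced-alphabet group."""
    counts = Counter(RESIDUE_TO_GROUP.get(aa, "other") for aa in segment)
    total = len(segment)
    groups = list(REDUCED_GROUPS) + ["other"]
    return {g: (counts[g] / total if total else 0.0) for g in groups}


def region_features(seq):
    """Compute one reduced-composition feature dict per region."""
    n_term, internal, c_term = split_regions(seq)
    return {
        "n_term": reduced_composition(n_term),
        "internal": reduced_composition(internal),
        "c_term": reduced_composition(c_term),
    }
```

In a setup like the one the abstract outlines, each region's feature vectors would feed a separate binary classifier, so that a different feature subset can be selected per region. The sketch stops at feature extraction and does not cover the PSSM modification.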


