{"title":"Feature Selection Based on Physicochemical Properties of Redefined N-term Region and C-term Regions for Predicting Disorder","authors":"Kana Shimizu, Y. Muraoka, S. Hirose, T. Noguchi","doi":"10.1109/CIBCB.2005.1594927","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594927","url":null,"abstract":"The prediction of intrinsic disorder from amino acid sequence has been gaining increasing attention because disordered regions have come to be known as important for protein function. The most common way of predicting disorder is based on binary classification with machine learning. Since amino acid composition has different propensities in the N-term, C-term, and internal regions, prediction accuracy increases when the training data are divided into these three regions and each is predicted separately. However, previous work has lacked discussion about a concrete definition of the N-term and C-term regions, and has only used a heuristic length from the terminus. Other previous work has shown that general physicochemical properties rather than specific amino acids are important factors contributing to disorder, and a reduced amino acid alphabet can maintain excellent precision in predicting disorder. In this paper, we redefine a suitable length and position for the N-term and C-term regions for predicting disorder. Moreover, we show that each region has different physicochemical properties, which are important factors contributing to disorder. We also suggest a region-specific reduced set of amino acids and a modified PSSM based on it for predicting disorder. We implemented our method and (1) compared it with the conventional division method and (2) compared our feature selection with the use of all physicochemical features, on the CASP6 benchmark, a PDB dataset, and DisProt. The results support the effectiveness of the new data separation method and indicate that each region has different physicochemical properties that are important factors for predicting protein disorder.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131372335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Neural Network for Predicting Protein Disorder using Amino Acid Hydropathy Values","authors":"Deborah Stoffer, L. Volkert","doi":"10.1109/CIBCB.2005.1594958","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594958","url":null,"abstract":"Proteins have been discovered to contain ordered regions and disordered regions, where ordered regions have a defined three-dimensional (3D) structure and disordered regions do not. While in the past it was believed that proteins only function in a defined 3D structure, proteins with disordered regions have been discovered to have at least 28 distinct functions. It is now important to be able to determine the ordered and disordered regions in proteins. Several experimental techniques such as X-ray crystallography, NMR spectroscopy, circular dichroism, protease digestion, and Stokes radius determination, along with several computational techniques such as artificial neural networks (ANNs), support vector machines (SVMs), logistic regression, and discriminant analysis, have so far been used to detect disordered proteins. Past research has shown that ANNs and amino acid properties are effective tools for predicting protein disorder. This research uses a feed-forward neural network implemented using JavaNNS and the hydropathy values of amino acids to predict protein disorder. The results show that hydropathy is an important amino acid property for disorder.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127224908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretable Prediction of Protein Stability Changes upon Mutation by Using Decision Tree","authors":"Larry Huang, Wen-Lin Huang, Shinn-Ying Ho, Shiow-Fen Hwang","doi":"10.1109/CIBCB.2005.1594963","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594963","url":null,"abstract":"For protein stability changes upon mutation, an accurate predictor with linguistic interpretability is beneficial to protein design. Traditional analysis based on linear correlation between predicted and experimental data reveals their primitive relationships. Recently, some machine learning techniques such as artificial neural network (ANN)-based methods were applied to find an accurate predictor. However, the ANN-predictor without interpretability is insufficient in knowledge discovery. This paper proposes an interpretable predictor using a rule-based decision tree method (named iPTREE) for accurately predicting protein stability changes upon single point mutations. Besides being a sign predictor, iPTREE can be used both as a model for verifying attribute effects and as a rule miner in the study of protein stability changes. iPTREE relies on features including mutation type (deleted and introduced residues), the relative solvent accessibility value (RSA), the experimental conditions (pH and temperature) and the local spatial environment. To evaluate the performance of iPTREE, a thermodynamic dataset consisting of 1615 mutations generated from ProTherm is used. The computer simulation shows that iPTREE predicts the direction of stability changes with an accuracy as high as 87%, which is significantly better than the ANN-predictor using the same features.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126513820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of Functional RNA Genes Using Evolved Neural Networks","authors":"Mars Cheung, G. Fogel","doi":"10.1109/CIBCB.2005.1594895","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594895","url":null,"abstract":"Functional RNAs (fRNAs) play a key role in gene regulation, at both the transcriptional and translational levels. Identification of fRNA genes can be difficult, given that some classes of fRNAs (especially microRNAs) have short coding regions and do not use classical signals common to protein coding genes. This paper presents an approach to identify fRNA genes using evolved neural networks to discriminate between noncoding regions of genomes and regions that are likely to be fRNA coding. The results indicate that for human and C. elegans this approach can be used with considerable success.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116649222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid PCA and LDA Analysis of Microarray Gene Expression Data","authors":"Yijuan Lu, Q. Tian, Maribel Sanchez, Yufeng Wang","doi":"10.1109/CIBCB.2005.1594942","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594942","url":null,"abstract":"Microarray technology offers a high throughput means to study expression networks and gene regulatory networks in cells. The intrinsic nature of high dimensionality and small sample size in microarray data calls for the development of effective computational methods. In this paper, we propose a novel hybrid dimension reduction technique for classification - hybrid PCA (principal component analysis) and LDA (linear discriminant analysis) analysis. This technique effectively solves the singular scatter matrix problem caused by small training samples and increases the effective dimension of the projected subspace. It offers more flexibility and a richer set of alternatives to LDA and PCA in the parametric space. In addition, generalization of hybrid analysis of other dimension reduction techniques is also proposed in this paper, such as multiple discriminant analysis (MDA) and biased discriminant analysis (BDA). Extensive experiments on the yeast cell cycle regulation data set show the superior performance of the hybrid analysis over the traditional methods such as SVM.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123495373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Homology Kernel: A Biologically Motivated Sequence Embedding into Euclidean Space","authors":"E. Eskin, S. Snir","doi":"10.1109/CIBCB.2005.1594915","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594915","url":null,"abstract":"Part of the challenge of modeling protein sequences is their discrete nature. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this paper, we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128871295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Kernel Machines for Cancer Classification Using Gene Expression Data","authors":"Huilin Xiong, Xue-wen Chen","doi":"10.1109/CIBCB.2005.1594928","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594928","url":null,"abstract":"Cancer classification using gene expression data has been shown to be very useful for cancer diagnosis and prediction. However, the very high dimensionality and relatively small sample size of gene expression data make the classification task quite challenging. In this paper, we present a new approach, based on optimizing the kernel function, to improve the performance of classifiers on gene expression data. Aiming to increase the class separability of the data, we utilize a more flexible kernel function model, the data-dependent kernel, as the objective kernel to be optimized. The experimental results show that using the optimized kernel usually results in a substantial improvement for the K-nearest-neighbor (KNN) algorithm in classifying gene expression data.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"287 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133624666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Architecture Combining Bayesian segmentation and Neural Network Ensembles for Protein Secondary Structure Prediction","authors":"Niranjan P. Bidargaddi, M. Chetty, J. Kamruzzaman","doi":"10.1109/CIBCB.2005.1594960","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594960","url":null,"abstract":"A combined architecture of Bayesian segmentation and ensembles of two-layered feedforward networks has been built and tested on two widely studied non-membrane, non-homologous databases comprising 480 and 608 protein sequences, respectively. In the first stage, Bayesian segmentation is used to infer the sequence/structure relationship in terms of structural segments, an approach well suited to modeling non-local interactions among segments. The probability scores for the three structural states (helix, sheet and coil) of each residue obtained from the Bayesian segmentation are used as inputs to a feedforward neural network in the second stage. The neural network is trained with a sliding window comprising the scores of seven consecutive residues, along with additional inputs for physicochemical properties of the residues, where the prediction is made for the central residue. The key aspect of the model is the inclusion of physicochemical properties of the amino acids at the second stage. An ensemble of neural networks has been trained in the second stage, using a posterior probabilities approach to determine the number of neural networks. This model achieves a Q3 accuracy of above 71%, which is one of the highest accuracy values for single sequence prediction methods.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"342 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132425871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feature Selection for Classification with Proteomic Data of Mixed Quality","authors":"E. Marchiori, N. Heegaard, C. Jiménez, Mikkel West-Nielsen","doi":"10.1109/CIBCB.2005.1594944","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594944","url":null,"abstract":"In this paper we assess experimentally the performance of two state-of-the-art feature selection methods, called RFE and RELIEF, when used for classifying pattern proteomic samples of mixed quality. The data are generated by spiking human sera to artificially create differentiable sample groups, and by handling samples at different storage temperatures. We consider two types of classifiers: support vector machines (SVM) and k-nearest neighbour (kNN). Results of leave-one-out cross validation (LOOCV) experiments indicate that RELIEF selects more stable feature subsets than RFE over the runs, where the selected features are mainly spiked ones. However, RFE outperforms RELIEF in terms of (average LOOCV) accuracy, both when combined with SVM and kNN. Perfect LOOCV accuracy is obtained by RFE combined with 1NN. Almost all the samples that are wrongly classified by the algorithms have a high storage temperature. The results of experiments on these data indicate that when samples of mixed quality are analyzed computationally, selecting only relevant (spiked) features does not necessarily yield the highest classification accuracy.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126493826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of MALDI-TOF Serum Profiles for Biomarker Selection and Sample Classification","authors":"H. Ressom, R. Varghese, E. Orvisky, S. K. Drake, G. Hortin, M. Abdel-Hamid, C. Loffredo, R. Goldman","doi":"10.1109/CIBCB.2005.1594943","DOIUrl":"https://doi.org/10.1109/CIBCB.2005.1594943","url":null,"abstract":"Mass spectrometric profiles of peptides and proteins obtained by current technologies are characterized by complex spectra, high dimensionality, and substantial noise. These characteristics generate challenges in discovery of proteins and protein-profiles that distinguish disease states, e.g. cancer patients from healthy individuals. A challenging aspect of biomarker discovery in serum is the interference of abundant proteins with identification of disease-related proteins and peptides. We present data processing methods and computational intelligence that combines support vector machines (SVM) with particle swarm optimization (PSO) for biomarker selection from MALDI-TOF spectra of enriched serum. SVM classifiers were built for various combinations of m/z windows guided by the PSO algorithm. The method identified mass points that achieved high classification accuracy in distinguishing cancer patients from non-cancer controls. Based on their frequency of occurrence in multiple runs, six m/z windows were selected as candidate biomarkers. These biomarkers yielded 100% sensitivity and 91% specificity in distinguishing liver cancer patients from healthy individuals in an independent dataset.","PeriodicalId":330810,"journal":{"name":"2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121821582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}