{"title":"Supervised learning of maternal cigarette-smoking signatures from placental gene expression data: A case study","authors":"Chengpeng Bi, C. Vyhlidal, J. Leeder","doi":"10.1109/CIBCB.2010.5510587","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510587","url":null,"abstract":"This paper aims to conduct supervised learning of the cigarette-smoking signatures from the placental gene expression data sets under the neural network framework and build classifiers to identify the cigarette-smoking moms during pregnancy. First, a unified model for gene selection is proposed to single out a set of informative gene sets (up- or down-regulated genes). The selected signature gene sets are subject to refinement, and the refined informative gene sets are then fed into three supervised statistical learning algorithms, linear discriminant function (LDF), probabilistic neural network (PNN) and support vector machine (SVM), for training and testing. The results show that SVM is the best classifier in predicting the cigarette-smoking moms compared to the other methods tested.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122713283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequence transformation to a complex signature form for consistent phylogenetic tree using Extensible Markov Model","authors":"Rao M. Kotamarti, Michael Hahsler, Douglas W. Raiford, M. Dunham","doi":"10.1109/CIBCB.2010.5510472","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510472","url":null,"abstract":"Phylogenetic tree analysis using molecular sequences continues to expand beyond the 16S rRNA marker. By addressing the multi-copy issue known as the intra-heterogeneity, this paper restores the focus in using the 16S rRNA marker. Through use of a novel learning and model building algorithm, the multiple gene copies are integrated into a compact complex signature using the Extensible Markov Model (EMM). The method clusters related sequence segments while preserving their inherent order to create an EMM signature for a microbial organism. A library of EMM signatures is generated from which samples are drawn for phylogenetic analysis. By matching the components of two signatures, referred to as quasi-alignment, the differences are highlighted and scored. Scoring quasi-alignments is done using adapted Karlin-Altschul statistics to compute a novel distance metric. The metric satisfies conditions of identity, symmetry, triangular inequality and the four point rule required for a valid evolution distance metric. The resulting distance matrix is input to PHYLogeny Inference Package (PHYLIP) to generate phylogenies using neighbor joining algorithms. Through control of clustering in signature creation, the diversity of similar organisms and their placement in the phylogeny is explained. The experiments include analysis of genus Burkholderia, a random microbial sample spanning several phyla and a diverse sample that includes RNA of Eukaryotic origin. The NCBI sequence data for 16S rRNA is used for validation.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127422405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting retroviruses using reading frame information and side effect machines","authors":"W. Ashlock, S. Datta","doi":"10.1109/CIBCB.2010.5510699","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510699","url":null,"abstract":"This paper addresses the problem of distinguishing retroviruses from non-coding DNA sequences. Retroviruses have a distinctive reading frame structure that includes multiple reading frames that often overlap. This paper uses reading frame information generated from Fourier spectral analysis as input for Side Effect Machines (SEMs) that are evolved to create clusterings which separate the two types of sequences. The output from these SEMs is then used to train Support Vector Machines (SVMs) to perform the classification. The best classifier out of 100 replicates achieves 100% accuracy using complete retroviral genomes and the average classifier achieves 85% accuracy. Using endogenous retroviral data that includes many mutations, the best classifier achieves 86% accuracy; the average achieves an accuracy of 71%. The method also was able to distinguish lentiviruses from other types of retroviruses with a best accuracy of 100% (average 93%). In order to better understand the evolved SEMs, classifiers trained on SEMs evolved using endogenous retroviral data were used to classify the complete unmutated retroviral genomes and vice versa. It was found that, regardless of which type of data was used to create the classifiers, their performance on the test data sets was similar. This suggests that SEMs are able to extract the distinctive retroviral reading frame structure from the Fourier spectra, but that in some of the endogenous retroviruses in our data set there were too many mutations for this structure to be discernable from the data using this method.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"28 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114028142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Classification of HIV-1 protease crystal structures using Random Forest, linear discriminant analysis and logistic regression","authors":"Gene M. Ko, A. Reddy, Sunil Kumar, S. A. Bailey, R. Garg","doi":"10.1109/CIBCB.2010.5510465","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510465","url":null,"abstract":"The present study develops a classification model to correlate the binding pockets of 70 HIV-1 protease crystal structures in terms of their structural descriptors to their complexed HIV-1 protease inhibitors. The Random Forest classification model is used to reduce the chemical descriptor space from 456 to the 12 most relevant descriptors based on the Gini importance measure. The selected 12 descriptors are then used to develop classification models using linear discriminant analysis (LDA) and logistic regression (LR). The top eight descriptors were found to produce the best LDA model with an overall error of 30% and a leave-one-out cross validation error of 44.29%, while the top five descriptors were found to produce the best LR model with an overall error of 28.57% and a leave-one-out cross validation error of 41.43%. Hierarchical clustering was performed on the top five and eight descriptors to verify whether the descriptor selection of Random Forest can group together the binding pockets based on their complexed ligands. The selected descriptors would play a crucial role in understanding the HIV-1 protease binding pocket structure in terms of its chemical descriptors.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"22 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114110581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expanded study of efn2 thermodynamic model performance on RnaPredict, an evolutionary algorithm for RNA folding","authors":"K. Wiese, A. Hendriks","doi":"10.1109/CIBCB.2010.5510321","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510321","url":null,"abstract":"The shape that organic molecules such as biopolymers form within organic systems largely determines the function said molecules perform. RNA is a biopolymer that plays a central part in several stages of protein synthesis, and also has structural, functional, and regulatory roles in the cell. In an ab initio case most common structure prediction techniques employ minimization of the free energy of a given RNA molecule via a thermodynamic model. RnaPredict is an evolutionary algorithm for RNA folding. This paper compares the performance of an advanced thermodynamic model, efn2, against the stacking-energy thermodynamic models INN and INN-HB on a test set containing 24 sequences from 4 rRNA subtypes. The prediction accuracy of efn2 is demonstrated on a majority of test sequences. A comparison is also made with the mfold prediction algorithm which demonstrated RnaPredict's comparable performance.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121114528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New approaches to clustering microarray time-series data using multiple expression profile alignment","authors":"N. Subhani, L. Rueda, A. Ngom, C. J. Burden","doi":"10.1109/CIBCB.2010.5510385","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510385","url":null,"abstract":"An important process in functional genomic studies is clustering microarray time-series data, where genes with similar expression profiles are expected to be functionally related. Clustering microarray time-series data via pairwise alignment of piecewise linear profiles has been recently introduced. In this paper, we propose a clustering approach based on a multiple profile alignment of natural cubic spline and piecewise linear representations of gene expression profiles. We combine these multiple alignment approaches with k-means. We ran our methods on a well-known data set of pre-clustered Saccharomyces cerevisiae gene expression profiles and a data set of 3315 Pseudomonas aeruginosa expression profiles. We assessed the validity of the resulting clusters and applied a c-nearest neighbor classifier for evaluating the performance of our approaches, obtaining accuracies of 89.51% and 86.12% respectively, on Saccharomyces cerevisiae data, and 90.90% and 93.71% accuracies for cubic spline and piecewise linear respectively on Pseudomonas aeruginosa data.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122914323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting chemical activities from structures by attributed molecular graph classification","authors":"Qian Xu, Derek Hao Hu, H. Xue, Qiang Yang","doi":"10.1109/CIBCB.2010.5510690","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510690","url":null,"abstract":"Designing Quantitative Structure-Activity Relationship (QSAR) models has been a recurrent research interest for biologists and computer scientists. An example is to predict the toxicity of chemical compounds using their structural properties as features represented by graphs. A popular method to classify these graphs is to exploit classifiers such as support vector machines (SVMs) and graph kernels to incorporate the sequential, structural and chemical information. Previous works have focused on designing specific graph kernels for this task, amongst which graph alignment kernels are one of the most popular approaches. Graph alignment kernels align the nodes of one graph to the nodes of the second graph so that the total overall similarity is maximized with respect to all possible alignments. However, taking both vertex and edge similarities into account makes the problem NP-Hard. In this paper, we present a novel general graph-matching based method for QSAR. We view the problem of calculating optimal assignments of two attributed graphs from a different perspective. Instead of first designing an atom kernel function and a bond kernel function, we first provide a training set of pairs of graphs with their corresponding matchings. We then try to learn the compatibility function over atoms and use only the atom kernel function to compute graph matchings. Our algorithm has the advantage of being more general and yet efficient than previous approaches for the QSAR problem. We evaluate our method on a set of chemical structure-activity prediction benchmark datasets, and show that our algorithm can achieve better or comparable accuracies over the optimal assignment kernel method.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128173322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modular clustering of protein-protein interaction networks","authors":"Nassim Sohaee, C. Forst","doi":"10.1109/CIBCB.2010.5510590","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510590","url":null,"abstract":"Identifying the modular structures in protein-protein interaction networks is crucial to the understanding of the organization and function of biological systems. In this paper we introduce the concept of critical module in a network and propose an efficient algorithm to find all critical modules in a given network. Finally, we tested the proposed algorithm on a yeast protein interaction data set.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123015184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computation intelligence method to find generic non-coding RNA search models","authors":"Jennifer A. Smith","doi":"10.1109/CIBCB.2010.5510341","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510341","url":null,"abstract":"Fairly effective methods exist for finding new non-coding RNA genes using search models based on known families of ncRNA genes (for example covariance models). However, these models only find new members of the existing families and are not useful in finding potential members of novel ncRNA families. Other problems with family-specific search include large processing requirements, ambiguity in defining which sequences form a family and lack of sufficient numbers of known sequences to properly estimate model parameters. An ncRNA search model is proposed which includes a collection of non-overlapping RNA hairpin structure covariance models. The hairpin models are chosen from a hairpin-model list compiled from many families in the Rfam non-coding RNA families database. The specific hairpin models included and the overall score threshold for the search model is determined through the use of a genetic algorithm.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124954170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulation of oscillatory dynamics of blood testosterone levels using the crossover method","authors":"A. Sabnis, R. Harrison","doi":"10.1109/CIBCB.2010.5510490","DOIUrl":"https://doi.org/10.1109/CIBCB.2010.5510490","url":null,"abstract":"Blood testosterone levels oscillate periodically in humans. The in vivo dynamics of this biochemical system cannot be simulated in silico using a continuous deterministic solution of a previously reported mathematical model. The use of the stochastic simulation algorithm (SSA), however, has been reported to generate sustained oscillations that are qualitatively and quantitatively consistent with the experimental observations. Although the SSA is capable of accurately simulating a biochemical network, it is extremely inefficient from a computational standpoint. In this work, we have attempted to simulate the above mentioned model using a deterministic-stochastic crossover method, for three separate sets of parameters. Each time, not only did the results show the existence of sustained oscillations but also that the computational time was at least four times lower than the corresponding SSA solution. The crossover method can hence be proposed as a viable alternative to the SSA for simulating biochemical systems that are commonly encountered in systems biology applications.","PeriodicalId":340637,"journal":{"name":"2010 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology","volume":"107 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130469007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}