{"title":"Natively disordered proteins: functions and predictions.","authors":"Pedro Romero, Zoran Obradovic, A Keith Dunker","doi":"10.2165/00822942-200403020-00005","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00005","url":null,"abstract":"<p><p>Proteins can exist in at least three forms: the ordered form (solid-like), the partially folded form (collapsed, molten globule-like or liquid-like) and the extended form (extended, random coil-like or gas-like). The protein trinity hypothesis has two components: (i) a given native protein can be in any one of the three forms, depending on the sequence and the environment; and (ii) function can arise from any one of the three forms or from transitions between them. In this study, bioinformatics and data mining were used to investigate intrinsic disorder in proteins and develop neural network-based predictors of natural disordered regions (PONDR) that can discriminate between ordered and disordered residues with up to 84% accuracy. Predictions of intrinsic disorder indicate that the three kingdoms follow the disorder ranking eubacteria < archaebacteria << eukaryotes, with approximately half of eukaryotic proteins predicted to contain substantial regions of intrinsic disorder. Many of the known disordered regions are involved in signalling, regulation or control. Involvement of highly flexible or disordered regions in signalling is logical: a flexible sensor more readily undergoes conformational change in response to environmental perturbations than does a rigid one. Thus, the increased disorder in the eukaryotes is likely the direct result of an increased need for signalling and regulation in nucleated organisms. PONDR can also be used to detect molecular recognition elements that are disordered in the unbound state and become structured when bound to a biologically meaningful partner. 
Application of disorder predictions to cell-signalling, cancer-associated and control protein databases supports the widespread occurrence of protein disorder in these processes.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"105-13"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00005","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24941798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A sequence alignment-independent method for protein classification.","authors":"John K Vries, Rajan Munshi, Dror Tobi, Judith Klein-Seetharaman, Panayiotis V Benos, Ivet Bahar","doi":"10.2165/00822942-200403020-00008","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00008","url":null,"abstract":"<p><p>Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20^4) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. 
A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"137-48"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24941801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gene structure prediction using an orthologous gene of known exon-intron structure.","authors":"Stephanie Seneff, Chao Wang, Christopher B Burge","doi":"10.2165/00822942-200403020-00002","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00002","url":null,"abstract":"<p><p>Given the availability of complete genome sequences from related organisms, sequence conservation can provide important clues for predicting gene structure. In particular, one should be able to leverage information about known genes in one species to help determine the structures of related genes in another. Such an approach is appealing in that high-quality gene prediction can be achieved for newly sequenced species, such as mouse and puffer fish, using the extensive knowledge that has been accumulated about human genes. This article reports a novel approach to predicting the exon-intron structures of mouse genes by incorporating constraints from orthologous human genes using techniques that have previously been exploited in speech and natural language processing applications. The approach uses a context-free grammar to parse a training corpus of annotated human genes. A statistical training procedure produces a weighted recursive transition network (RTN) intended to capture the general features of a mammalian gene. This RTN is expanded into a finite state transducer (FST) and composed with an FST capturing the specific features of the human orthologue. This model includes a trigram language model on the amino acid sequence as well as exon length constraints. A final stage uses the free software package ClustalW to align the top n candidates in the search space. 
For a set of 98 orthologous human-mouse pairs, we achieved 96% sensitivity and 97% specificity at the exon level on the mouse genes, given only knowledge gleaned from the annotated human genome.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"81-90"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00002","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24943060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KMD: an open-source port of the ArrayExpress microarray database.","authors":"Jean-Pierre Mainguy, Grant Macdonnell, Stefan Bund, David L Wild","doi":"10.2165/00822942-200403040-00008","DOIUrl":"https://doi.org/10.2165/00822942-200403040-00008","url":null,"abstract":"<p><strong>Unlabelled: </strong>The Keck Microarray Database (KMD) is a port of the ArrayExpress database from Oracle to the MySQL environment. The requirements for a locally available, open-source microarray database solution based on ArrayExpress are analysed in this article. The differences between the Oracle and MySQL environments are identified and the method to port to MySQL is described, providing a unified relational database management system (RDBMS) platform for both MIAMExpress and ArrayExpress.</p><p><strong>Availability: </strong>The software and documentation are available from the Keck Graduate Institute of Applied Life Sciences website at http://public.kgi.edu/~jmainguy/applied-bioinformatics.htm.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 4","pages":"257-60"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403040-00008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25118537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Biosphere: the interoperation of web services in microarray cluster analysis.","authors":"Kei-Hoi Cheung, Remko de Knikker, Youjun Guo, Guoneng Zhong, Janet Hager, Kevin Y Yip, Albert K H Kwan, Peter Li, David W Cheung","doi":"10.2165/00822942-200403040-00007","DOIUrl":"https://doi.org/10.2165/00822942-200403040-00007","url":null,"abstract":"<p><strong>Unlabelled: </strong>The growing use of DNA microarrays in biomedical research has led to the proliferation of analysis tools. These software programs address different aspects of analysis (e.g. normalisation and clustering within and across individual arrays) as well as extended analysis methods (e.g. clustering, annotation and mining of multiple datasets). Therefore, microarray data analysis typically requires the interoperability of multiple software programs involving different analysis types and methods. Such interoperation is often hampered by the heterogeneity inherent in the software tools (which may function by implementing different interfaces and using different programming languages). To address this problem, we employed the simple object access protocol (SOAP)-based web service approach that provides a uniform programmatic interface to these heterogeneous software components. To demonstrate this approach in the microarray context, we created a web server application, Biosphere, which interoperates a number of web services that are geographically widely distributed. These web services include a clustering web service, which is a suite of different clustering algorithms for analysing microarray data; XEMBL, developed at the European Bioinformatics Institute (EBI) for retrieving EMBL Nucleotide Sequence Database sequence data; and three gene annotation web services: GetGO, GetHAPI and GetUMLS. GetGO allows retrieval of Gene Ontology (GO) annotation, and the other two web services retrieve annotation from the biomedical literature that is indexed based on the Medical Subject Headings (MeSH) terms. 
With these web services, Biosphere allows the users to do the following: (i) cluster gene expression data using seven different algorithms; (ii) visualise the clustering results that are grouped statistically in colour; and (iii) retrieve sequence, annotation and citation data for the genes of interest.</p><p><strong>Availability: </strong>Biosphere and its web services described in Web Service Description Language (WSDL) can be accessed at http://rook.cecid.hku.hk:8280/BiosphereServer.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 4","pages":"253-6"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403040-00007","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25118642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond average protein secondary structure content prediction using FTIR spectroscopy.","authors":"Joachim A Hering, Peter R Innocent, Parvez I Haris","doi":"10.2165/00822942-200403010-00003","DOIUrl":"https://doi.org/10.2165/00822942-200403010-00003","url":null,"abstract":"<p><p>This paper demonstrates that secondary structure information beyond purely protein secondary structure content can be predicted from FTIR (Fourier transform infrared spectroscopy) spectra of proteins with a high degree of accuracy. Both neural networks and adaptive neuro-fuzzy inference systems (ANFISs) were employed to predict helix/sheet segment information. The best results were achieved using ANFISs with fuzzy subtractive clustering based on normalised, compressed amide I data with an average SEP (standard error of prediction, root mean of squared errors) of 1.51. Predictions for average helix/sheet length based merely on the amide I band maximum position in combination with the full-width at half-height resulted in a comparable average SEP of 1.62. This suggests the importance of information on the position and width of the amide I band maximum for the prediction of helix/sheet segment information. 
Finally, the most promising pattern recognition approaches found in this study were applied to a protein with an as yet unknown x-ray structure: native α1-antichymotrypsin (α1-ACT).</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 1","pages":"9-20"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403010-00003","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25739563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inferring property selection pressure from positional residue conservation.","authors":"Rose Hoberman, Judith Klein-Seetharaman, Roni Rosenfeld","doi":"10.2165/00822942-200403020-00011","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00011","url":null,"abstract":"<p><p>In this study, we attempt to understand and explain positional selection pressure in terms of underlying physical and chemical properties. We propose a set of constraining assumptions about how these pressures behave, then describe a procedure for analysing and explaining the distribution of residues at a particular position in a multiple sequence alignment. In contrast to previous approaches, our model takes into account both amino acid frequencies and a large number of physical-chemical properties. By analysing each property separately, it is possible to identify positions where distinct conservation patterns are present. In addition, the model can easily incorporate sequence weights that adjust for bias in the sample sequences. Finally, a test of statistical significance is provided for our conservation measure. The applicability of this method is demonstrated on two HIV-1 proteins: Nef and Env. The tools, data and results presented in this article are available at http://flan.blm.cs.cmu.edu.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"167-79"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00011","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24941804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Five hierarchical levels of sequence-structure correlation in proteins.","authors":"Christopher Bystroff, Yu Shao, Xin Yuan","doi":"10.2165/00822942-200403020-00004","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00004","url":null,"abstract":"<p><p>This article reviews recent work towards modelling protein folding pathways using a bioinformatics approach. Statistical models have been developed for sequence-structure correlations in proteins at five levels of structural complexity: (i) short motifs; (ii) extended motifs; (iii) nonlocal pairs of motifs; (iv) 3-dimensional arrangements of multiple motifs; and (v) global structural homology. We review statistical models, including sequence profiles, hidden Markov models (HMMs) and interaction potentials, for the first four levels of structural detail. The I-sites (folding Initiation sites) Library models short local structure motifs. Each succeeding level has a statistical model, as follows: HMMSTR (HMM for STRucture) is an HMM for extended motifs; HMMSTR-CM (Contact Maps) is a model for pairwise interactions between motifs; and SCALI-HMM (HMMs for Structural Core ALIgnments) is a set of HMMs for the spatial arrangements of motifs. The parallels between the statistical models and theoretical models for folding pathways are discussed in this article; however, global sequence models are not discussed because they have been extensively reviewed elsewhere. 
The data used and algorithms presented in this article are available at http://www.bioinfo.rpi.edu/~bystrc/ (click on \"servers\" or \"downloads\") or by request to bystrc@rpi.edu .</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"97-104"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00004","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24941797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extracting hydrogen-bond signature patterns from protein structure data.","authors":"Tejasvini Prasad, Tamilselvi Subramanian, Sridhar Hariharaputran, H S Chaitra, Nagasuma Chandra","doi":"10.2165/00822942-200403020-00007","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00007","url":null,"abstract":"<p><p>Classification of protein sequences and structures into families is a fundamental task in biology, and it is often used as a basis for designing experiments for gaining further knowledge. Some relationships between proteins are detected by the similarities in their sequences, and many more by the similarities in their structures. Despite this, there are a number of examples of functionally similar molecules without any recognisable sequence or structure similarities, and there are also a number of protein molecules that share common structural scaffolds but exhibit different functions. Newer methods of comparing molecules are required in order to detect similarities and dissimilarities in protein molecules. In this article, it is proposed that the precise 3-dimensional disposition of key residues in a protein molecule is what matters for its function, or what conveys the \"meaning\" for a biological system, but not what means it uses to achieve this. The concept of comparing two molecules through their intramolecular interaction networks is explored, since these networks dictate the disposition of amino acids in a protein structure. First, signature patterns, or fingerprints, of interaction networks in pre-classified protein structural families are computed using an approach to find structural equivalences and consensus hydrogen bonds. Five examples from different structural classes are illustrated. These patterns are then used to search the entire Protein Data Bank, an approach through which new, unexpected similarities have been found. The potential for finding relationships through this approach is highlighted. 
The use of hydrogen-bond fingerprints as a new metric for measuring similarities in protein structures is also described.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"125-35"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00007","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24941800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reusing microarrays within closely related species: experimental validation through phylogenetic inference.","authors":"Deepika Jagan, Gautam B Singh","doi":"10.2165/00822942-200403020-00003","DOIUrl":"https://doi.org/10.2165/00822942-200403020-00003","url":null,"abstract":"<p><p>Microarrays are generally designed for a specific set of organisms, and this poses a limitation for researchers wanting to conduct investigations on gene expression in organisms that are, in some sense, not \"popular\" enough. In this article, we demonstrate that microarrays may in fact be reusable for aggregate expression analysis for species that are evolutionarily related. Our validation approach is based on this assumption and draws a phylogenetic conclusion that is deemed to be true only if the assumption of reusability is valid. This article demonstrates that microarrays developed using the human transcriptome are reusable for aggregate expression analysis of primates in general.</p>","PeriodicalId":87049,"journal":{"name":"Applied bioinformatics","volume":"3 2-3","pages":"91-6"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.2165/00822942-200403020-00003","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"24943061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}