Gong-Xin Yu, George Ostrouchov, Al Geist, Nagiza F Samatova
{"title":"An SVM-based algorithm for identification of photosynthesis-specific genome features.","authors":"Gong-Xin Yu, George Ostrouchov, Al Geist, Nagiza F Samatova","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This paper presents a novel algorithm for identification and functional characterization of \"key\" genome features responsible for a particular biochemical process of interest. The central idea is that individual genome features are identified as \"key\" features if the discrimination accuracy between two classes of genomes with respect to a given biochemical process is sufficiently affected by the inclusion or exclusion of these features. In this paper, genome features are defined by high-resolution gene functions. The discrimination procedure utilizes the Support Vector Machine classification technique. The application to the oxygenic photosynthetic process resulted in 126 highly confident candidate genome features. While many of these features are well-known components in the oxygenic photosynthetic process, others are completely unknown, even including some hypothetical proteins. It is obvious that our algorithm is capable of discovering features related to a targeted biochemical process.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"235-43"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25833759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"cWINNOWER algorithm for finding fuzzy DNA motifs.","authors":"Shoudan Liang","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The cWINNOWER algorithm detects fuzzy motifs in DNA sequences rich in protein-binding signals. A signal is defined as any short nucleotide pattern having up to d mutations differing from a motif of length l. The algorithm finds such motifs if multiple mutated copies of the motif (i.e., the signals) are present in the DNA sequence in sufficient abundance. The cWINNOWER algorithm substantially improves the sensitivity of the winnower method of Pevzner and Sze by imposing a consensus constraint, enabling it to detect much weaker signals. We studied the minimum number of detectable motifs qc as a function of sequence length N for random sequences. We found that q(c) increases linearly with N for a fast version of the algorithm based on counting three-member sub-cliques. Imposing consensus constraints reduces q(c) by a factor of three in this case, which makes the algorithm dramatically more sensitive. Our most sensitive algorithm, which counts four-member sub-cliques, needs a minimum of only 13 signals to detect motifs in a sequence of length N = 12,000 for (l,d) = (15,4).</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"260-5"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25834888","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Keynote address: the role of algorithmic research in computational genomics.","authors":"Richard M Karp","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In the early 1990s, after more than three decades of studying algorithms within the frame work of theoretical computer science, I shifted my focus to alogrithmic problems arising in genomics. There is a fundamental difference between the views of algorithms in the two fields: in theoretical computer science the input-output behavior of an algorithm is rigorously specified in advance, whereas in computational biology an algorithm is merely a vehicle for discovering Nature's ground truth. In order to be effective in computational genomics I have had to radically change my approach to research. On the occasion of this keynote address I will share some of the lessons I have learned, in the hope of making the way easier for computer scientists and mathematicians entering this field. These lessons will be encapsulated in a list of aphorisms, accompanied by illustrative examples.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"10-1"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"26133912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Can we identify cellular pathways implicated in cancer using gene expression data?","authors":"Nigam Shah, Jorge Lepre, Yuhai Tu, Gustavo Stolovitzky","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The cancer state of a cell is characterized by alterations of important cellular processes such as cell proliferation, apoptosis, DNA-damage repair, etc. The expression of genes associated with cancer related pathways, therefore, may exhibit differences between the normal and the cancerous states. We explore various means to find these differences. We analyze 6 different pathways (p53, Ras, Brca, DNA damage repair, NFkappab and beta-catenin) and 4 different types of cancer: colon, pancreas, prostate and kidney. Our results are found to be mostly consistent with existing knowledge of the involvement of these pathways in different cancers. Our analysis constitutes proof of principle that it may be possible to predict the involvement of a particular pathway in cancer or other diseases by using gene expression data. Such method would be particularly useful for the types of diseases where biology is poorly understood.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"94-103"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25834346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fourier harmonic approach for visualizing temporal patterns of gene expression data.","authors":"Li Zhang, Aidong Zhang, Murali Ramanathan","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex dataset by presenting the data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to present an interactive visualization technique conveying the temporal patterns of gene expression data in a form intuitive for non-specialized end-users. The first Fourier harmonic projection (FFHP) was introduced to translate the multi-dimensional time series data into a two dimensional scatter plot. The spatial relationship of the points reflect the structure of the original dataset and relationships among clusters become two dimensional. The proposed method was tested using two published, array-derived gene expression datasets. Our results demonstrate the effectiveness of the approach.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"137-47"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25834351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Group testing with DNA chips: generating designs and decoding experiments.","authors":"Alexander Schliep, David C Torney, Sven Rahmann","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>DNA microarrays are a valuable tool for massively parallel DNA-DNA hybridization experiments. Currently, most applications rely on the existence of sequence-specific oligonucleotide probes. In large families of closely related target sequences, such as different virus subtypes, the high degree of similarity often makes it impossible to find a unique probe for every target. Fortunately, this is unnecessary. We propose a microarray design methodology based on a group testing approach. While probes might bind to multiple targets simultaneously, a properly chosen probe set can still unambiguously distinguish the presence of one target set from the presence of a different target set. Our method is the first one that explicitly takes cross-hybridization and experimental errors into account while accommodating several targets. The approach consists of three steps: (1) Pre-selection of probe candidates, (2) Generation of a suitable group testing design, and (3) Decoding of hybridization results to infer presence or absence of individual targets. Our results show that this approach is very promising, even for challenging data sets and experimental error rates of up to 5%. On a data set of 28S rDNA sequences we were able to identify 660 sequences, a substantial improvement over a prior approach using unique probes which only identified 408 sequences.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"84-91"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25834436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Haplotype motifs: an algorithmic approach to locating evolutionarily conserved patterns in haploid sequences.","authors":"Russell Schwartz","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The promise of plentiful data on common human genetic variations has given hope that we will be able to uncover genetic factors behind common diseases that have proven difficult to locate by prior methods. Much recent interest in this problem has focused on using haplotypes (contiguous regions of correlated genetic variations), instead of the isolated variations, in order to reduce the size of the statistical analysis problem. In order to most effectively use such variation data, we will need a better understanding of haplotype structure, including both the general principles underlying haplotype structure in the human population and the specific structures found in particular genetic regions or sub-populations. This paper presents a probabilistic model for analyzing haplotype structure in a population using conserved motifs found in statistically significant sub-populations. It describes the model and computational methods for deriving the predicted motif set and haplotype structure for a population. It further presents results on simulated data, in order to validate the method, and on two real datasets from the literature, in order to illustrate its practical application.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"306-14"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25834893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CTSS: a robust and efficient method for protein structure alignment based on local geometrical and biological features.","authors":"Tolga Can, Yuan-Fang Wang","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We present a new method for conducting protein structure similarity searches, which improves on the accuracy, robustness, and efficiency of some existing techniques. Our method is grounded in the theory of differential geometry on 3D space curve matching. We generate shape signatures for proteins that are invariant, localized, robust, compact, and biologically meaningful. To improve matching accuracy, we smooth the noisy raw atomic coordinate data with spline fitting. To improve matching efficiency, we adopt a hierarchical coarse-to-fine strategy. We use an efficient hashing-based technique to screen out unlikely candidates and perform detailed pairwise alignments only for a small number of candidates that survive the screening process. Contrary to other hashing based techniques, our technique employs domain specific information (not just geometric information) in constructing the hash key, and hence, is more tuned to the domain of biology. Furthermore, the invariancy, localization, and compactness of the shape signatures allow us to utilize a well-known local sequence alignment algorithm for aligning two protein structures. One measure of the efficacy of the proposed technique is that we were able to discover new, meaningful motifs that were not reported by other structure alignment methods.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"169-79"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25833752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Statistical inference for well-ordered structures in nucleotide sequences.","authors":"Shu-Yun Le, Jih-H Chen, Jacob V Maize","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Distinct, local structures are frequently correlated with functional RNA elements involved in post-transcriptional regulation of gene expression. Discovery of microRNAs (miRNAs) suggests that there are a large class of small non-coding RNAs in eukaryotic genomes. These miRNAs have the potential to form distinct fold-back stem-loop structures. The prediction of those well-ordered folding sequences (WFS) in genomic sequences is very helpful for our understanding of RNA-based gene regulation and the determination of local RNA elements with structure-dependent functions. In this study, we describe a novel method for discovering the local WFS in a nucleotide sequence by Monte Carlo simulation and RNA folding. In the approach the quality of a local WFS is assessed by the energy difference (E(diff)) between the optimal structure folded in the local segment and its corresponding optimal, restrained structure where all the previous base pairings formed in the optimal structure are prohibited. Distinct WFS can be discovered by scanning successive segments along a sequence for evaluating the difference between E(diff) of the natural sequence and those computed from randomly shuffled sequences. Our results indicate that the statistically significant WFS detected in the genomic sequences of Caenorhabditis elegans (C.elegans) F49E12, T07C5, T07D1, T10H9, Y56A3A and Y71G12B are coincident with known fold-back stem-loops found in miRNA precursors. The potential and implications of our method in searching for miRNAs in genomes is discussed.</p>","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"2 ","pages":"190-6"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25833754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minghua Deng, Kui Zhang, Shipra Mehta, Ting Chen, Fengzhu Sun
{"title":"Prediction of protein function using protein-protein interaction data.","authors":"Minghua Deng, Kui Zhang, Shipra Mehta, Ting Chen, Fengzhu Sun","doi":"10.1109/csb.2002.1039342","DOIUrl":"https://doi.org/10.1109/csb.2002.1039342","url":null,"abstract":"Assigning functions to novel proteins is one of the most important problems in the postgenomic era. Several approaches have been applied to this problem, including the analysis of gene expression patterns, phylogenetic profiles, protein fusions, and protein-protein interactions. In this paper, we develop a novel approach that employs the theory of Markov random fields to infer a protein's functions using protein-protein interaction data and the functional annotations of protein's interaction partners. For each function of interest and protein, we predict the probability that the protein has such function using Bayesian approaches. Unlike other available approaches for protein annotation in which a protein has or does not have a function of interest, we give a probability for having the function. This probability indicates how confident we are about the prediction. We employ our method to predict protein functions based on \"biochemical function,\" \"subcellular location,\" and \"cellular role\" for yeast proteins defined in the Yeast Proteome Database (YPD, www.incyte.com), using the protein-protein interaction data from the Munich Information Center for Protein Sequences (MIPS, mips.gsf.de). We show that our approach outperforms other available methods for function prediction based on protein interaction data. The supplementary data is available at www-hto.usc.edu/~msms/ProteinFunction.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"10 6 1","pages":"947-60"},"PeriodicalIF":0.0,"publicationDate":"2003-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2002.1039342","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}