{"title":"Accelerating approximate subsequence search on large protein sequence databases","authors":"Jiong Yang, Wei Wang, Yi Xia, Philip S. Yu","doi":"10.1109/CSB.2002.1039343","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039343","url":null,"abstract":"In this paper, we study the problem on how to build a persistent index structure for protein sequences to support approximate match. The suffix tree has been proposed as a solution to index sequence database and has been deployed on organizing DNA sequences (Hunt et al. (2001)). Unfortunately, it suffers from the problem of \"memory bottleneck\" that prevents it from being applied efficiently to a large database. The performance even degrades further for protein database due to a larger fanout at each node. Here, we employ an indexing structure, called BASS-tree, to support approximate match in sublinear time on a large protein database. We call this indexing method the sequence approximate match index method. The search of approximate matches can be properly directed to the portion in the database with a high potential of matching quickly. It is demonstrated in our experiments that the potential performance improvement is in an order of magnitude over alternative methods such as the BLAST algorithm and the suffix tree.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"207-216"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039343","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214354","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HIV protease structural database","authors":"V. Ravichandran, J. Vondrášek, G. Gilliland, T. Bhat, A. Wlodawer","doi":"10.1109/CSB.2002.1039363","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039363","url":null,"abstract":"HIV Protease Database (HIVdb) is a repository for those structures of HIV protease that have never been released or deposited to the Protein Data Bank (PDB). Together with the official PDB data, HIVdb provided a unique source of information in a statistical sense. The database contains 207 structures; 148 taken from PDB, and 59 that are unique entries in HIVdb. Query tools in terms of the creation of ensembles for statistical analysis were designed. We present a new form, location, tools and data form of the HIV Protease Database. The new tools utilize a standard PDB user interface, but provide extra capabilities connected exclusively with this one protein and its ligands. We also present a design strategy for a specific subset or sub-database of the PDB with the aim of pointing out the statistical dimension of the problem related to a single protein structure. We are currently annotating the ligands in order to include their chemical properties. This approach emphasises large scale databases and scalability.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"340-"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039363","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards automatic clustering of protein sequences","authors":"Jiong Yang, Wei Wang","doi":"10.1109/CSB.2002.1039340","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039340","url":null,"abstract":"Analyzing protein sequence data becomes increasingly important recently. Most previous work on this area has mainly focused on building classification models. In this paper we investigate in the problem of automatic clustering of unlabeled protein sequences. As a widely recognized technique in statistics and computer science, clustering has been proven very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty, that prevents clustering from being performed directly on protein sequence is the lack of an effective similarity measure that can be computed efficiently. Therefore, we propose a novel model for protein sequence cluster by exploring significant statistical properties possessed by the sequences. The concept of imprecise probabilities are introduced to the original probabilistic suffix tree to monitor the convergence of the empirical measurement and to guide the clustering process. It is demonstrated that the proposed method can successfully discover meaningful families without the necessity of learning models of different families from pre-labeled \"training data\".","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"175-186"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complexity and application of pedigree analysis programme GTree","authors":"D. Ogino, S. Mori, M. Nose, Hideki Sawada","doi":"10.1109/CSB.2002.1039358","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039358","url":null,"abstract":"A novel recombinant congenic mouse strain, McRA1lpr/lpr, which was established by the intercrosses between MRL/Mp/-lpr/lpr and C3H/HeJ-lpr/lpr strains throughout more than F50 generations by means of selection based on swelling of ankle joints, manifested severe arthritis, followed by ankylosis, pathologically resembling rheumatoid arthritis in humans. To clarify the genetic mechanisms on the development of arthritis in this strain, we newly prepared \"GTree\" for analyzing the pedigree of pathological phenotypes of arthritis, splenomegaly and lymphadenopathy based on the collected data from over 700 McRA1-lpr/lpr mice collected and arranged by Shiro MORI. The data themselves are now dealt with by a PostgreSQL Linux server administered by Hideki SAWADA. We explain the algorithm of the program and its complexity, and show the pathological peculiarity of spleens and axillary lymph nodes which appear only in the group of RA mice.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"333-335"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039358","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A reference database for Medicago truncatula genes, proteins, and metabolites","authors":"D. Guo, Xingjing Li, A. Kamal, O. Brazhnik, P. Mendes","doi":"10.1109/CSB.2002.1039366","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039366","url":null,"abstract":"Summary form only given. As a model plant for legumes as well as a rich source of natural products (such as flavonoids, isoflavonoids and triterpenes), Medicago truncatula (Mt) is one of the subjects of current major US genomics initiatives. Nevertheless, data sources of gene, protein, and metabolite in relation to Mt are very limited in publicly available biological databases. Information about genes, proteins, and metabolites is usually distributed among multiple databases. Retrieval and organization of this information can be a laborious task. We present a relational database, B-Net, that is intended to gather information from multiple sources representing genes, proteins, metabolites, and biochemical reactions of Mt. This database represents known facts about the biochemistry of Mt, classified according to the Gene Ontology. We anticipate this new resource to be particularly useful as a reference data set but also a qualitative proteome and metabolome database.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"102 1","pages":"343-"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039366","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DNA sequence compression using the Burrows-Wheeler Transform","authors":"D. Adjeroh, Yong Zhang, A. Mukherjee, M. Powell, T. Bell","doi":"10.1109/CSB.2002.1039352","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039352","url":null,"abstract":"We investigate off-line dictionary oriented approaches to DNA sequence compression, based on the Burrows-Wheeler Transform (BWT). The preponderance of short repeating patterns is an important phenomenon in biological sequences. Here, we propose off-line methods to compress DNA sequences that exploit the different repetition structures inherent in such sequences. Repetition analysis is performed based on the relationship between the BWT and important pattern matching data structures, such as the suffix tree and suffix array. We discuss how the proposed approach can be incorporated in the BWT compression pipeline.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"303-313"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039352","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visualization techniques for genomic data","authors":"A. Loraine, G. Helt","doi":"10.1109/CSB.2002.1039354","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039354","url":null,"abstract":"In order to take full advantage of the newly available public human genome sequence data and associated annotations, biologists require visualization tools that can accommodate the high frequency of alternative splicing in human genes and other complexities. We describe techniques for presenting human genomic sequence data and annotations in an interactive, graphical format, with the aim of providing developers with a guide to what features are most likely to meet biologists' needs. These techniques include: one-dimensional semantic zooming to show sequence data alongside gene structures; moveable, adjustable tiers; visual encoding of the translation frame to show how alternative transcript structure affects encoded proteins; and display of protein domains in the context of genomic sequence to show how alternative splicing impacts protein structure and function.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"321-326"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039354","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rapid large-scale oligonucleotide selection for microarrays","authors":"S. Rahmann","doi":"10.1109/CSB.2002.1039329","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039329","url":null,"abstract":"We present the first algorithm that selects oligonucleotide probes (e.g. 25-mers) for microarray experiments on a large scale. For example, oligos for human genes can be found within 50 hours. This becomes possible by using the longest common substring as a specificity measure for candidate oligos. We present an algorithm based on a suffix array with additional information that is efficient both in terms of memory usage and running time to rank all candidate oligos according to their specificity. We also introduce the concept of master sequences to describe the sequences from which oligos are to be selected. Constraints such as oligo length, melting temperature, and self-complementarity are incorporated in the master sequence at a preprocessing stage and thus kept separate from the main selection problem. As a result, custom oligos can now be designed for any sequenced genome, just as the technology for on-site chip synthesis is becoming increasingly mature.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"54-63"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039329","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient branch-and-bound algorithm for the assignment of protein backbone NMR peaks","authors":"Guohui Lin, Dong Xu, Zhi-Zhong Chen, Tao Jiang, Jianjun Wen, Ying Xu","doi":"10.1109/CSB.2002.1039339","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039339","url":null,"abstract":"NMR resonance assignment is one of the key steps in solving an NMR protein structure. The assignment process links resonance peaks to individual residues of the target protein sequence, providing the prerequisite for establishing intra- and inter-residue spatial relationships between atoms. The assignment process is tedious and time-consuming, which could take many weeks. Though there exist a number of computer programs to assist the assignment process, many NMR labs are still doing the assignments manually to ensure quality. This paper presents a new computational method based on our recent work towards automating the assignment process, particularly the process of backbone resonance peak assignment. We formulate the assignment problem as a constrained weighted bipartite matching problem. While the problem, in the most general situation, is NP-hard, we present an efficient solution based on a branch-and-bound algorithm with effective bounding techniques and a greedy filtering algorithm for reducing the search space. Our experimental results on 70 instances of (pseudo) real NMR data derived from 14 proteins demonstrate that the new solution runs much faster than a recently introduced (exhaustive) two-layer algorithm and recovers more correct peak assignments than the two-layer algorithm.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"165-174"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A bi-recursive neural network architecture for the prediction of protein coarse contact maps","authors":"A. Vullo, P. Frasconi","doi":"10.1109/CSB.2002.1039341","DOIUrl":"https://doi.org/10.1109/CSB.2002.1039341","url":null,"abstract":"Prediction of contact maps may be seen as a strategic step towards the solution of fundamental open problems in structural genomics. In this paper we focus on coarse grained maps that describe the spatial neighborhood relation between secondary structure elements (helices, strands, and coils) of a protein. We introduce a new machine learning approach for scoring candidate contact maps. The method combines a specialized noncausal recursive connectionist architecture and a heuristic graph search algorithm. The network is trained using candidate graphs generated during search. We show how the process of selecting and generating training examples is important for tuning the precision of the predictor.","PeriodicalId":87204,"journal":{"name":"Proceedings. IEEE Computer Society Bioinformatics Conference","volume":"1 1","pages":"187-196"},"PeriodicalIF":0.0,"publicationDate":"2002-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2002.1039341","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62214307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}