{"title":"Shannon information in complete genomes.","authors":"Chang-Heng Chang, L. Hsieh, T. Chen, Hong-Da Chen, L. Luo, Hoong-Chien Lee","doi":"10.1109/CSB.2004.153","DOIUrl":"https://doi.org/10.1109/CSB.2004.153","url":null,"abstract":"Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured in word lengths of two to ten letters. It is found that in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences - thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral and this suggests an independent corroboration of Kimura's neutral theory of evolution.","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":"1 1","pages":"20-30"},"PeriodicalIF":0.0,"publicationDate":"2004-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62215018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inverse Protein Folding in 2D HP Mode (Extended Abstract)","authors":"Arvind Gupta, Ján Manuch, L. Stacho","doi":"10.1109/CSB.2004.1332444","DOIUrl":"https://doi.org/10.1109/CSB.2004.1332444","url":null,"abstract":"The inverse protein folding problem is that of designing an amino acid sequence which has a particular native protein fold. This problem arises in drug design where a particular structure is necessary to ensure proper protein-protein interactions. In this paper we show that in the 2D HP model of Dill it is possible to solve this problem for a broad class of structures. These structures can be used to closely approximate any given structure. One of the most important properties of a good protein is its stability -- the aptitude not to fold simultaneously into other structures. We show that for a number of basic structures, our sequences have a unique fold.","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":"1 1","pages":"311-8"},"PeriodicalIF":0.0,"publicationDate":"2004-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/CSB.2004.1332444","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"62215002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shannon information in complete genomes.","authors":"Chang-Heng Chang, Li-Ching Hsieh, Ta-Yuan Chen, Hong-Da Chen, Liaofu Luo, Hoong-Chien Lee","doi":"10.1109/csb.2004.1332413","DOIUrl":"https://doi.org/10.1109/csb.2004.1332413","url":null,"abstract":"<p><p>Shannon information in the genomes of all completely sequenced prokaryotes and eukaryotes is measured in word lengths of two to ten letters. It is found that in a scale-dependent way, the Shannon information in complete genomes is much greater than that in matching random sequences - thousands of times greater in the case of short words. Furthermore, with the exception of the 14 chromosomes of Plasmodium falciparum, the Shannon information in all available complete genomes belongs to a universality class given by an extremely simple formula. The data are consistent with a model for genome growth composed of two main ingredients: random segmental duplications that increase the Shannon information in a scale-independent way, and random point mutations that preferentially reduce the larger-scale Shannon information. The inference drawn from the present study is that the large-scale and coarse-grained growth of genomes was selectively neutral and this suggests an independent corroboration of Kimura's neutral theory of evolution.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"20-30"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332413","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for association study design using a generalized model of haplotype conservation.","authors":"Russell Schwartz","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>There is considerable interest in computational methods to assist in the use of genetic polymorphism data for locating disease-related genes. Haplotypes, contiguous sets of correlated variants, may provide a means of reducing the difficulty of the data analysis problems involved. The field to date has been dominated by methods based on the \"haplotype block\" hypothesis, which assumes discrete population-wide boundaries between conserved genetic segments, but there is strong reason to believe that haplotype blocks do not fully capture true haplotype conservation patterns. In this paper, we address the computational challenges of using a more flexible, block-free representation of haplotype structure called the \"haplotype motif\" model for downstream analysis problems. We develop algorithms for htSNP selection and missing data inference using this more generalized model of sequence conservation. Application to a dataset from the literature demonstrates the practical value of these block-free methods.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"90-7"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPIDER: software for protein identification from sequence tags with de novo sequencing error.","authors":"Yonghua Han, Bin Ma, Kaizhong Zhang","doi":"10.1109/csb.2004.1332434","DOIUrl":"https://doi.org/10.1109/csb.2004.1332434","url":null,"abstract":"<p><p>For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting for amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same masses. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on the Internet for free public use. This paper describes the algorithms and features of the SPIDER software.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"206-15"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332434","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MinPD: distance-based phylogenetic analysis and recombination detection of serially-sampled HIV quasispecies.","authors":"Patricia Buendia, Giri Narasimhan","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>A new computational method to study within-host viral evolution is explored to better understand the evolution and pathogenesis of viruses. Traditional phylogenetic tree methods are better suited to study relationships between contemporaneous species, which appear as leaves of a phylogenetic tree. However, viral sequences are often sampled serially from a single host. Consequently, data may be available at the leaves as well as the internal nodes of a phylogenetic tree. Recombination may further complicate the analysis. Such relationships are not easily expressed by traditional phylogenetic methods. We propose a new algorithm, called MinPD, based on minimum pairwise distances. Our algorithm uses multiple distance matrices and correlation rules to output a MinPD tree or network. We test our algorithm using extensive simulations and apply it to a set of HIV sequence data isolated from one patient over a period of ten years. The proposed visualization of the phylogenetic tree/network further enhances the benefits of our methods.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"110-9"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3195421/pdf/nihms326150.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of two schemes for automatic keyword extraction from MEDLINE for functional gene clustering.","authors":"Ying Liu, Brian J Ciliax, Karin Borges, Venu Dasigi, Ashwin Ram, Shamkant B Navathe, Ray Dingledine","doi":"10.1109/csb.2004.1332452","DOIUrl":"https://doi.org/10.1109/csb.2004.1332452","url":null,"abstract":"<p><p>One of the key challenges of microarray studies is to derive biological insights from the unprecedented quantities of data on gene-expression patterns. Clustering genes by functional keyword association can provide direct information about the nature of the functional links among genes within the derived clusters. However, the quality of the keyword lists extracted from biomedical literature for each gene significantly affects the clustering results. We extracted keywords from MEDLINE that describe the most prominent functions of the genes, and used the resulting weights of the keywords as feature vectors for gene clustering. By analyzing the resulting cluster quality, we compared two keyword weighting schemes: normalized z-score and term frequency-inverse document frequency (TFIDF). The best combination of background comparison set, stop list and stemming algorithm was selected based on precision and recall metrics. In a test set of four known gene groups, a hierarchical algorithm correctly assigned 25 of 26 genes to the appropriate clusters based on keywords extracted by the TFIDF weighting scheme, but only 23 of 26 with the z-score method. To evaluate the effectiveness of the weighting schemes for keyword extraction for gene clusters from microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle were used as a second test set. Using established measures of cluster quality, the results produced from TFIDF-weighted keywords had higher purity, lower entropy, and higher mutual information than those produced from normalized z-score weighted keywords. The optimized algorithms should be useful for sorting genes from microarray lists into functionally discrete clusters.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"394-404"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332452","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Calculation, visualization, and manipulation of MASTs (Maximum Agreement Subtrees).","authors":"Shiming Dong, Eileen Kraemer","doi":"10.1109/csb.2004.1332453","DOIUrl":"https://doi.org/10.1109/csb.2004.1332453","url":null,"abstract":"<p><strong>Unlabelled: </strong>Phylogenetic trees are used to represent the evolutionary history of a set of species. Comparison of multiple phylogenetic trees can help researchers find the common classification of a tree group, compare tree construction inferences or obtain distances between trees. We present TreeAnalyzer, a freely available package for phylogenetic tree comparison. A MAST (Maximum Agreement Subtree) algorithm is implemented to compare the trees. Additional features of this software include tree comparison, visualization, manipulation, labeling, and printing.</p><p><strong>Availability: </strong>http://www.cs.uga.edu/~eileen/TreeAnalyzer.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"405-14"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332453","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25830004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimum entropy clustering and applications to gene expression analysis.","authors":"Haifeng Li, Keshu Zhang, Tao Jiang","doi":"10.1109/csb.2004.1332427","DOIUrl":"https://doi.org/10.1109/csb.2004.1332427","url":null,"abstract":"<p><p>Clustering is a common methodology for analyzing the gene expression data. In this paper, we present a new clustering algorithm from an information-theoretic point of view. First, we propose the minimum entropy (measured on a posteriori probabilities) criterion, which is the conditional entropy of clusters given the observations. Fano's inequality indicates that it could be a good criterion for clustering. We generalize the criterion by replacing Shannon's entropy with Havrda-Charvat's structural alpha-entropy. Interestingly, the minimum entropy criterion based on structural alpha-entropy is equal to the probability error of the nearest neighbor method when alpha = 2. This is further evidence that the proposed criterion is good for clustering. With a non-parametric approach for estimating a posteriori probabilities, an efficient iterative algorithm is then established to minimize the entropy. The experimental results show that the clustering algorithm performs significantly better than k-means/medians, hierarchical clustering, SOM, and EM in terms of adjusted Rand index. Particularly, our algorithm performs very well even when the correct number of clusters is unknown. In addition, most clustering algorithms produce poor partitions in the presence of outliers while our method can correctly reveal the structure of data and effectively identify outliers simultaneously.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"142-51"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332427","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recurrence time statistics: versatile tools for genomic DNA sequence analysis.","authors":"Yinhe Cao, Wen-Wen Tung, J B Gao","doi":"10.1109/csb.2004.1332415","DOIUrl":"https://doi.org/10.1109/csb.2004.1332415","url":null,"abstract":"<p><p>With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cerevisiae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.</p>","PeriodicalId":87417,"journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","volume":" ","pages":"40-51"},"PeriodicalIF":0.0,"publicationDate":"2004-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/csb.2004.1332415","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25829771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}