{"title":"All Hits All The Time: Parameter Free Calculation of Seed Sensitivity","authors":"Denise Y. F. Mak, Gary Benson","doi":"10.1142/9781860947995_0035","DOIUrl":null,"url":null,"abstract":"Standard search techniques for DNA repeats start by identifying seeds , that is, small matching words, that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spacedseeds [8] andindel seeds [9] which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed s nsitivityrequires 1) specifying a homology model which describes types of alignments that can occur between two copies of a repeat, and 2) assigning probabilities to those alignments. Optimal seed selection is a resource intensive activity because essentially all alternative seeds must be tested [7]. Current methods require that the model and probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. In this paper, we show how to eliminatethe need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe a highly efficient preprocessing step, which is computed just oncefor each seed. In this calculation, strings which represent possible alignments are unweightedby any probability parameters. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to find optimal seeds and the boundaries within probability space where they are optimal. We expect this method to greatly facilitate the study of seed space sensitivity, construction of multiple seed sets, and the use of alternative definitions of optimality.","PeriodicalId":74513,"journal":{"name":"Proceedings of the ... Asia-Pacific bioinformatics conference","volume":"31 1","pages":"327-340"},"PeriodicalIF":0.0000,"publicationDate":"2007-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... Asia-Pacific bioinformatics conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/9781860947995_0035","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15
Abstract
Standard search techniques for DNA repeats start by identifying seeds , that is, small matching words, that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spacedseeds [8] andindel seeds [9] which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed s nsitivityrequires 1) specifying a homology model which describes types of alignments that can occur between two copies of a repeat, and 2) assigning probabilities to those alignments. Optimal seed selection is a resource intensive activity because essentially all alternative seeds must be tested [7]. Current methods require that the model and probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. In this paper, we show how to eliminatethe need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe a highly efficient preprocessing step, which is computed just oncefor each seed. In this calculation, strings which represent possible alignments are unweightedby any probability parameters. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to find optimal seeds and the boundaries within probability space where they are optimal. We expect this method to greatly facilitate the study of seed space sensitivity, construction of multiple seed sets, and the use of alternative definitions of optimality.