All Hits All The Time: Parameter Free Calculation of Seed Sensitivity

Proceedings of the ... Asia-Pacific bioinformatics conference Pub Date : 2007-01-01 DOI:10.1142/9781860947995_0035

Denise Y. F. Mak, Gary Benson

Volume: 31 1, Pages: 327-340

Citations: 15

Abstract

Standard search techniques for DNA repeats start by identifying seeds, that is, small matching words that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spaced seeds [8] and indel seeds [9], which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed sensitivity requires 1) specifying a homology model which describes the types of alignments that can occur between two copies of a repeat, and 2) assigning probabilities to those alignments. Optimal seed selection is a resource-intensive activity because essentially all alternative seeds must be tested [7]. Current methods require that the model and probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun. In this paper, we show how to eliminate the need for prior parameter specification. The ideas presented follow from a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe a highly efficient preprocessing step, which is computed just once for each seed. In this calculation, strings which represent possible alignments are unweighted by any probability parameters. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to find optimal seeds and the boundaries within probability space where they are optimal. We expect this method to greatly facilitate the study of seed space sensitivity, construction of multiple seed sets, and the use of alternative definitions of optimality.
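The central observation (the set of hits is fixed, only the weights change) can be illustrated with a small sketch. It is not the authors' algorithm. It assumes a simple homology model that the abstract does not specify: ungapped alignments of fixed length `n`, written as binary strings (1 = match, 0 = mismatch), with each position matching independently with probability `p`. The function names `seed_hits`, `hit_counts`, `sensitivity`, and `dominates` are hypothetical, and the brute-force enumeration stands in for the paper's far more efficient preprocessing step.

```python
from itertools import product


def seed_hits(seed: str, alignment: tuple) -> bool:
    """Return True if the spaced seed hits the alignment at some offset.

    `seed` is a string over {'1', '0'}: '1' requires a match, '0' is a
    don't-care position. `alignment` is a tuple of 0/1 values (1 = match).
    """
    required = [i for i, c in enumerate(seed) if c == "1"]
    last_offset = len(alignment) - len(seed)
    return any(
        all(alignment[off + i] for i in required)
        for off in range(last_offset + 1)
    )


def hit_counts(seed: str, n: int) -> list:
    """Parameter-free preprocessing (brute force, small n only).

    counts[k] is the number of length-n alignments with exactly k matches
    that the seed hits. No probability parameter is involved.
    """
    counts = [0] * (n + 1)
    for alignment in product((0, 1), repeat=n):
        if seed_hits(seed, alignment):
            counts[sum(alignment)] += 1
    return counts


def sensitivity(counts: list, p: float) -> float:
    """Weight the precomputed hits for a given match probability p."""
    n = len(counts) - 1
    return sum(c * p**k * (1 - p) ** (n - k) for k, c in enumerate(counts))


def dominates(a: list, b: list) -> bool:
    """Sufficient condition (under the assumed model) that the seed with
    counts `b` never beats the seed with counts `a` for any p."""
    return a != b and all(x >= y for x, y in zip(a, b))


if __name__ == "__main__":
    contiguous = hit_counts("11", 3)
    spaced = hit_counts("101", 3)
    print(contiguous)  # [0, 0, 2, 1]
    print(spaced)      # [0, 0, 1, 1]
    # Re-evaluating for a new p reuses the counts; no re-enumeration needed.
    for p in (0.5, 0.7, 0.9):
        print(p, sensitivity(contiguous, p), sensitivity(spaced, p))
```

Under this assumed model, `hit_counts` is computed once per seed, and `sensitivity` then costs only `n + 1` terms for each new value of `p`. The `dominates` check is one simple way to rule a seed out for every `p` at once. The paper's own criterion for seeds that can never be optimal may be more refined.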
