Efficient Index for Weighted Sequences

Annual Symposium on Combinatorial Pattern Matching. Pub Date: 2016-02-02. DOI: 10.4230/LIPIcs.CPM.2016.4

Carl Barton, T. Kociumaka, S. Pissis, J. Radoszewski

Citations: 21

Abstract

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet, a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time, improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, both of which improve upon the state of the art by the same factor.
