Linear global detectors of redundant and rare substrings

Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096). Pub Date: 1999-03-29. DOI: 10.1109/DCC.1999.755666

A. Apostolico, M. Bock, S. Lonardi


Citations: 4

Abstract

Identifying strings that are, by some measure, redundant or rare in the context of larger sequences is an implicit goal of any data compression method. In the straightforward approach to searching for unusual substrings, the words (up to a certain length) are enumerated more or less exhaustively and individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. As is well known, clever methods are available to compute and organize the counts of occurrences of all substrings of a given string. The corresponding tables take up the tree-like structure of a special kind of digital search index or trie. We show here that, under several accepted measures of deviation from expected frequency, the candidate over- or under-represented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n²) possible substrings. This surprising fact follows from properties of the form: if a word that ends in the middle of an arc is, say, over-represented, then its extension to the nearest node of the tree is even more so. Based on this, we design global linear detectors of favored and unfavored words for our probabilistic framework, and display the results of some preliminary experiments that apply our constructions to the analysis of genomic sequences.
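The arc-extension property described in the abstract can be illustrated with a small sketch. It is not the authors' linear-time construction: it counts substrings by brute force instead of building a suffix tree, and it assumes an i.i.d. (Bernoulli) symbol model with empirical symbol frequencies and the ratio of observed to expected count as the score. The function names (`substring_counts`, `extend_to_node`, `score`, and so on) are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from math import prod


def substring_counts(text):
    """Count occurrences (overlaps allowed) of every substring, by brute force.

    Quadratic in len(text); a suffix tree would organize the same counts
    in linear space, which is what the abstract relies on.
    """
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts


def symbol_probs(text):
    """Empirical symbol probabilities (assumed i.i.d. model)."""
    freq = Counter(text)
    return {c: k / len(text) for c, k in freq.items()}


def expected_count(word, probs, n):
    """Expected number of occurrences of word in a random text of length n."""
    return (n - len(word) + 1) * prod(probs[c] for c in word)


def score(word, counts, probs, n):
    """Assumed score: observed count divided by expected count."""
    return counts[word] / expected_count(word, probs, n)


def extend_to_node(word, counts, probs):
    """Extend word to the right while some symbol follows every occurrence.

    Such a symbol exists exactly when the word ends in the middle of a
    suffix-tree arc; the loop stops at the nearest node below it.
    """
    while True:
        same = [word + c for c in probs if counts.get(word + c, 0) == counts[word]]
        if not same:
            return word
        word = same[0]  # at most one symbol can preserve the full count


if __name__ == "__main__":
    text = "ACGTACGTTACGA"
    counts = substring_counts(text)
    probs = symbol_probs(text)
    for w in ("A", "AC", "CG", "T"):
        ext = extend_to_node(w, counts, probs)
        print(w, "->", ext,
              score(w, counts, probs, len(text)),
              score(ext, counts, probs, len(text)))
```

Under these assumptions, extending a word along its arc leaves the observed count unchanged while the expected count can only shrink, so the extension scores at least as high as the original word. That is why only the words ending at nodes need to be examined for over-representation in this toy setting.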

