Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

S. Pissis, Ahmad Retha
DOI: 10.4230/LIPIcs.SEA.2018.16 (https://doi.org/10.4230/LIPIcs.SEA.2018.16)
Pages: 16:1-16:14
Publication date: 2018-01-01
Citations: 13

Abstract

An elastic-degenerate string is a sequence of $n$ sets of strings of total length $N$. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length $m$, pattern matching in an elastic-degenerate text can be solved on-line in time $O(nm^2 + N)$ with pre-processing time and space $O(m)$ (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time $O(N \cdot \lceil m/w \rceil)$ with pre-processing time and space $O(m \cdot \lceil m/w \rceil)$, where $w$ is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length $M$. A straightforward generalization of the existing bit-vector algorithm would require time $O(N \cdot \lceil M/w \rceil)$ with pre-processing time and space $O(M \cdot \lceil M/w \rceil)$, which is prohibitive in practice. We present a new on-line $O(N \cdot \lceil M/w \rceil)$-time algorithm with pre-processing time and space $O(M)$. We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population’s variants are considered.

2012 ACM Subject Classification: Theory of computation → Pattern matching
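
To make the problem statement concrete, here is a minimal Python sketch of dictionary matching in an elastic-degenerate text. It is a naive baseline written for illustration only, not the bit-vector algorithm the abstract describes, and it makes no attempt to reach the stated time bounds. The function name `naive_ed_dictionary_match`, the representation of the text as a list of sets of strings (where the empty string models a deletion), and the choice to report the index of the segment in which an occurrence ends are all assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple


def naive_ed_dictionary_match(
    segments: List[Set[str]], patterns: Iterable[str]
) -> Set[Tuple[str, int]]:
    """Return (pattern, index of the segment where an occurrence ends)."""
    pats = [p for p in patterns if p]  # empty patterns are ignored
    hits: Set[Tuple[str, int]] = set()
    # (p, k): the first k letters of p match some path through the text
    # that ends exactly at the end of the previous segment, 0 < k < len(p).
    active: Set[Tuple[str, int]] = set()

    for i, segment in enumerate(segments):
        next_active: Set[Tuple[str, int]] = set()
        for s in segment:
            # 1) Extend partial matches carried over from earlier segments.
            for p, k in active:
                rest = p[k:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        hits.add((p, i))
                elif rest.startswith(s):  # also holds when s is empty
                    next_active.add((p, k + len(s)))
            for p in pats:
                # 2) Occurrences lying entirely inside this string.
                if p in s:
                    hits.add((p, i))
                # 3) Proper prefixes of p that are suffixes of s start
                #    new partial matches.
                for k in range(1, min(len(s), len(p) - 1) + 1):
                    if s.endswith(p[:k]):
                        next_active.add((p, k))
        active = next_active
    return hits


# Hypothetical toy input: the text spells A, then one of C / CG / nothing, then T.
text = [{"A"}, {"C", "CG", ""}, {"T"}]
found = naive_ed_dictionary_match(text, {"ACT", "AT", "CGT", "GG"})
print(sorted(found))  # [('ACT', 2), ('AT', 2), ('CGT', 2)]
```

The sketch tracks every (pattern, matched length) pair explicitly, so its cost grows with the number of patterns and their lengths at each segment. The bit-vector approach named in the abstract instead packs such states into machine words, which is where the $\lceil M/w \rceil$ factor comes from.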