{"title":"Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line","authors":"S. Pissis, Ahmad Retha","doi":"10.4230/LIPIcs.SEA.2018.16","journal":{"name":"Bulletin of the Society of Sea Water Science, Japan","volume":"25 1","pages":"16:1-16:14"},"publicationDate":"2018-01-01","publicationTypes":"Journal Article","citationCount":"13","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.4230/LIPIcs.SEA.2018.16"}
Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line
An elastic-degenerate string is a sequence of $n$ sets of strings of total length $N$. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length $m$, pattern matching in an elastic-degenerate text can be solved on-line in time $O(nm^2 + N)$ with pre-processing time and space $O(m)$ (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time $O(N \cdot \lceil m/w \rceil)$ with pre-processing time and space $O(m \cdot \lceil m/w \rceil)$, where $w$ is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length $M$. A straightforward generalization of the existing bit-vector algorithm would require time $O(N \cdot \lceil M/w \rceil)$ with pre-processing time and space $O(M \cdot \lceil M/w \rceil)$, which is prohibitive in practice. We present a new on-line $O(N \cdot \lceil M/w \rceil)$-time algorithm with pre-processing time and space $O(M)$. We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population's variants are considered.

2012 ACM Subject Classification: Theory of computation → Pattern matching
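To make the problem statement concrete, the following is a minimal Python sketch of bit-parallel (Shift-And style) dictionary matching over an elastic-degenerate text. It is an illustration written for this summary, not the authors' algorithm or implementation. The names `build_masks` and `ed_dictionary_match`, the list-of-lists text representation, and the `(segment index, pattern index)` output format are all assumptions. Python's arbitrary-precision integers stand in for the $\lceil M/w \rceil$ machine words of a real implementation.

```python
def build_masks(patterns):
    """Build Shift-And masks over the concatenation of all patterns.

    Hypothetical helper: bit i corresponds to position i of the
    concatenated patterns (total length M).
    """
    char_mask = {}  # character -> bitmask of positions holding it
    first = 0       # bits at the first position of each pattern
    last = {}       # bit index of a pattern's last position -> pattern index
    pos = 0
    for k, p in enumerate(patterns):
        if not p:
            raise ValueError("patterns must be non-empty")
        first |= 1 << pos
        for c in p:
            char_mask[c] = char_mask.get(c, 0) | (1 << pos)
            pos += 1
        last[pos - 1] = k
    return char_mask, first, last


def ed_dictionary_match(ed_text, patterns):
    """Return {(segment index, pattern index)} for every pattern
    occurrence that ends inside the given segment.

    ed_text is a list of segments; each segment is a list of strings,
    where "" denotes the empty string.
    """
    char_mask, first, last = build_masks(patterns)
    final = 0
    for bit in last:
        final |= 1 << bit

    hits = set()
    state = 0  # active pattern prefixes at the current segment boundary
    for i, segment in enumerate(ed_text):
        next_state = 0
        for s in segment:
            cur = state  # every alternative starts from the same state
            for c in s:
                # Classic Shift-And step: extend prefixes, start new ones.
                cur = ((cur << 1) | first) & char_mask.get(c, 0)
                ended = cur & final
                while ended:
                    low = ended & -ended  # lowest set bit
                    hits.add((i, last[low.bit_length() - 1]))
                    ended ^= low
            next_state |= cur  # an empty alternative passes state through
        state = next_state
    return hits


# Text {A} . {C, empty, GT} . {T} spells ACT, AT and AGTT.
example_text = [["A"], ["C", "", "GT"], ["T"]]
example_patterns = ["AT", "GTT", "CA"]
print(sorted(ed_dictionary_match(example_text, example_patterns)))
```

Each character of the text is processed once with a constant number of operations on an $M$-bit state, which is where a $\lceil M/w \rceil$ factor comes from when the state is stored in machine words. The sketch spends more than $O(M)$ space on its per-character masks, so it does not reproduce the pre-processing bound stated above.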