Practical Indexing of Repetitive Collections Using Relative Lempel-Ziv
G. Navarro, Victor Sepulveda
2019 Data Compression Conference (DCC), published 2019-03-01
DOI: 10.1109/DCC.2019.00028 (https://doi.org/10.1109/DCC.2019.00028)
Citations: 6
Abstract
We introduce a simple and implementable compressed index for highly repetitive sequence collections based on Relative Lempel-Ziv (RLZ). For a collection of total size n compressed into z phrases with respect to a reference string R[1..r] over alphabet [1..σ] with h-th order empirical entropy H_h(R), our index uses rH_h(R) + o(r log σ) + O(r + z log n) bits and finds the occ occurrences of a pattern P[1..m] in time O((m + occ) log n). This is competitive with the only existing index based on RLZ, yet it is much simpler and easier to implement. On a 1 GB collection of 80 yeast genomes, a variant of our index achieves the least space among competing structures (slightly over 0.1 bits per base) while outperforming or matching them in time (1–10 microseconds per occurrence found). Our largest variant (below 0.3 bits per base) offers the best search time (1–3 microseconds per occurrence) among all structures using space below 1 bit per base.
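To make the notion of "z phrases from a reference string R" concrete, the following is a minimal sketch of a greedy RLZ parse and its decoder. It is an illustration only, not the index described in the abstract: the function names (`longest_match`, `rlz_parse`, `rlz_decode`) are hypothetical, positions are 0-based rather than 1-based, the longest match is found by naive substring search instead of an index over R, and emitting a literal character when a symbol does not occur in R is an assumption of this sketch.

```python
from typing import List, Tuple, Union

# A phrase is either (position in R, length) or a single literal character.
Phrase = Union[Tuple[int, int], str]


def longest_match(reference: str, text: str, i: int) -> Tuple[int, int]:
    """Return (pos, length) of the longest prefix of text[i:] occurring in reference.

    Naive search for illustration; a practical implementation would query
    an index built over the reference instead.
    """
    pos, length = -1, 0
    while i + length < len(text):
        cand = reference.find(text[i:i + length + 1])
        if cand == -1:
            break
        pos, length = cand, length + 1
    return pos, length


def rlz_parse(reference: str, text: str) -> List[Phrase]:
    """Greedily split text into phrases that are substrings of reference."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(text):
        pos, length = longest_match(reference, text, i)
        if length == 0:
            # Assumption of this sketch: symbols absent from R become literals.
            phrases.append(text[i])
            i += 1
        else:
            phrases.append((pos, length))
            i += length
    return phrases


def rlz_decode(reference: str, phrases: List[Phrase]) -> str:
    """Rebuild the text by copying each phrase out of the reference."""
    parts = []
    for p in phrases:
        if isinstance(p, str):
            parts.append(p)
        else:
            pos, length = p
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    R = "ACGTACGT"
    T = "ACGTTACGA"
    parse = rlz_parse(R, T)
    print(parse)
    print(rlz_decode(R, parse) == T)
```

In this toy case the text of length n = 9 is parsed into z = 3 phrases against a reference of length r = 8. In a highly repetitive collection, z is typically far smaller than n, which is why a space term of the form O(z log n) can be attractive.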