Building scalable indexes that can be efficiently queried

C. Boucher
Published in: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2022-05-01
DOI: 10.1109/IPDPSW55747.2022.00034 (https://doi.org/10.1109/IPDPSW55747.2022.00034)

Abstract

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. We later showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index alone cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to compute those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared to existing methods, ours used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, the index produced by our method was less than one thousandth the size of competing indexes for large collections of human chromosomes.
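
The abstract names prefix-free parsing without describing it, so a toy sketch of the general idea may help. It is not the authors' implementation. The text is cut wherever a length-w window hashes to 0 modulo p (a "trigger"), consecutive phrases overlap by w characters, and the result is a dictionary of distinct phrases plus a parse listing their ranks. The function names, the sentinel characters, the simple polynomial hash, and the tiny values of w and p are all assumptions made for illustration; a practical implementation would use a rolling hash and larger parameters.

```python
from typing import List, Tuple


def window_hash(window: str) -> int:
    """Deterministic polynomial hash of one window (stand-in for a rolling hash)."""
    h = 0
    for ch in window:
        h = (h * 256 + ord(ch)) % 1_000_003
    return h


def prefix_free_parse(text: str, w: int = 2, p: int = 3) -> Tuple[List[str], List[int]]:
    """Return (sorted dictionary of distinct phrases, parse as dictionary ranks)."""
    # Hypothetical sentinels, assumed not to occur in `text`.
    padded = "\x01" + text + "\x02" * w
    last = len(padded) - w  # start of the final window, always treated as a trigger

    phrases: List[str] = []
    start = 0
    for i in range(1, last + 1):
        if i == last or window_hash(padded[i:i + w]) % p == 0:
            # A phrase runs from one trigger through the end of the next one,
            # so consecutive phrases overlap by exactly w characters.
            phrases.append(padded[start:i + w])
            start = i

    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary: List[str], parse: List[int], w: int = 2) -> str:
    """Invert the parse: drop the w-character overlap, then strip the sentinels."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out[1:-w]


if __name__ == "__main__":
    sample = "GATTACA" * 20 + "GATTTACA" + "GATTACA" * 20
    d, ps = prefix_free_parse(sample)
    print(len(sample), len(d), len(ps))
```

On repetitive input the dictionary tends to stay small while the parse grows with the text. This is why construction that works on the dictionary and parse, rather than on the full text, can save time and memory.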