PLA-index: A k-mer Index Exploiting Rank Curve Linearity.


Leibniz International Proceedings in Informatics. Pub Date: 2024-01-01. Epub Date: 2024-08-26. DOI: 10.4230/LIPIcs.WABI.2024.13

Hasin Abrar, Paul Medvedev



Abstract

Given a sorted list of k-mers S, the rank curve of S is the function mapping a k-mer from the k-mer universe to the location in S where it either first appears or would be inserted. An exciting recent development is the observation that, for certain datasets, the rank curve is predictable and can be exploited to create small search indices. In this paper, we develop a novel search index that first estimates a k-mer's rank using a piece-wise linear approximation of the rank curve and then does a local search to determine the precise location of the k-mer in the list. We combine ideas from previous approaches and supplement them with an innovative data representation strategy that substantially reduces space usage. Our PLA-index uses an order of magnitude less space than Sapling and less than half the space of the PGM-index, for roughly the same query time. For example, using only 9 MiB of memory, it can narrow down the position of a k-mer in the suffix array of the human genome to within 255 positions. Furthermore, we demonstrate the potential of our approach to impact a variety of downstream applications. First, the PLA-index halves the time of binary search on the suffix array of the human genome. Second, the PLA-index reduces the space of a direct-access lookup table by 76 percent, without increasing the run time. Third, we plug the PLA-index into the state-of-the-art read aligner Strobealign and replace a 2 GiB component with a PLA-index of size 1.5 MiB, without significantly affecting runtime. The software and reproducibility information are freely available at https://github.com/medvedevgroup/pla-index.
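The two-step query described in the abstract (estimate the rank with a piece-wise linear approximation, then search locally) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce the paper's compact data representation. It assumes k-mers are already encoded as integers, uses a simple greedy "shrinking cone" fit, and the names `build_pla`, `lookup`, and `eps` are hypothetical. The fallback to a full binary search is an assumption added here so that the sketch stays correct for absent keys and under floating-point rounding.

```python
from bisect import bisect_left, bisect_right

def _pick_slope(lo, hi):
    # A one-point segment never narrows the cone, so hi stays infinite.
    return lo if hi == float("inf") else (lo + hi) / 2.0

def build_pla(keys, eps):
    """Greedily fit line segments to the rank curve of a sorted int list.

    Each segment is (start_key, start_rank, slope). Within a segment, the
    first position of every distinct key is approximated to within about eps.
    """
    segments = []
    x0 = y0 = prev = None
    lo, hi = 0.0, float("inf")
    for rank, x in enumerate(keys):
        if x == prev:
            continue  # only the first occurrence defines a key's rank
        prev = x
        if x0 is not None:
            dx = x - x0
            # Slopes that keep this point within eps of the line from (x0, y0).
            new_lo = max(lo, (rank - y0 - eps) / dx)
            new_hi = min(hi, (rank - y0 + eps) / dx)
            if new_lo <= new_hi:
                lo, hi = new_lo, new_hi
                continue
            segments.append((x0, y0, _pick_slope(lo, hi)))
        # Start a new segment anchored at the current point.
        x0, y0, lo, hi = x, rank, 0.0, float("inf")
    if x0 is not None:
        segments.append((x0, y0, _pick_slope(lo, hi)))
    return segments

def lookup(keys, segments, seg_starts, eps, q):
    """Return the rank of q: its first position, or its insertion point."""
    n = len(keys)
    i = bisect_right(seg_starts, q) - 1
    if i < 0:
        return 0  # q precedes every indexed key (or the list is empty)
    x0, y0, slope = segments[i]
    pred = y0 + int(slope * (q - x0))        # step 1: estimate the rank
    lo = min(n, max(0, pred - eps - 1))
    hi = min(n, pred + eps + 2)
    pos = bisect_left(keys, q, lo, hi)       # step 2: local search
    # Assumption of this sketch: if the window missed the true rank,
    # fall back to a full binary search.
    if (lo > 0 and keys[lo - 1] >= q) or (hi < n and keys[hi] < q):
        pos = bisect_left(keys, q)
    return pos

# Small usage example with made-up integer keys.
keys = [3, 3, 7, 8, 8, 8, 15, 21, 22, 40, 41, 90]
eps = 2
segments = build_pla(keys, eps)
seg_starts = [s[0] for s in segments]
for q in (0, 8, 9, 41, 100):
    assert lookup(keys, segments, seg_starts, eps, q) == bisect_left(keys, q)
```

In this sketch `eps` trades space for query time: a larger value tends to yield fewer segments but a wider window for the local search.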
