Prefix-free parsing for building large tunnelled Wheeler graphs

Workshop on Algorithms in Bioinformatics | Pub Date: 2022-06-30 | DOI: 10.4230/LIPIcs.WABI.2022.18

Adrián Goga, Andrej Baláz


Citations: 3

Abstract

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation that can be further compacted by tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e., the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing. The idea of PFP is to divide the input text into phrases of roughly equal size that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the parse is generally much shorter than the original text. To speed up the block selection for tunnelling, we apply PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics, and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.
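The phrase-splitting step described in the abstract can be illustrated with a minimal sketch. This is a toy version under stated assumptions, not the authors' implementation: the function names `prefix_free_parse` and `reconstruct`, the sentinel characters, the window length `w`, the modulus `p`, and the character-sum hash (standing in for the rolling hash a practical implementation would likely use) are all hypothetical choices made for illustration.

```python
def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Toy prefix-free parsing; returns (dictionary, parse).

    Assumptions (illustrative only): one start sentinel and w end
    sentinels are added, and a window of length w is a "trigger" when
    a simple character-sum hash is divisible by p.
    """
    padded = "\x02" + text + "\x03" * w

    def is_trigger(window: str) -> bool:
        # Stand-in for a rolling hash over the window.
        return sum(ord(c) for c in window) % p == 0

    phrases = []
    start = 0
    last_pos = len(padded) - w
    for i in range(1, last_pos + 1):
        # A phrase ends with a trigger window; the final window always ends one.
        if i == last_pos or is_trigger(padded[i:i + w]):
            phrases.append(padded[start:i + w])
            start = i  # the next phrase begins with the same w characters

    dictionary = sorted(set(phrases))           # all distinct phrases
    rank = {ph: r for r, ph in enumerate(dictionary)}
    parse = [rank[ph] for ph in phrases]        # sequence of phrase ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w: int = 2) -> str:
    """Rebuild the text by dropping the w-character overlap between phrases."""
    phrases = [dictionary[r] for r in parse]
    padded = phrases[0] + "".join(ph[w:] for ph in phrases[1:])
    return padded[1:len(padded) - w]            # strip the sentinels


if __name__ == "__main__":
    text = "GATTACA" * 20
    dictionary, parse = prefix_free_parse(text)
    print(len(text), len(parse), len(dictionary))
    assert reconstruct(dictionary, parse) == text
```

On a repetitive input, the same phrases recur, so the dictionary stays small while the parse records only their ranks. In the method described above, it is the parse, not the full text, that is indexed and tunnelled first.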
