Computing Maximal Unique Matches with the r-index

Sara Giuliani, Giuseppe Romana, Massimiliano Rossi
DOI: 10.48550/arXiv.2205.01576 (https://doi.org/10.48550/arXiv.2205.01576)
Published: 2022-05-03. Pages: 22:1-22:16.

Abstract

In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index, while preserving the space and time bounds. We add O(r) additional samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT. These samples permit the computation of the second longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. On datasets that are not highly repetitive, our method uses up to 8 times less memory while being up to 19 times slower; on highly repetitive data, it is up to 6.5 times slower and uses up to 25 times less memory.
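To make the notion of a MUM concrete, the following is a minimal brute-force sketch in Python. It is not the r-index algorithm of the paper and uses none of its data structures. It assumes the common definition of a MUM as a match between pattern and text that cannot be extended to the left or right and that occurs exactly once in each string. The function names `count_occurrences` and `naive_mums` are hypothetical and chosen only for illustration.

```python
def count_occurrences(text, word):
    """Count possibly overlapping occurrences of word in text."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def naive_mums(text, pattern):
    """Return (pattern_pos, text_pos, length) for every MUM, by brute force.

    Assumed definition: a MUM is a match that cannot be extended to the
    left or right and occurs exactly once in text and once in pattern.
    """
    mums = []
    for i in range(len(pattern)):
        for j in range(len(text)):
            if pattern[i] != text[j]:
                continue
            # Skip pairs that can be extended to the left (not left-maximal).
            if i > 0 and j > 0 and pattern[i - 1] == text[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(pattern)
                   and j + length < len(text)
                   and pattern[i + length] == text[j + length]):
                length += 1
            word = pattern[i:i + length]
            # Keep the maximal match only if it is unique in both strings.
            if (count_occurrences(text, word) == 1
                    and count_occurrences(pattern, word) == 1):
                mums.append((i, j, length))
    return mums


if __name__ == "__main__":
    print(naive_mums("ACGTTGCA", "CGTAGCA"))
```

This sketch takes time at least quadratic in the input lengths, so it is only suitable for checking small examples by hand; the point of the approach described in the abstract is to avoid this cost by working in space proportional to r.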