可转位方正方图的线性时间构造

V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu
{"title":"可转位方正方图的线性时间构造","authors":"V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu","doi":"10.4230/LIPIcs.WABI.2020.7","DOIUrl":null,"url":null,"abstract":"We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. \nWe give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. \nExperiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\\%$ of the size of the MSA.","PeriodicalId":329847,"journal":{"name":"Workshop on Algorithms in Bioinformatics","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Linear Time Construction of Indexable Founder Block Graphs\",\"authors\":\"V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu\",\"doi\":\"10.4230/LIPIcs.WABI.2020.7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. \\nWe give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. \\nExperiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\\\\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\\\\%$ of the size of the MSA.\",\"PeriodicalId\":329847,\"journal\":{\"name\":\"Workshop on Algorithms in Bioinformatics\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-05-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on Algorithms in Bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.4230/LIPIcs.WABI.2020.7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Algorithms in Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/LIPIcs.WABI.2020.7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

摘要

我们介绍了一种基于最佳分割概念的紧凑泛基因组表示,旨在从多序列比对(MSA)中重建方正序列。这样的建立者序列的特点是,每一行的MSA是一个重新组合的建立者。以前已经设计了几种线性时间动态规划算法来优化分割,这些分割可以诱导方正块,然后可以连接到一组方正序列中。所有可能的连接顺序都可以表示为一个方正块图。我们观察到这样的图的一个关键属性:如果节点标签(创始人段)在图的路径中不重复,这样的图可以被索引以进行有效的字符串匹配。我们称这种图为分段无重复方正块图。给出了一种线性时间算法来构造给定MSA的段无重复方正块图。该算法结合了创始人分割算法(Cazaux等)的技术。SPIRE 2019)和全功能双向Burrows-Wheeler指数(Belazzougui和Cunial, CPM 2019)。我们推导了一个简洁的索引结构来支持图路径中任意长度的查询。报道了sars - cov -2株MSA的实验。大小为$410\乘以29811$的MSA在一分钟内被压缩成一个包含3900个节点和4440条边的段无重复的创建方块图。节点标签的最大长度为12,总长度为34968。图上的索引只占用MSA大小的3%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Linear Time Construction of Indexable Founder Block Graphs
We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SAR-CoV-2 strains are reported. An MSA of size $410\times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信