Workshop on Algorithms in Bioinformatics: Latest Articles

Suffix sorting via matching statistics

Workshop on Algorithms in Bioinformatics Pub Date : 2022-07-03 DOI: 10.48550/arXiv.2207.00972

Zsuzsanna Lipták, Francesco Masillo, S. Puglisi

Citations: 1

Prefix-free parsing for building large tunnelled Wheeler graphs

Workshop on Algorithms in Bioinformatics Pub Date : 2022-06-30 DOI: 10.4230/LIPIcs.WABI.2022.18

Adrián Goga, Andrej Baláz

Abstract: We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths called blocks that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal sizes that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP of the text is generally much shorter than the original. To speed up the block selection for tunnelling, we apply the PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.
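As a rough illustration of the parse/dictionary split described in the abstract above, the following Python sketch cuts a text into overlapping phrases at "trigger" windows. It is a simplified toy, not the authors' implementation: the function names, the window length `w`, the modulus `p`, and the character-sum hash are all assumptions chosen for readability, and the sentinel characters used by real prefix-free parsing are omitted.

```python
def prefix_free_parse(text, w=2, p=3):
    """Toy prefix-free parsing: returns (dictionary, parse).

    A window of length w is a "trigger" when a simple hash of it is 0 mod p.
    Each phrase ends with a trigger window, and the next phrase starts with
    that same window, so consecutive phrases overlap by exactly w characters.
    """
    def is_trigger(window):
        # Hypothetical deterministic hash: sum of code points modulo p.
        return sum(ord(c) for c in window) % p == 0

    phrases = []
    start = 0
    for i in range(1, len(text) - w + 1):
        if is_trigger(text[i:i + w]):
            phrases.append(text[start:i + w])  # phrase ends with the trigger
            start = i                          # next phrase begins with it
    phrases.append(text[start:])               # final phrase runs to the end

    dictionary = sorted(set(phrases))          # distinct phrases, lexicographic
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]  # text as phrase ranks
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Rebuild the text by dropping the w-character overlap of later phrases."""
    out = dictionary[parse[0]]
    for r in parse[1:]:
        out += dictionary[r][w:]
    return out
```

On a repetitive input, repeated phrases collapse to a single dictionary entry, so the dictionary can be much smaller than the number of phrases in the parse; `reconstruct` shows that the pair still determines the original text.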

Citations: 3

Phyolin: Identifying a Linear Perfect Phylogeny in Single-Cell DNA Sequencing Data of Tumors

Workshop on Algorithms in Bioinformatics Pub Date : 2020-08-01 DOI: 10.4230/LIPIcs.WABI.2020.5

Leah L. Weber, M. El-Kebir

Abstract: Cancer arises from an evolutionary process where somatic mutations occur and eventually give rise to clonal expansions. Modeling this evolutionary process as a phylogeny is useful for treatment decision-making as well as understanding evolutionary patterns across patients and cancer types. However, cancer phylogeny inference from single-cell DNA sequencing data of tumors is challenging due to limitations with sequencing technology and the complexity of the resulting problem. Therefore, as a first step some value might be obtained from correctly classifying the evolutionary process as either linear or branched. The biological implications of these two high-level patterns are different and understanding what cancer types and which patients have each of these trajectories could provide useful insight for both clinicians and researchers. Here, we introduce the Linear Perfect Phylogeny Flipping Problem as a means of testing a null model that the tree topology is linear and show that it is NP-hard. We develop Phyolin and, through both in silico experiments and real data application, show that it is an accurate, easy-to-use, and reasonably fast method for classifying an evolutionary trajectory as linear or branched.

2012 ACM Subject Classification: Applied computing → Molecular evolution

Citations: 1

Near-Linear Time Edit Distance for Indel Channels

Workshop on Algorithms in Bioinformatics Pub Date : 2020-07-06 DOI: 10.4230/LIPIcs.WABI.2020.17

Arun Ganesh, Aaron Sy

Abstract: We consider the following model for sampling pairs of strings: $s_1$ is a uniformly random bitstring of length $n$, and $s_2$ is the bitstring arrived at by applying substitutions, insertions, and deletions to each bit of $s_1$ with some probability. We show that the edit distance between $s_1$ and $s_2$ can be computed in $O(n \ln n)$ time with high probability, as long as each bit of $s_1$ has a mutation applied to it with probability at most a small constant. The algorithm is simple and only uses the textbook dynamic programming algorithm as a primitive, first computing an approximate alignment between the two strings, and then running the dynamic programming algorithm restricted to entries close to the approximate alignment. The analysis of our algorithm provides theoretical justification for alignment heuristics used in practice such as BLAST, FASTA, and MAFFT, which also start by computing approximate alignments quickly and then find the best alignment near the approximate alignment. Our main technical contribution is a partitioning of alignments such that the number of the subsets in the partition is not too large and every alignment in one subset is worse than an alignment considered by our algorithm with high probability. Similar techniques may be of interest in the average-case analysis of other problems commonly solved via dynamic programming.
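The idea of running the textbook dynamic program only on entries close to a candidate alignment can be illustrated with a banded edit-distance sketch in Python. This is not the algorithm from the abstract above: as a stand-in for the approximate alignment it assumes the main diagonal, and it widens the band by doubling until the answer is certified (a classic band-doubling approach). The function names and the `band` parameter are illustrative assumptions.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to DP cells (i, j) with |i - j| <= band.

    Returns None when the bottom-right cell lies outside the band.
    The value is exact whenever the true distance is at most `band`.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    inf = float("inf")
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # delete all i characters of a[:i]
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev.get(m)


def edit_distance(a, b):
    """Double the band until the banded result is certified exact."""
    band = 1
    while True:
        result = banded_edit_distance(a, b, band)
        if result is not None and result <= band:
            return result
        band *= 2
```

Each call touches only about `(2 * band + 1) * len(a)` cells, so when the two strings are close the total work stays far below the full quadratic table.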

Citations: 7

Linear Time Construction of Indexable Founder Block Graphs

Workshop on Algorithms in Bioinformatics Pub Date : 2020-05-19 DOI: 10.4230/LIPIcs.WABI.2020.7

V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu

Abstract: We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size $410 \times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels are 12 and 34968, respectively. The index on the graph takes only $3\%$ of the size of the MSA.

Citations: 19

Weighted Minimum-Length Rearrangement Scenarios

Workshop on Algorithms in Bioinformatics Pub Date : 2019-09-08 DOI: 10.4230/LIPIcs.WABI.2019.13

Pijus Simonaitis, A. Chateau, K. M. Swenson

Abstract: We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ was estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an $O(n^4)$-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with $n$ genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.

Citations: 1

Read Mapping on Genome Variation Graphs

Workshop on Algorithms in Bioinformatics Pub Date : 2019-09-06 DOI: 10.4230/LIPIcs.WABI.2019.7

N. Vaddadi, Rajgopal Srinivasan, N. Sivadasan

Citations: 4

Bounded-Length Smith-Waterman Alignment

Workshop on Algorithms in Bioinformatics Pub Date : 2019-09-06 DOI: 10.4230/LIPIcs.WABI.2019.16

A. Tiskin

Abstract: Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith–Waterman alignment problem on a pair of strings of lengths $m$, $n$, asks for the maximum alignment score across all substring pairs, such that the first substring’s length (respectively, the sum of the two substrings’ lengths) is above the given threshold $w$. The latter problem was introduced by Arslan and Eğecioğlu under the name “local alignment with length threshold”. They proposed a dynamic programming algorithm solving the problem in time $O(mn^2)$, and also an approximation algorithm running in time $O(rmn)$, where $r$ is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time $O(mn)$, assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith–Waterman alignment problem, running in time $O(mn \log n)$. Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others.

2012 ACM Subject Classification: Theory of computation → Pattern matching; Theory of computation → Divide and conquer; Theory of computation → Dynamic programming; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics
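For orientation, the classic unbounded Smith–Waterman score that the bounded-length variants above build on can be computed with the following Python sketch. It is the textbook quadratic dynamic program with a linear gap penalty, not the algorithm of the paper, and it imposes no length threshold. The function name and the default scores (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of a and of b.

    Classic O(len(a) * len(b)) dynamic program with a linear gap penalty.
    Cell values are clamped at 0 so an alignment may start anywhere.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best
```

A bounded-length variant would additionally have to track how long the aligned substrings are, which is where the naive approach picks up its extra factor in running time.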

Citations: 2

TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees

Workshop on Algorithms in Bioinformatics Pub Date : 2019-09-01 DOI: 10.4230/LIPIcs.WABI.2019.4

Sarah A. Christensen, Erin K. Molloy, P. Vachaspati, T. Warnow

Abstract: Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.
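The Robinson-Foulds (RF) distance that the optimization problem above minimizes can be sketched in a few lines of Python. This is only an illustration of the distance itself, not of TRACTION: it assumes both trees are given as nested tuples over the same leaf set, treats them as unrooted, and counts the non-trivial bipartitions present in exactly one tree. All function names here are hypothetical.

```python
def leaf_sets(tree, out):
    """Return the leaf set of `tree`, appending every internal clade to `out`."""
    if not isinstance(tree, tuple):
        return frozenset([tree])  # a leaf label
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial bipartitions, each stored as the side without a reference leaf."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # fixed reference leaf makes each split canonical
    n = len(all_leaves)
    splits = set()
    for clade in clades:
        side = clade if ref not in clade else all_leaves - clade
        if 2 <= len(side) <= n - 2:  # skip trivial (leaf or whole-tree) splits
            splits.add(side)
    return splits


def rf_distance(t1, t2):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(t1) ^ bipartitions(t2))
```

For example, `rf_distance(((("A", "B"), "C"), ("D", "E")), ((("A", "C"), "B"), ("D", "E")))` compares two five-leaf trees that share the split separating D and E from the rest but disagree on the other internal edge.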

Citations: 4

Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms

Workshop on Algorithms in Bioinformatics Pub Date : 2019-06-24 DOI: 10.4230/LIPIcs.WABI.2019.3

Nathan L. Clement

Abstract: The strength or weakness of an algorithm is ultimately governed by the confidence of its result. When the domain of the problem is large (e.g. traversal of a high-dimensional space), a perfect solution cannot be obtained, so approximations must be made. These approximations often lead to a reported quantity of interest (QOI) which varies between runs, decreasing the confidence of any single run. When the algorithm further computes this final QOI based on uncertain or noisy data, the variability (or lack of confidence) of the final QOI increases. Unbounded, these two sources of uncertainty (algorithmic approximations and uncertainty in input data) can result in a reported statistic that has low correlation with ground truth. In biological applications, this is especially applicable, as the search space is generally approximated at least to some degree (e.g. a high percentage of protein structures are invalid or energetically unfavorable) and the explicit conversion from continuous to discrete space for protein representation implies some uncertainty in the input data. This research applies uncertainty quantification techniques to the difficult protein-protein docking problem, first showing the variability that exists in existing software, and then providing a method for computing probabilistic certificates in the form of Chernoff-like bounds. Finally, this paper leverages these probabilistic certificates to accurately bound the uncertainty in docking from two docking algorithms, providing a QOI that is both robust and statistically meaningful.

Citations: 0