Algorithms for Molecular Biology: Latest Articles

Non-parametric correction of estimated gene trees using TRACTION.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2020-01-04. eCollection Date: 2020-01-01. Volume: 15. Pages: 1. DOI: 10.1186/s13015-019-0161-8
Sarah Christensen, Erin K Molloy, Pranjal Vachaspati, Ananya Yammanuru, Tandy Warnow
Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present.
Results: Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or horizontal gene transfer (HGT). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) Problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson-Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
Citations: 5
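The Robinson-Foulds (RF) distance that the RF-OTRC problem minimizes can be illustrated with a short sketch. This is not TRACTION itself: it only computes the RF distance between two trees on the same leaf set, as the number of non-trivial bipartitions found in one tree but not the other. The nested-tuple tree encoding and the names `leaf_set`, `bipartitions`, and `rf_distance` are assumptions made for this example.

```python
def leaf_set(tree):
    """Collect the leaf labels of a tree encoded as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, leaves):
    """Return the non-trivial bipartitions (splits) induced by the tree's edges."""
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        rest = leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(rest) > 1:
            # An unordered pair of sides, so A|B and B|A are the same split.
            splits.add(frozenset([below, rest]))
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """RF distance: size of the symmetric difference of the two split sets."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("both trees must have the same leaf set")
    return len(bipartitions(tree_a, leaves) ^ bipartitions(tree_b, leaves))


# Hypothetical example trees on five species.
gene_tree = ((("A", "B"), "C"), ("D", "E"))
species_tree = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(gene_tree, species_tree))  # prints 2
```

The two example trees share the split ABC|DE and differ in one split each (AB|CDE versus AC|BDE), which gives a distance of 2.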
Kohdista: an efficient method to index and query possible Rmap alignments.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-12-12. eCollection Date: 2019-01-01. Pages: 25. DOI: 10.1186/s13015-019-0160-9
Martin D Muggli, Simon J Puglisi, Christina Boucher
Background: Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging.
Results: We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions.
Conclusion: We demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.
Citations: 6
NANUQ: a method for inferring species networks from gene trees under the coalescent model.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-12-06. eCollection Date: 2019-01-01. Pages: 24. DOI: 10.1186/s13015-019-0159-2
Elizabeth S Allman, Hector Baños, John A Rhodes
Species networks generalize the notion of species trees to allow for hybridization or other lateral gene transfer. Under the network multispecies coalescent model, individual gene trees arising from a network can have any topology, but arise with frequencies dependent on the network structure and numerical parameters. We propose a new algorithm for statistical inference of a level-1 species network under this model, from data consisting of gene tree topologies, and provide the theoretical justification for it. The algorithm is based on an analysis of quartets displayed on gene trees, combining several statistical hypothesis tests with combinatorial ideas such as a quartet-based intertaxon distance appropriate to networks, the NeighborNet algorithm for circular split systems, and the Circular Network algorithm for constructing a splits graph.
Citations: 38
TMRS: an algorithm for computing the time to the most recent substitution event from a multiple alignment column.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-11-18. eCollection Date: 2019-01-01. Pages: 23. DOI: 10.1186/s13015-019-0158-3
Hisanori Kiryu, Yuto Ichikawa, Yasuhiro Kojima
Background: As the number of sequenced genomes grows, researchers have access to an increasingly rich source for discovering detailed evolutionary information. However, the computational technologies for inferring biologically important evolutionary events are not sufficiently developed.
Results: We present algorithms to estimate the evolutionary time ($t_{\mathrm{MRS}}$) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. As the confidence in estimated $t_{\mathrm{MRS}}$ values varies depending on gap fractions and nucleotide patterns of alignment columns, we also compute the standard deviation $\sigma$ of $t_{\mathrm{MRS}}$ by using a dynamic programming algorithm. We identified a number of human genomic sites at which the last substitutions occurred between two speciation events in the human lineage with confidence. A large fraction of such sites have substitutions that occurred between the concestor nodes of Hominoidea and Euarchontoglires. We investigated the correlation between tissue-specific transcribed enhancers and the distribution of the sites with specific substitution time intervals, and found that brain-specific transcribed enhancers are threefold enriched in the density of substitutions in the human lineage relative to expectations.
Conclusions: We have presented algorithms to estimate the evolutionary time ($t_{\mathrm{MRS}}$) to the most recent substitution event from a multiple alignment column by using a probabilistic model of sequence evolution. Our algorithms will be useful for Evo-Devo studies, as they facilitate screening potential genomic sites that have played an important role in the acquisition of unique biological features by target species.
Citations: 0
Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-11-15. eCollection Date: 2019-01-01. Volume: 14. Pages: 22. DOI: 10.1186/s13015-019-0157-4
Christophe Ambroise, Alia Dehman, Pierre Neuvial, Guillem Rigaill, Nathalie Vialaneix
Background: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of $10^4$ to $10^5$ for each chromosome.
Results: By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds.
Availability and implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).
Citations: 21
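The adjacency constraint described in the abstract above, where only neighbouring clusters along the chromosome may be merged, can be sketched naively as follows. This is an illustration under stated assumptions, not the adjclust implementation: it uses average between-cluster similarity as the linkage (a stand-in chosen for brevity), recomputes every score from scratch instead of using pre-calculated sums and a min-heap, and therefore does not achieve the quasi-linear complexity the authors report. The function name `adjacent_hac` and the toy matrix are hypothetical.

```python
def adjacent_hac(sim):
    """Naive adjacency-constrained agglomerative clustering.

    `sim` is a symmetric similarity matrix (list of lists). Clusters are
    contiguous index intervals (start, end), both inclusive. At each step the
    pair of adjacent clusters with the highest average similarity is merged.
    Returns the list of merges in the order they were performed.
    """
    n = len(sim)
    clusters = [(i, i) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best_k, best_score = None, None
        # Only pairs (k, k + 1) of neighbouring clusters are candidates.
        for k in range(len(clusters) - 1):
            (a, b), (c, d) = clusters[k], clusters[k + 1]
            total = sum(sim[i][j] for i in range(a, b + 1) for j in range(c, d + 1))
            score = total / ((b - a + 1) * (d - c + 1))
            if best_score is None or score > best_score:
                best_k, best_score = k, score
        left, right = clusters[best_k], clusters[best_k + 1]
        merges.append((left, right))
        # Replace the two neighbours by the interval that spans both.
        clusters[best_k:best_k + 2] = [(left[0], right[1])]
    return merges


# Hypothetical 4-locus similarity matrix with two obvious blocks.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(adjacent_hac(toy_sim))
```

On the toy matrix the two blocks (loci 0 to 1 and loci 2 to 3) are formed first and joined last.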
A cubic algorithm for the generalized rank median of three genomes.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-07-26. eCollection Date: 2019-01-01. Pages: 16. DOI: 10.1186/s13015-019-0150-y
Leonid Chindelevitch, Sean La, Joao Meidanis
Background: The area of genome rearrangements has given rise to a number of interesting biological, mathematical and algorithmic problems. Among these, one of the most intractable ones has been that of finding the median of three genomes, a special case of the ancestral reconstruction problem. In this work we re-examine our recently proposed way of measuring genome rearrangement distance, namely, the rank distance between the matrix representations of the corresponding genomes, and show that the median of three genomes can be computed exactly in polynomial time $O(n^{\omega})$, where $\omega \le 3$, with respect to this distance, when the median is allowed to be an arbitrary orthogonal matrix.
Results: We define the five fundamental subspaces depending on three input genomes, and use their properties to show that a particular action on each of these subspaces produces a median. In the process we introduce the notion of M-stable subspaces. We also show that the median found by our algorithm is always orthogonal, symmetric, and conserves any adjacencies or telomeres present in at least 2 out of 3 input genomes.
Conclusions: We test our method on both simulated and real data. We find that the majority of the realistic inputs result in genomic outputs, and for those that do not, our two heuristics perform well in terms of reconstructing a genomic matrix attaining a score close to the lower bound, while running in a reasonable amount of time. We conclude that the rank distance is not only theoretically intriguing, but also practically useful for median-finding, and potentially ancestral genome reconstruction.
Citations: 2
Linear time minimum segmentation enables scalable founder reconstruction.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-05-17. eCollection Date: 2019-01-01. Pages: 12. DOI: 10.1186/s13015-019-0147-6
Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen
Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1, n]$ into set $P$ of disjoint segments such that each segment $[a, b] \in P$ has length at least $L$ and the number $d(a, b) = |\{R_i[a, b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a, b]$ is minimized over $[a, b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a, b) : [a, b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries.
Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$.
Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
Citations: 13
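The segmentation formulation above can be made concrete with a small dynamic program. This is a naive sketch, not the authors' linear-time algorithm: it tries every admissible segment boundary and counts distinct substrings directly, so it is far slower than $O(mn)$. It reads the objective as minimizing the maximum of $d(a, b)$ over the segments, which is an interpretation of the abstract. Positions are 0-based and segments are half-open, unlike the closed $[1, n]$ convention in the text, and the name `min_segmentation` is hypothetical.

```python
def min_segmentation(rows, min_len):
    """Partition columns [0, n) into segments of length >= min_len so that the
    largest number of distinct row substrings in any segment is minimized.

    Returns (best_value, segments) with segments as half-open (start, end) pairs.
    """
    n = len(rows[0])
    inf = float("inf")
    best = [inf] * (n + 1)   # best[j]: optimal value for the prefix [0, j)
    prev = [None] * (n + 1)  # prev[j]: start of the last segment in that optimum
    best[0] = 0
    for j in range(min_len, n + 1):
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue  # the prefix [0, i) cannot be segmented
            distinct = len({row[i:j] for row in rows})  # d(i, j)
            cost = max(best[i], distinct)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    if best[n] == inf:
        raise ValueError("no valid segmentation: sequences shorter than min_len")
    # Walk back through prev to recover the segments.
    segments = []
    j = n
    while j > 0:
        i = prev[j]
        segments.append((i, j))
        j = i
    return best[n], segments[::-1]


# Hypothetical toy haplotypes.
haplotypes = ["AABB", "AABA", "BBBB", "BBBA"]
print(min_segmentation(haplotypes, 2))  # prints (2, [(0, 2), (2, 4)])
```

With a single segment the four toy haplotypes are all distinct, while splitting after the second column leaves two distinct blocks per segment, so two founder sequences suffice in this example.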
Repairing Boolean logical models from time-series data using Answer Set Programming.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-03-25. eCollection Date: 2019-01-01. Pages: 9. DOI: 10.1186/s13015-019-0145-8
Alexandre Lemos, Inês Lynce, Pedro T Monteiro
Background: Boolean models of biological signalling-regulatory networks are increasingly used to formally describe and understand complex biological processes. These models may become inconsistent as new data become available and need to be repaired. In the past, the focus has been on the inference of (classes of) models given an interaction network and time-series data sets. However, repair of existing models against new data is still in its infancy, where the process is still manually performed and therefore slow and prone to errors.
Results: In this work, we propose a method with an associated tool to suggest repairs over inconsistent Boolean models, based on a set of atomic repair operations. Answer Set Programming is used to encode the minimal repair problem as a combinatorial optimization problem. In particular, given an inconsistent model, the tool provides the minimal repairs that render the model capable of generating dynamics coherent with a (set of) time-series data set(s), considering either a synchronous or an asynchronous updating scheme.
Conclusions: The method was validated using known biological models from different species, as well as synthetic models obtained from randomly generated networks. We discuss the method's limitations regarding each of the updating schemes and the considered minimization algorithm.
Citations: 5
Connectivity problems on heterogeneous graphs.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-03-08. eCollection Date: 2019-01-01. Pages: 5. DOI: 10.1186/s13015-019-0141-z
Jimmy Wu, Alex Khodaverdian, Benjamin Weitz, Nir Yosef
Background: Network connectivity problems are abundant in computational biology research, where graphs are used to represent a range of phenomena: from physical interactions between molecules to more abstract relationships such as gene co-expression. One common challenge in studying biological networks is the need to extract meaningful, small subgraphs out of large databases of potential interactions. A useful abstraction for this task turned out to be the Steiner Network problems: given a reference "database" graph, find a parsimonious subgraph that satisfies a given set of connectivity demands. While this formulation proved useful in a number of instances, the next challenge is to account for the fact that the reference graph may not be static. This can happen, for instance, when studying protein measurements in single cells or at different time points, whereby different subsets of conditions can have different protein milieu.
Results and discussion: We introduce the condition Steiner Network problem in which we concomitantly consider a set of distinct biological conditions. Each condition is associated with a set of connectivity demands, as well as a set of edges that are assumed to be present in that condition. The goal of this problem is to find a minimal subgraph that satisfies all the demands through paths that are present in the respective condition. We show that introducing multiple conditions as an additional factor makes this problem much harder to approximate. Specifically, we prove that for $C$ conditions, this new problem is NP-hard to approximate to a factor of $C - \epsilon$, for every $C \ge 2$ and $\epsilon > 0$, and that this bound is tight. Moving beyond the worst case, we explore a special set of instances where the reference graph grows monotonically between conditions, and show that this problem admits substantially improved approximation algorithms. We also developed an integer linear programming solver for the general problem and demonstrate its ability to reach optimality with instances from the human protein interaction network.
Conclusion: Our results demonstrate that in contrast to most connectivity problems studied in computational biology, accounting for multiplicity of biological conditions adds considerable complexity, which we propose to address with a new solver. Importantly, our results extend to several network connectivity problems that are commonly used in computational biology, such as Prize-Collecting Steiner Tree, and provide insight into the theoretical guarantees for their applications in a multiple condition setting.
Citations: 0
External memory BWT and LCP computation for sequence collections with applications.
IF 1, Zone 4, Biology
Algorithms for Molecular Biology. Pub Date: 2019-03-08. eCollection Date: 2019-01-01. Pages: 6. DOI: 10.1186/s13015-019-0140-0
Lavinia Egidi, Felipe A Louza, Giovanni Manzini, Guilherme P Telles
Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.
Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.
Conclusions: We prove that our algorithm performs $O(n\,\mathrm{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathrm{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
Citations: 30
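For readers unfamiliar with the two data structures named in the abstract above, the following sketch builds the BWT and LCP array of a single short string entirely in memory. It is only a definition-level illustration: it sorts suffixes naively and has nothing to do with the external-memory merging strategy of the paper, which targets whole collections. The function names and the `$` end marker (assumed absent from the input and smaller than every other character) are assumptions of this example.

```python
def suffix_array(text):
    """Indices of all suffixes of `text`, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_and_lcp(text):
    """Return (BWT string, LCP array) for `text` with a '$' end marker appended.

    BWT[k] is the character preceding the k-th smallest suffix.
    LCP[k] is the length of the longest common prefix of the k-th smallest
    suffix and the one before it in sorted order (LCP[0] is 0).
    """
    text = text + "$"
    sa = suffix_array(text)
    # text[i - 1] wraps around to the end marker when i == 0.
    bwt = "".join(text[i - 1] for i in sa)
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[k] = h
    return bwt, lcp


print(bwt_and_lcp("banana"))  # prints ('annb$aa', [0, 0, 1, 3, 0, 0, 2])
```

The largest entry of the LCP array, 3 here for the suffixes "ana$" and "anana$", corresponds to the maxlcp quantity that appears in the I/O bound.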