Algorithms for Molecular Biology: latest articles, page 10

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-11-15 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0157-4

Christophe Ambroise, Alia Dehman, Pierre Neuvial, Guillem Rigaill, Nathalie Vialaneix

Abstract:

Background: Genomic data analyses such as Genome-Wide Association Studies (GWAS) or Hi-C studies are often faced with the problem of partitioning chromosomes into successive regions based on a similarity matrix of high-resolution, locus-level measurements. An intuitive way of doing this is to perform a modified Hierarchical Agglomerative Clustering (HAC), where only adjacent clusters (according to the ordering of positions within a chromosome) are allowed to be merged. But a major practical drawback of this method is its quadratic time and space complexity in the number of loci, which is typically of the order of $10^4$ to $10^5$ for each chromosome.

Results: By assuming that the similarity between physically distant objects is negligible, we are able to propose an implementation of adjacency-constrained HAC with quasi-linear complexity. This is achieved by pre-calculating specific sums of similarities, and storing candidate fusions in a min-heap. Our illustrations on GWAS and Hi-C datasets demonstrate the relevance of this assumption, and show that this method highlights biologically meaningful signals. Thanks to its small time and memory footprint, the method can be run on a standard laptop in minutes or even seconds.

Availability and implementation: Software and sample data are available as an R package, adjclust, that can be downloaded from the Comprehensive R Archive Network (CRAN).

Volume: 14. Pages: 22.
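To make the adjacency constraint concrete, here is a minimal Python sketch of the naive baseline the abstract describes: only neighbouring clusters may merge. It is not the adjclust implementation. The function name `adjacent_hac` is hypothetical, the linkage used is the plain average similarity between two neighbouring clusters (an assumption chosen for brevity), and the band-sparsity shortcut and min-heap from the abstract are deliberately left out.

```python
def adjacent_hac(sim):
    """Naive adjacency-constrained agglomerative clustering.

    sim is a symmetric similarity matrix (list of lists) whose row order
    is the order of positions along the chromosome. Returns the list of
    merged intervals (first, last), inclusive and 0-based, in merge order.
    """
    n = len(sim)
    # Each cluster is a contiguous interval of positions.
    clusters = [(i, i) for i in range(n)]
    merges = []

    def link(a, b):
        # Average similarity between all pairs drawn from intervals a and b
        # (illustrative linkage, assumed here).
        total = sum(sim[i][j]
                    for i in range(a[0], a[1] + 1)
                    for j in range(b[0], b[1] + 1))
        return total / ((a[1] - a[0] + 1) * (b[1] - b[0] + 1))

    while len(clusters) > 1:
        # Only pairs (k, k+1) of neighbouring clusters are candidates.
        k = max(range(len(clusters) - 1),
                key=lambda k: link(clusters[k], clusters[k + 1]))
        merged = (clusters[k][0], clusters[k + 1][1])
        merges.append(merged)
        clusters[k:k + 2] = [merged]
    return merges


sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(adjacent_hac(sim))  # [(0, 1), (2, 3), (0, 3)]
```

Recomputing every linkage at every step is what makes this baseline expensive; the abstract's contribution is precisely to avoid that cost when distant similarities can be ignored.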

Citations: 21

A cubic algorithm for the generalized rank median of three genomes.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-07-26 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0150-y

Leonid Chindelevitch, Sean La, Joao Meidanis

Abstract:

Background: The area of genome rearrangements has given rise to a number of interesting biological, mathematical and algorithmic problems. Among these, one of the most intractable ones has been that of finding the median of three genomes, a special case of the ancestral reconstruction problem. In this work we re-examine our recently proposed way of measuring genome rearrangement distance, namely, the rank distance between the matrix representations of the corresponding genomes, and show that the median of three genomes can be computed exactly in polynomial time $O(n^{\omega})$, where $\omega \le 3$, with respect to this distance, when the median is allowed to be an arbitrary orthogonal matrix.

Results: We define the five fundamental subspaces depending on three input genomes, and use their properties to show that a particular action on each of these subspaces produces a median. In the process we introduce the notion of M-stable subspaces. We also show that the median found by our algorithm is always orthogonal, symmetric, and conserves any adjacencies or telomeres present in at least 2 out of 3 input genomes.

Conclusions: We test our method on both simulated and real data. We find that the majority of the realistic inputs result in genomic outputs, and for those that do not, our two heuristics perform well in terms of reconstructing a genomic matrix attaining a score close to the lower bound, while running in a reasonable amount of time. We conclude that the rank distance is not only theoretically intriguing, but also practically useful for median-finding, and potentially ancestral genome reconstruction.

Pages: 16.

Citations: 2

Linear time minimum segmentation enables scalable founder reconstruction.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-05-17 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0147-6

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen

Abstract:

Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R = \{R_1, \ldots, R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1, n]$ into a set $P$ of disjoint segments such that each segment $[a, b] \in P$ has length at least $L$ and the number $d(a, b) = |\{R_i[a, b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a, b]$ is minimized over $[a, b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a, b) : [a, b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries.

Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$.

Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.

Pages: 12.
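The problem statement above can be illustrated with a small Python sketch. This is a straightforward dynamic program over prefix lengths, not the linear-time algorithm of the paper and not code from the linked repository. The function name `min_segmentation` is hypothetical, and the objective is read as minimizing the maximum of $d(a, b)$ over the segments, which is the quantity the abstract identifies with the number of founder sequences.

```python
def min_segmentation(reads, L):
    """Brute-force dynamic program for the minimum segmentation problem.

    reads: list of m equal-length strings; L: minimum segment length.
    Returns (cost, segments), where cost is the smallest achievable
    max d(a, b) and segments are 1-based inclusive (a, b) pairs.
    """
    n = len(reads[0])
    if any(len(r) != n for r in reads):
        raise ValueError("all strings must have the same length")
    if n < L:
        raise ValueError("no valid segmentation: n < L")

    INF = float("inf")
    best = [INF] * (n + 1)   # best[j]: optimal cost for the prefix [1, j]
    prev = [-1] * (n + 1)    # prev[j]: end of the previous segment
    best[0] = 0

    for j in range(L, n + 1):
        for i in range(0, j - L + 1):
            if best[i] == INF:
                continue
            # d(i + 1, j): number of distinct substrings in the segment.
            d = len({r[i:j] for r in reads})
            cost = max(best[i], d)
            if cost < best[j]:
                best[j], prev[j] = cost, i

    # Walk the back-pointers to recover the segments.
    segments, j = [], n
    while j > 0:
        segments.append((prev[j] + 1, j))
        j = prev[j]
    segments.reverse()
    return best[n], segments


reads = ["AACC", "AAGG", "TTCC", "TTGG"]
print(min_segmentation(reads, 2))  # (2, [(1, 2), (3, 4)])
print(min_segmentation(reads, 3))  # (4, [(1, 4)])
```

In the first call two founder sequences suffice because each half of the alignment contains only two distinct blocks; forcing a single segment with L = 3 requires all four haplotypes.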

Citations: 13

Repairing Boolean logical models from time-series data using Answer Set Programming.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-03-25 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0145-8

Alexandre Lemos, Inês Lynce, Pedro T Monteiro

Abstract:

Background: Boolean models of biological signalling-regulatory networks are increasingly used to formally describe and understand complex biological processes. These models may become inconsistent as new data become available and need to be repaired. In the past, the focus has been on the inference of (classes of) models given an interaction network and time-series data sets. However, repair of existing models against new data is still in its infancy, where the process is still manually performed and therefore slow and prone to errors.

Results: In this work, we propose a method with an associated tool to suggest repairs over inconsistent Boolean models, based on a set of atomic repair operations. Answer Set Programming is used to encode the minimal repair problem as a combinatorial optimization problem. In particular, given an inconsistent model, the tool provides the minimal repairs that render the model capable of generating dynamics coherent with a (set of) time-series data set(s), considering either a synchronous or an asynchronous updating scheme.

Conclusions: The method was validated using known biological models from different species, as well as synthetic models obtained from randomly generated networks. We discuss the method's limitations regarding each of the updating schemes and the considered minimization algorithm.

Pages: 9.

Citations: 5

Connectivity problems on heterogeneous graphs.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-03-08 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0141-z

Jimmy Wu, Alex Khodaverdian, Benjamin Weitz, Nir Yosef

Abstract:

Background: Network connectivity problems are abundant in computational biology research, where graphs are used to represent a range of phenomena: from physical interactions between molecules to more abstract relationships such as gene co-expression. One common challenge in studying biological networks is the need to extract meaningful, small subgraphs out of large databases of potential interactions. A useful abstraction for this task turned out to be the Steiner Network problems: given a reference "database" graph, find a parsimonious subgraph that satisfies a given set of connectivity demands. While this formulation proved useful in a number of instances, the next challenge is to account for the fact that the reference graph may not be static. This can happen, for instance, when studying protein measurements in single cells or at different time points, whereby different subsets of conditions can have different protein milieu.

Results and discussion: We introduce the condition Steiner Network problem in which we concomitantly consider a set of distinct biological conditions. Each condition is associated with a set of connectivity demands, as well as a set of edges that are assumed to be present in that condition. The goal of this problem is to find a minimal subgraph that satisfies all the demands through paths that are present in the respective condition. We show that introducing multiple conditions as an additional factor makes this problem much harder to approximate. Specifically, we prove that for $C$ conditions, this new problem is NP-hard to approximate to a factor of $C - \epsilon$, for every $C \ge 2$ and $\epsilon > 0$, and that this bound is tight. Moving beyond the worst case, we explore a special set of instances where the reference graph grows monotonically between conditions, and show that this problem admits substantially improved approximation algorithms. We also developed an integer linear programming solver for the general problem and demonstrate its ability to reach optimality with instances from the human protein interaction network.

Conclusion: Our results demonstrate that in contrast to most connectivity problems studied in computational biology, accounting for multiplicity of biological conditions adds considerable complexity, which we propose to address with a new solver. Importantly, our results extend to several network connectivity problems that are commonly used in computational biology, such as Prize-Collecting Steiner Tree, and provide insight into the theoretical guarantees for their applications in a multiple condition setting.

Pages: 5.

Citations: 0

External memory BWT and LCP computation for sequence collections with applications.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-03-08 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0140-0

Lavinia Egidi, Felipe A Louza, Giovanni Manzini, Guilherme P Telles

Abstract:

Background: Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory while making the best possible use of the available RAM.

Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.

Conclusions: We prove that our algorithm performs $O(n \, \mathrm{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathrm{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

Pages: 6.
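For readers unfamiliar with the two arrays named in this abstract, the following Python sketch computes the BWT and the LCP array of a single short string entirely in memory. It only illustrates the definitions: it sorts suffixes directly, so it is nothing like the external-memory merging algorithm the paper proposes. The function name `bwt_and_lcp` is hypothetical, and the terminator `$` is assumed to be absent from the text and smaller than every other character.

```python
def bwt_and_lcp(text, terminator="$"):
    """Return (bwt, lcp) for text + terminator using naive suffix sorting."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator

    # Suffix array: starting positions of the suffixes in sorted order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])

    # BWT: the character preceding each sorted suffix (wrapping around).
    bwt = "".join(s[i - 1] for i in sa)

    def common_prefix(i, j):
        # Length of the longest common prefix of s[i:] and s[j:].
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k

    # LCP[r]: common prefix length of the suffixes of rank r - 1 and r.
    lcp = [0] + [common_prefix(sa[r - 1], sa[r]) for r in range(1, len(sa))]
    return bwt, lcp


print(bwt_and_lcp("banana"))  # ('annb$aa', [0, 0, 1, 3, 0, 0, 2])
```

The largest entry of the LCP array (3 here, for the suffixes `ana$` and `anana$`) is the `maxlcp` quantity that appears in the I/O bound above.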

Citations: 30

Semi-nonparametric modeling of topological domain formation from epigenetic data.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-03-05 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0142-y

Emre Sefer, Carl Kingsford

Abstract:

Background: Hi-C experiments capturing the 3D genome architecture have led to the discovery of topologically-associated domains (TADs) that form an important part of the 3D genome organization and appear to play a role in gene regulation and other functions. Several histone modifications have been independently associated with TAD formation, but their combinatorial effects on domain formation remain poorly understood at a global scale.

Results: We propose a convex semi-nonparametric approach called nTDP based on Bernstein polynomials to explore the joint effects of histone markers on TAD formation as well as predict TADs solely from the histone data. We find a small subset of modifications to be predictive of TADs across species. By inferring TADs using our trained model, we are able to predict TADs across different species and cell types, without the use of Hi-C data, suggesting their effect is conserved. This work provides the first comprehensive joint model of the effect of histone markers on domain formation.

Conclusions: Our approach, nTDP, can form the basis of a unified, explanatory model of the relationship between epigenetic marks and topological domain structures. It can be used to predict domain boundaries for cell types, species, and conditions for which no Hi-C data is available. The model may also be of use for improving Hi-C-based domain finders.

Pages: 4.

Citations: 13

SNPs detection by eBWT positional clustering.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-02-06 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0137-8

Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone

Abstract:

Background: Sequencing technologies keep on turning cheaper and faster, thus putting a growing pressure for data structures designed to efficiently store raw data, and possibly perform analysis therein. In this view, there is a growing interest in alignment-free and reference-free variants calling methods that only make use of (suitably indexed) raw reads data.

Results: We develop the positional clustering theory that (i) describes how the extended Burrows-Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) exhibits an elegant and precise LCP array based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNPs calling method, and we devised a consequent SNPs calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays as, in accordance with our theoretical framework, they are within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP.

Conclusions: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used for the problem of identifying SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data.

Availability: The software ebwt2snp is freely available for academic use at: https://github.com/nicolaprezza/ebwt2snp.

Pages: 3.

Citations: 21

Constrained incremental tree building: new absolute fast converging phylogeny estimation methods with improved scalability and accuracy.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-02-06 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0136-9

Qiuyi Zhang, Satish Rao, Tandy Warnow

Abstract:

Background: Absolute fast converging (AFC) phylogeny estimation methods are ones that have been proven to recover the true tree with high probability given sequences whose lengths are polynomial in the number of leaves in the tree (once the shortest and longest branch weights are fixed). While there has been a large literature on AFC methods, the best in terms of empirical performance was $\mathrm{DCM}_{\mathrm{NJ}}$, published in SODA 2001. The main empirical advantage of $\mathrm{DCM}_{\mathrm{NJ}}$ over other AFC methods is its use of neighbor joining (NJ) to construct trees on smaller taxon subsets, which are then combined into a tree on the full set of species using a supertree method; in contrast, the other AFC methods in essence depend on quartet trees that are computed independently of each other, which reduces accuracy compared to neighbor joining. However, $\mathrm{DCM}_{\mathrm{NJ}}$ is unlikely to scale to large datasets due to its reliance on supertree methods, as no current supertree methods are able to scale to large datasets with high accuracy.

Results: In this study we present a new approach to large-scale phylogeny estimation that shares some of the features of $\mathrm{DCM}_{\mathrm{NJ}}$ but bypasses the use of supertree methods. We prove that this new approach is AFC and uses polynomial time and space. Furthermore, we describe variations on this basic approach that can be used with leaf-disjoint constraint trees (computed using methods such as maximum likelihood) to produce other methods that are likely to provide even better accuracy. Thus, we present a new generalizable technique for large-scale tree estimation that is designed to improve scalability for phylogeny estimation methods to ultra-large datasets, and that can be used in a variety of settings (including tree estimation from unaligned sequences, and species tree estimation from gene trees).

Pages: 2.

Citations: 12

Automated partial atomic charge assignment for drug-like molecules: a fast knapsack approach.

IF 1, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2019-02-05 eCollection Date: 2019-01-01 DOI: 10.1186/s13015-019-0138-7

Martin S Engler, Bertrand Caron, Lourens Veen, Daan P Geerke, Alan E Mark, Gunnar W Klau

Abstract:

A key factor in computational drug design is the consistency and reliability with which intermolecular interactions between a wide variety of molecules can be described. Here we present a procedure to efficiently, reliably and automatically assign partial atomic charges to atoms based on known distributions. We formally introduce the molecular charge assignment problem, where the task is to select a charge from a set of candidate charges for every atom of a given query molecule. Charges are accompanied by a score that depends on their observed frequency in similar neighbourhoods (chemical environments) in a database of previously parameterised molecules. The aim is to assign the charges such that the total charge equals a known target charge within a margin of error while maximizing the sum of the charge scores. We show that the problem is a variant of the well-studied multiple-choice knapsack problem and thus weakly NP-complete. We propose solutions based on Integer Linear Programming and a pseudo-polynomial time Dynamic Programming algorithm. We demonstrate that the results obtained for novel molecules not included in the database are comparable to the ones obtained performing explicit charge calculations while decreasing the time to determine partial charges for a molecule from hours or even days to below a second. Our software is openly available.

Pages: 1.

Citations: 8