Algorithms for Molecular Biology最新文献_第7页

Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution. 基于进化和结构正则化的定向蛋白质进化贝叶斯优化。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-07-01 DOI: 10.1186/s13015-021-00195-4

Trevor S Frisby, Christopher James Langmead

{"title":"Bayesian optimization with evolutionary and structure-based regularization for directed protein evolution.","authors":"Trevor S Frisby, Christopher James Langmead","doi":"10.1186/s13015-021-00195-4","DOIUrl":"https://doi.org/10.1186/s13015-021-00195-4","url":null,"abstract":"Background: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, and so mutations introduced to improve the specified property may come at the expense of unmeasured, but nevertheless important properties (ex. solubility, thermostability, etc). We address this issue by formulating DE as a regularized Bayesian optimization problem where the regularization term reflects evolutionary or structure-based constraints.Results: We applied our approach to DE to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in Machine learning assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods.Conclusion: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"13"},"PeriodicalIF":1.0,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00195-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39141819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Using the longest run subsequence problem within homology-based scaffolding. 在基于同构的脚手架中使用最长运行子序列问题。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-06-28 DOI: 10.1186/s13015-021-00191-8

Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, Gunnar W Klau

{"title":"Using the longest run subsequence problem within homology-based scaffolding.","authors":"Sven Schrinner, Manish Goel, Michael Wulfert, Philipp Spohr, Korbinian Schneeberger, Gunnar W Klau","doi":"10.1186/s13015-021-00191-8","DOIUrl":"https://doi.org/10.1186/s13015-021-00191-8","url":null,"abstract":"Genome assembly is one of the most important problems in computational genomics. Here, we suggest addressing an issue that arises in homology-based scaffolding, that is, when linking and ordering contigs to obtain larger pseudo-chromosomes by means of a second incomplete assembly of a related species. The idea is to use alignments of binned regions in one contig to find the most homologous contig in the other assembly. We show that ordering the contigs of the other assembly can be expressed by a new string problem, the longest run subsequence problem (LRS). We show that LRS is NP-hard and present reduction rules and two algorithmic approaches that, together, are able to solve large instances of LRS to provable optimality. All data used in the experiments as well as our source code are freely available. We demonstrate its usefulness within an existing larger scaffolding approach by solving realistic instances resulting from partial Arabidopsis thaliana assemblies in short computation time.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"11"},"PeriodicalIF":1.0,"publicationDate":"2021-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8240273/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39137713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

The Bourque distances for mutation trees of cancers. 癌症突变树的布尔克距离。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-06-10 DOI: 10.1186/s13015-021-00188-3

Katharina Jahn, Niko Beerenwinkel, Louxin Zhang

引用次数: 1

The energy-spectrum of bicompatible sequences. 双相容序列的能谱。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-06-01 DOI: 10.1186/s13015-021-00187-4

Fenix W Huang, Christopher L Barrett, Christian M Reidys

{"title":"The energy-spectrum of bicompatible sequences.","authors":"Fenix W Huang, Christopher L Barrett, Christian M Reidys","doi":"10.1186/s13015-021-00187-4","DOIUrl":"https://doi.org/10.1186/s13015-021-00187-4","url":null,"abstract":"Background: Genotype-phenotype maps provide a meaningful filtration of sequence space and RNA secondary structures are particular such phenotypes. Compatible sequences, which satisfy the base-pairing constraints of a given RNA structure, play an important role in the context of neutral evolution. Sequences that are simultaneously compatible with two given structures (bicompatible sequences), are beacons in phenotypic transitions, induced by erroneously replicating populations of RNA sequences. RNA riboswitches, which are capable of expressing two distinct secondary structures without changing the underlying sequence, are one example of bicompatible sequences in living organisms.Results: We present a full loop energy model Boltzmann sampler of bicompatible sequences for pairs of structures. The sequence sampler employs a dynamic programming routine whose time complexity is polynomial when assuming the maximum number of exposed vertices, [Formula: see text], is a constant. The parameter [Formula: see text] depends on the two structures and can be very large. We introduce a novel topological framework encapsulating the relations between loops that sheds light on the understanding of [Formula: see text]. Based on this framework, we give an algorithm to sample sequences with minimum [Formula: see text] on a particular topologically classified case as well as giving hints to the solution in the other cases. As a result, we utilize our sequence sampler to study some established riboswitches.Conclusion: Our analysis of riboswitch sequences shows that a pair of structures needs to satisfy key properties in order to facilitate phenotypic transitions and that pairs of random structures are unlikely to do so. Our analysis observes a distinct signature of riboswitch sequences, suggesting a new criterion for identifying native sequences and sequences subjected to evolutionary pressure. Our free software is available at: https://github.com/FenixHuang667/Bifold .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"7"},"PeriodicalIF":1.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00187-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39051451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph. 快速和有效的Rmap装配使用双标记德布鲁因图。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-05-25 DOI: 10.1186/s13015-021-00182-9

Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, Christina Boucher

{"title":"Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph.","authors":"Kingshuk Mukherjee, Massimiliano Rossi, Leena Salmela, Christina Boucher","doi":"10.1186/s13015-021-00182-9","DOIUrl":"https://doi.org/10.1186/s13015-021-00182-9","url":null,"abstract":"Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as RMAPPER, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas Testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome RMAPPER was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"6"},"PeriodicalIF":1.0,"publicationDate":"2021-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00182-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39017832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Exact transcript quantification over splice graphs. 精确转录定量剪接图。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-05-10 DOI: 10.1186/s13015-021-00184-7

Cong Ma, Hongyu Zheng, Carl Kingsford

{"title":"Exact transcript quantification over splice graphs.","authors":"Cong Ma, Hongyu Zheng, Carl Kingsford","doi":"10.1186/s13015-021-00184-7","DOIUrl":"https://doi.org/10.1186/s13015-021-00184-7","url":null,"abstract":"Background: The probability of sequencing a set of RNA-seq reads can be directly modeled using the abundances of splice junctions in splice graphs instead of the abundances of a list of transcripts. We call this model graph quantification, which was first proposed by Bernard et al. (Bioinformatics 30:2447-55, 2014). The model can be viewed as a generalization of transcript expression quantification where every full path in the splice graph is a possible transcript. However, the previous graph quantification model assumes the length of single-end reads or paired-end fragments is fixed.Results: We provide an improvement of this model to handle variable-length reads or fragments and incorporate bias correction. We prove that our model is equivalent to running a transcript quantifier with exactly the set of all compatible transcripts. The key to our method is constructing an extension of the splice graph based on Aho-Corasick automata. The proof of equivalence is based on a novel reparameterization of the read generation model of a state-of-art transcript quantification method.Conclusion: We propose a new approach for graph quantification, which is useful for modeling scenarios where reference transcriptome is incomplete or not available and can be further used in transcriptome assembly or alternative splicing analysis.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"5"},"PeriodicalIF":1.0,"publicationDate":"2021-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00184-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38968170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Tree diet: reducing the treewidth to unlock FPT algorithms in RNA bioinformatics 树的饮食:减少树的宽度解锁RNA生物信息学中的FPT算法

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-05-04 DOI: 10.1186/s13015-022-00213-z

Bertrand Marchand, Y. Ponty, L. Bulteau

引用次数: 1

Improving metagenomic binning results with overlapped bins using assembly graphs. 使用装配图改进重叠箱的宏基因组分类结果。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-05-04 DOI: 10.1186/s13015-021-00185-6

Vijini G Mallawaarachchi, Anuradha S Wickramarachchi, Yu Lin

{"title":"Improving metagenomic binning results with overlapped bins using assembly graphs.","authors":"Vijini G Mallawaarachchi, Anuradha S Wickramarachchi, Yu Lin","doi":"10.1186/s13015-021-00185-6","DOIUrl":"https://doi.org/10.1186/s13015-021-00185-6","url":null,"abstract":"Background: Metagenomic sequencing allows us to study the structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs and these contigs are then binned into clusters of contigs where contigs in a cluster are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for binning contigs only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species).Results: In this paper, we introduce GraphBin2 which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves binning results of existing tools but also supports to assign contigs to multiple bins.Conclusion: GraphBin2 incorporates the coverage information into the assembly graph to refine the binning results obtained from existing binning tools. GraphBin2 also enables the detection of contigs that may belong to multiple species. We show that GraphBin2 outperforms its predecessor GraphBin on both simulated and real datasets. GraphBin2 is freely available at https://github.com/Vini2/GraphBin2 .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"3"},"PeriodicalIF":1.0,"publicationDate":"2021-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00185-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38869054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Fast lightweight accurate xenograft sorting. 快速、轻量、准确的异种移植物分选。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-04-02 DOI: 10.1186/s13015-021-00181-w

Jens Zentgraf, Sven Rahmann

{"title":"Fast lightweight accurate xenograft sorting.","authors":"Jens Zentgraf, Sven Rahmann","doi":"10.1186/s13015-021-00181-w","DOIUrl":"https://doi.org/10.1186/s13015-021-00181-w","url":null,"abstract":"Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results.Results: We show that alignment-free methods for xenograft sorting are superior concerning CPU time usage and equivalent in accuracy. We improve upon the state of the art sorting by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy. Several engineering steps (e.g., shortcuts for unsuccessful lookups, software prefetching) improve the performance even further.Availability: Our software xengsort is available under the MIT license at http://gitlab.com/genomeinformatics/xengsort . It is written in numba-compiled Python and comes with sample Snakemake workflows for hash table construction and dataset processing.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"2"},"PeriodicalIF":1.0,"publicationDate":"2021-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-021-00181-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25554318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

Quantifying steric hindrance and topological obstruction to protein structure superposition. 定量的位阻和拓扑阻对蛋白质结构叠加的影响。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2021-02-27 DOI: 10.1186/s13015-020-00180-3

Peter Røgen

{"title":"Quantifying steric hindrance and topological obstruction to protein structure superposition.","authors":"Peter Røgen","doi":"10.1186/s13015-020-00180-3","DOIUrl":"https://doi.org/10.1186/s13015-020-00180-3","url":null,"abstract":"Background: In computational structural biology, structure comparison is fundamental for our understanding of proteins. Structure comparison is, e.g., algorithmically the starting point for computational studies of structural evolution and it guides our efforts to predict protein structures from their amino acid sequences. Most methods for structural alignment of protein structures optimize the distances between aligned and superimposed residue pairs, i.e., the distances traveled by the aligned and superimposed residues during linear interpolation. Considering such a linear interpolation, these methods do not differentiate if there is room for the interpolation, if it causes steric clashes, or more severely, if it changes the topology of the compared protein backbone curves.Results: To distinguish such cases, we analyze the linear interpolation between two aligned and superimposed backbones. We quantify the amount of steric clashes and find all self-intersections in a linear backbone interpolation. To determine if the self-intersections alter the protein's backbone curve significantly or not, we present a path-finding algorithm that checks if there exists a self-avoiding path in a neighborhood of the linear interpolation. A new path is constructed by altering the linear interpolation using a novel interpretation of Reidemeister moves from knot theory working on three-dimensional curves rather than on knot diagrams. Either the algorithm finds a self-avoiding path or it returns a smallest set of essential self-intersections. Each of these indicates a significant difference between the folds of the aligned protein structures. As expected, we find at least one essential self-intersection separating most unknotted structures from a knotted structure, and we find even larger motions in proteins connected by obstruction free linear interpolations. We also find examples of homologous proteins that are differently threaded, and we find many distinct folds connected by longer but simple deformations. TM-align is one of the most restrictive alignment programs. With standard parameters, it only aligns residues superimposed within 5 Ångström distance. We find 42165 topological obstructions between aligned parts in 142068 TM-alignments. Thus, this restrictive alignment procedure still allows topological dissimilarity of the aligned parts.Conclusions: Based on the data we conclude that our program ProteinAlignmentObstruction provides significant additional information to alignment scores based solely on distances between aligned and superimposed residue pairs.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"16 1","pages":"1"},"PeriodicalIF":1.0,"publicationDate":"2021-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13015-020-00180-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"25411733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3