Algorithms for Molecular Biology最新文献_第2页

Mem-based pangenome indexing for k-mer queries. 针对 k-mer 查询的基于 Mem 的泛基因组索引。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2025-03-01 DOI: 10.1186/s13015-025-00272-y

Stephen Hwang, Nathaniel K Brown, Omar Y Ahmed, Katharine M Jenike, Sam Kovaka, Michael C Schatz, Ben Langmead

{"title":"Mem-based pangenome indexing for k-mer queries.","authors":"Stephen Hwang, Nathaniel K Brown, Omar Y Ahmed, Katharine M Jenike, Sam Kovaka, Michael C Schatz, Ben Langmead","doi":"10.1186/s13015-025-00272-y","DOIUrl":"10.1186/s13015-025-00272-y","url":null,"abstract":"Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8 <math><mo>×</mo></math> smaller than a comparable KMC3 index and 11.4 <math><mo>×</mo></math> smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5 <math><mo>×</mo></math> faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"3"},"PeriodicalIF":1.5,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871630/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143538063","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Finding high posterior density phylogenies by systematically extending a directed acyclic graph. 通过系统地扩展有向无环图来寻找高后验密度系统发育。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2025-02-28 DOI: 10.1186/s13015-025-00273-x

Chris Jennings-Shaffer, David H Rich, Matthew Macaulay, Michael D Karcher, Tanvi Ganapathy, Shosuke Kiami, Anna Kooperberg, Cheng Zhang, Marc A Suchard, Frederick A Matsen

{"title":"Finding high posterior density phylogenies by systematically extending a directed acyclic graph.","authors":"Chris Jennings-Shaffer, David H Rich, Matthew Macaulay, Michael D Karcher, Tanvi Ganapathy, Shosuke Kiami, Anna Kooperberg, Cheng Zhang, Marc A Suchard, Frederick A Matsen","doi":"10.1186/s13015-025-00273-x","DOIUrl":"10.1186/s13015-025-00273-x","url":null,"abstract":"Bayesian phylogenetics typically estimates a posterior distribution, or aspects thereof, using Markov chain Monte Carlo methods. These methods integrate over tree space by applying local rearrangements to move a tree through its space as a random walk. Previous work explored the possibility of replacing this random walk with a systematic search, but was quickly overwhelmed by the large number of probable trees in the posterior distribution. In this paper we develop methods to sidestep this problem using a recently introduced structure called the subsplit directed acyclic graph (sDAG). This structure can represent many trees at once, and local rearrangements of trees translate to methods of enlarging the sDAG. Here we propose two methods of introducing, ranking, and selecting local rearrangements on sDAGs to produce a collection of trees with high posterior density. One of these methods successfully recovers the set of high posterior density trees across a range of data sets. However, we find that a simpler strategy of aggregating trees into an sDAG in fact is computationally faster and returns a higher fraction of probable trees.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"2"},"PeriodicalIF":1.5,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11869616/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143532146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Fractional hitting sets for efficient multiset sketching. 分数命中集用于高效的多集素描。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2025-02-08 DOI: 10.1186/s13015-024-00268-0

Timothé Rouzé, Igor Martayan, Camille Marchet, Antoine Limasset

{"title":"Fractional hitting sets for efficient multiset sketching.","authors":"Timothé Rouzé, Igor Martayan, Camille Marchet, Antoine Limasset","doi":"10.1186/s13015-024-00268-0","DOIUrl":"10.1186/s13015-024-00268-0","url":null,"abstract":"The exponential increase in publicly available sequencing data and genomic resources necessitates the development of highly efficient methods for data processing and analysis. Locality-sensitive hashing techniques have successfully transformed large datasets into smaller, more manageable sketches while maintaining comparability using metrics such as Jaccard and containment indices. However, fixed-size sketches encounter difficulties when applied to divergent datasets. Scalable sketching methods, such as sourmash, provide valuable solutions but still lack resource-efficient, tailored indexing. Our objective is to create lighter sketches with comparable results while enhancing efficiency. We introduce the concept of Fractional Hitting Sets, a generalization of Universal Hitting Sets, which cover a specified fraction of the k-mer space. In theory and practice, we demonstrate the feasibility of achieving such coverage with simple but highly efficient schemes. By encoding the covered k-mers as super-k-mers, we provide a space-efficient exact representation that also enables optimized comparisons. Our novel tool, supersampler, implements this scheme, and experimental results with real bacterial collections closely match our theoretical findings. In comparison to sourmash, supersampler achieves similar outcomes while utilizing an order of magnitude less space and memory and operating several times faster. This highlights the potential of our approach in addressing the challenges presented by the ever-expanding landscape of genomic data. supersampler is an open-source software and can be accessed at https://github.com/TimRouze/supersampler . The data required to reproduce the results presented in this manuscript is available at https://github.com/TimRouze/supersampler/experiments .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"1"},"PeriodicalIF":1.5,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807336/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143374779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

On the parameterized complexity of the median and closest problems under some permutation metrics. 若干置换度量下中值和最近邻问题的参数化复杂度。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-12-24 DOI: 10.1186/s13015-024-00269-z

Luís Cunha, Ignasi Sau, Uéverton Souza

{"title":"On the parameterized complexity of the median and closest problems under some permutation metrics.","authors":"Luís Cunha, Ignasi Sau, Uéverton Souza","doi":"10.1186/s13015-024-00269-z","DOIUrl":"10.1186/s13015-024-00269-z","url":null,"abstract":"Genome rearrangements are events where large blocks of DNA exchange places during evolution. The analysis of these events is a promising tool for understanding evolutionary genomics, providing data for phylogenetic reconstruction based on genome rearrangement measures. Many pairwise rearrangement distances have been proposed, based on finding the minimum number of rearrangement events to transform one genome into the other, using some predefined operation. When more than two genomes are considered, we have the more challenging problem of rearrangement-based phylogeny reconstruction. Given a set of genomes and a distance notion, there are at least two natural ways to define the \"target\" genome. On the one hand, finding a genome that minimizes the sum of the distances from this to any other, called the median genome. On the other hand, finding a genome that minimizes the maximum distance to any other, called the closest genome. Considering genomes as permutations of distinct integers, some distance metrics have been extensively studied. We investigate the median and closest problems on permutations over the following metrics: breakpoint distance, swap distance, block-interchange distance, short-block-move distance, and transposition distance. In biological applications some values are usually very small, such as the solution value d or the number k of input permutations. For each of these metrics and parameters d or k, we analyze the closest and the median problems from the viewpoint of parameterized complexity. We obtain the following results: NP-hardness for finding the median/closest permutation regarding some metrics of distance, even for only <math><mrow><mi>k</mi> <mo>=</mo> <mn>3</mn></mrow> </math> permutations; Polynomial kernels for the problems of finding the median permutation of all studied metrics, considering the target distance d as parameter; NP-hardness result for finding the closest permutation by short-block-moves; FPT algorithms and infeasibility of polynomial kernels for finding the closest permutation for some metrics when parameterized by the target distance d.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"24"},"PeriodicalIF":1.5,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669244/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142885647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TINNiK: inference of the tree of blobs of a species network under the coalescent model. TINNiK：聚合模型下的物种网络 Blob 树推断。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-11-05 DOI: 10.1186/s13015-024-00266-2

Elizabeth S Allman, Hector Baños, Jonathan D Mitchell, John A Rhodes

引用次数: 0

New generalized metric based on branch length distance to compare B cell lineage trees. 基于分支长度距离的新通用指标，用于比较 B 细胞系树。

IF 1.5 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-10-05 DOI: 10.1186/s13015-024-00267-1

Mahsa Farnia, Nadia Tahiri

{"title":"New generalized metric based on branch length distance to compare B cell lineage trees.","authors":"Mahsa Farnia, Nadia Tahiri","doi":"10.1186/s13015-024-00267-1","DOIUrl":"10.1186/s13015-024-00267-1","url":null,"abstract":"The B cell lineage tree encapsulates the successive phases of B cell differentiation and maturation, transitioning from hematopoietic stem cells to mature, antibody-secreting cells within the immune system. Mathematically, this lineage can be conceptualized as an evolutionary tree, where each node represents a distinct stage in B cell development, and the edges reflect the differentiation pathways. To compare these lineage trees, a rigorous mathematical metric is essential. Analyzing B cell lineage trees mathematically and quantifying changes in lineage attributes over time necessitates a comparison methodology capable of accurately assessing and measuring these changes. Addressing the intricacies of multiple B cell lineage tree comparisons, this study introduces a novel metric that enhances the precision of comparative analysis. This metric is formulated on principles of metric theory and evolutionary biology, quantifying the dissimilarities between lineage trees by measuring branch length distance and weight. By providing a framework for systematically classifying lineage trees, this metric facilitates the development of predictive models that are crucial for the creation of targeted immunotherapy and vaccines. To validate the effectiveness of this new metric, synthetic datasets that mimic the complexity and variability of real B cell lineage structures are employed. We demonstrated the ability of the new metric method to accurately capture the evolutionary nuances of B cell lineages.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"22"},"PeriodicalIF":1.5,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11453055/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142378550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Metric multidimensional scaling for large single-cell datasets using neural networks. 利用神经网络对大型单细胞数据集进行度量多维缩放。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-06-11 DOI: 10.1186/s13015-024-00265-3

Stefan Canzar, Van Hoan Do, Slobodan Jelić, Sören Laue, Domagoj Matijević, Tomislav Prusina

引用次数: 0

Compression algorithm for colored de Bruijn graphs. 彩色德布鲁因图的压缩算法。

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-05-26 DOI: 10.1186/s13015-024-00254-6

Amatur Rahman, Yoann Dufresne, Paul Medvedev

{"title":"Compression algorithm for colored de Bruijn graphs.","authors":"Amatur Rahman, Yoann Dufresne, Paul Medvedev","doi":"10.1186/s13015-024-00254-6","DOIUrl":"10.1186/s13015-024-00254-6","url":null,"abstract":"A colored de Bruijn graph (also called a set of k-mer sets), is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. There have been numerous indexing data structures proposed that allow to store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools and all datasets, with no other tool able to consistently achieve less than 44% space overhead. The software is available at http://github.com/medvedevgroup/ESSColor .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"19 1","pages":"20"},"PeriodicalIF":1.0,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11129398/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141155161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

ESKEMAP: exact sketch-based read mapping ESKEMAP：基于草图的精确读取映射

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-05-04 DOI: 10.1186/s13015-024-00261-7

Tizian Schulz, Paul Medvedev

{"title":"ESKEMAP: exact sketch-based read mapping","authors":"Tizian Schulz, Paul Medvedev","doi":"10.1186/s13015-024-00261-7","DOIUrl":"https://doi.org/10.1186/s13015-024-00261-7","url":null,"abstract":"Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in $$mathcal {O} (|t| + |p| + ell ^2)$$ time and $$mathcal {O} (ell log ell )$$ space, where |t| is the number of $$k$$ -mers inside the sketch of the reference, |p| is the number of $$k$$ -mers inside the read’s sketch and $$ell$$ is the number of times that $$k$$ -mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"18 1","pages":""},"PeriodicalIF":1.0,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140833439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model NestedBD：在出生-死亡模型下从单细胞拷贝数剖面对系统发生树进行贝叶斯推断

IF 1 4区生物学

Algorithms for Molecular Biology Pub Date : 2024-04-29 DOI: 10.1186/s13015-024-00264-4

Yushu Liu, Mohammadamin Edrisi, Zhi Yan, Huw A Ogilvie, Luay Nakhleh

{"title":"NestedBD: Bayesian inference of phylogenetic trees from single-cell copy number profiles under a birth-death model","authors":"Yushu Liu, Mohammadamin Edrisi, Zhi Yan, Huw A Ogilvie, Luay Nakhleh","doi":"10.1186/s13015-024-00264-4","DOIUrl":"https://doi.org/10.1186/s13015-024-00264-4","url":null,"abstract":"Copy number aberrations (CNAs) are ubiquitous in many types of cancer. Inferring CNAs from cancer genomic data could help shed light on the initiation, progression, and potential treatment of cancer. While such data have traditionally been available via “bulk sequencing,” the more recently introduced techniques for single-cell DNA sequencing (scDNAseq) provide the type of data that makes CNA inference possible at the single-cell resolution. We introduce a new birth-death evolutionary model of CNAs and a Bayesian method, NestedBD, for the inference of evolutionary trees (topologies and branch lengths with relative mutation rates) from single-cell data. We evaluated NestedBD’s performance using simulated data sets, benchmarking its accuracy against traditional phylogenetic tools as well as state-of-the-art methods. The results show that NestedBD infers more accurate topologies and branch lengths, and that the birth-death model can improve the accuracy of copy number estimation. And when applied to biological data sets, NestedBD infers plausible evolutionary histories of two colorectal cancer samples. NestedBD is available at https://github.com/Androstane/NestedBD .","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"48 1","pages":""},"PeriodicalIF":1.0,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140833543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0