Algorithms for Molecular Biology: latest articles

Swiftly identifying strongly unique k-mers.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-07-13 DOI: 10.1186/s13015-025-00286-6

Jens Zentgraf, Sven Rahmann

Abstract

Motivation: Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, or in a single species from a larger set of species) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: a single base-pair change may turn a unique k-mer into a different one that is in fact present at one or more other locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept is that of strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in any of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those without one.

Results: We present engineered algorithms to identify and mark, within a set K of (canonical) k-mers, all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billion k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.

Availability: An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers.

Journal article. Algorithms for Molecular Biology 20(1), 13. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257829/pdf/
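The pairwise bit-parallel Hamming distance test mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the 2-bit encoding (A=0, C=1, G=2, T=3), the function names, and the quadratic all-pairs loop are assumptions made for illustration, and the sketch ignores canonical k-mers as well as the sorting and bucketing that make the published method fast.

```python
from itertools import combinations

# Assumed 2-bit encoding; any fixed bijection onto {0, 1, 2, 3} would work.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer: str) -> int:
    """Pack a k-mer into an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | ENCODE[base]
    return code


def hamming_distance(x: int, y: int, k: int) -> int:
    """Number of differing bases between two 2-bit-encoded k-mers."""
    diff = x ^ y
    low_mask = int("01" * k, 2)  # one marker bit per base: 0b0101...01
    # A base differs if either of its two bits differs; fold the high bit
    # of each pair onto the low bit and count the set marker bits.
    per_base = (diff | (diff >> 1)) & low_mask
    return bin(per_base).count("1")


def mark_hd1_neighbors(kmers: list[str]) -> list[bool]:
    """Flag every k-mer that has a Hamming-distance-1 neighbor in the list.

    Quadratic illustration only; assumes distinct k-mers of equal length.
    """
    k = len(kmers[0])
    codes = [encode(x) for x in kmers]
    flagged = [False] * len(kmers)
    for i, j in combinations(range(len(kmers)), 2):
        if hamming_distance(codes[i], codes[j], k) == 1:
            flagged[i] = flagged[j] = True
    return flagged
```

For example, `mark_hd1_neighbors(["ACGT", "ACGA", "TTTT", "GGCC"])` flags only the first two k-mers, which differ in their last base.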

Citations: 0

Anchorage accurately assembles anchor-flanked synthetic long reads.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-07-06 DOI: 10.1186/s13015-025-00288-4

Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao

Abstract

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule, as they can be used to accurately determine its endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequences from such anchor-enabled, ultra-high-coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms that leverage anchors.

We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a k-mer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. Optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to find the optimal path efficiently. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent.

Anchorage is freely available at https://github.com/Shao-Group/anchorage; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test.

Journal article. Algorithms for Molecular Biology 20(1), 12. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12232771/pdf/
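The path objective described in the abstract (maximize the weight of the smallest node on a path between the two anchor nodes, subject to a target total length) can be sketched as a dynamic program over (length, node) states. This is a hedged illustration and not Anchorage's actual algorithm: the graph representation, the function name `best_bottleneck_walk`, and the simplification that each node contributes a fixed positive number of bases (`length[v]`) are all assumptions, and the sketch permits walks that revisit nodes.

```python
def best_bottleneck_walk(succ, weight, length, source, sink, target_len):
    """Return (bottleneck, walk) for the source->sink walk whose node lengths
    sum to exactly target_len and whose smallest node weight is maximal,
    or None if no such walk exists.

    succ:   dict node -> iterable of successor nodes (hypothetical layout)
    weight: dict node -> node weight (e.g. coverage)
    length: dict node -> positive number of bases the node contributes
    """
    if length[source] > target_len:
        return None
    # best[L][v] = (best bottleneck, predecessor) over walks source -> v
    # of total length L. Lengths are positive, so states with larger L
    # depend only on states with smaller L.
    best = [dict() for _ in range(target_len + 1)]
    best[length[source]][source] = (weight[source], None)
    for total in range(target_len + 1):
        for v, (bottleneck, _) in list(best[total].items()):
            for u in succ.get(v, ()):
                new_total = total + length[u]
                if new_total > target_len:
                    continue
                cand = min(bottleneck, weight[u])
                if u not in best[new_total] or cand > best[new_total][u][0]:
                    best[new_total][u] = (cand, v)
    if sink not in best[target_len]:
        return None
    # Follow predecessor pointers back to the source.
    walk, v, total = [], sink, target_len
    while v is not None:
        walk.append(v)
        _, prev = best[total][v]
        total -= length[v]
        v = prev
    return best[target_len][sink][0], walk[::-1]
```

The recurrence is valid because the bottleneck of an extended walk is the minimum of the prefix bottleneck and the new node's weight, so a better prefix never yields a worse extension.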

Citations: 0

Faster computation of left-bounded shortest unique substrings.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-06-20 DOI: 10.1186/s13015-025-00287-5

Larissa L M Aguiar, Felipe A Louza

Citations: 0

Reconstructing rearrangement phylogenies of natural genomes.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-06-07 DOI: 10.1186/s13015-025-00279-5

Leonard Bohnenkämper, Jens Stoye, Daniel Doerr

Abstract

Background: We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location; this is known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large-scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success.

Methods: Building on prior work, we present a highly optimized ILP that is able to solve the SPP for sufficiently small phylogenies and gene families. A notable improvement over the previous result is an optimized way of handling both circular and linear chromosomes. This is especially relevant to the SPP, since the chromosomal structure of ancestral genomes is unknown and the solution space for this chromosomal structure is typically large.

Results: We benchmark our method on simulated and real data. On simulated phylogenies, we observe a considerable performance improvement on problems that include linear chromosomes. Even when the ground truth contains only one circular chromosome per genome, our method outperforms its predecessor due to its optimized handling of the solution space. The practical advantage is also visible in an analysis of seven Anopheles taxa.

Journal article. Algorithms for Molecular Biology 20(1), 10. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12144824/pdf/

Citations: 0

Sama: a contig assembler with correctness guarantee.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-06-03 DOI: 10.1186/s13015-025-00280-y

Leena Salmela

Citations: 0

Estimating similarity and distance using FracMinHash.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-05-15 DOI: 10.1186/s13015-025-00276-8

Mahmudur Rahman Hera, David Koslicki

Abstract

Motivation: The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking.

Theoretical contributions: In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.

Practical contributions: We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task, allowing sketch generation to be sped up using multiple CPU cores, an option lacking in existing serial tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.

Journal article. Algorithms for Molecular Biology 20(1), 8. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082993/pdf/
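To make the sketching idea concrete, the following is a minimal FracMinHash sketch, assuming the common formulation in which a k-mer is kept when its hash falls below a fraction s (the scale factor) of the hash range. It is unrelated to the frac-kmc code: the hash function (BLAKE2b truncated to 64 bits), the function names, and the plain ratio estimates without any bias correction are assumptions chosen for illustration.

```python
import hashlib

MAX_HASH = 2**64  # size of the assumed 64-bit hash range


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash of a k-mer (assumed choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq: str, k: int, scale: float) -> set[int]:
    """Keep the hashes of all k-mers whose hash is below scale * MAX_HASH."""
    threshold = scale * MAX_HASH
    hashes = (kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(sketch_a: set[int], sketch_b: set[int]) -> float:
    """Naive estimate of the containment of A in B from two sketches."""
    return len(sketch_a & sketch_b) / len(sketch_a) if sketch_a else 0.0


def jaccard(sketch_a: set[int], sketch_b: set[int]) -> float:
    """Naive estimate of the Jaccard index from two sketches."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else 0.0
```

Because the same threshold is applied to every sequence, sketches built with the same k and scale are directly comparable, and a sketch built with a smaller scale is always a subset of one built with a larger scale.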

Citations: 0

AlfaPang: alignment free algorithm for pangenome graph construction.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-05-15 DOI: 10.1186/s13015-025-00277-7

Adam Cicherski, Anna Lisiecka, Norbert Dojer

Citations: 0

MCDAG: indexing maximal common subsequences for k strings.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-04-19 DOI: 10.1186/s13015-025-00271-z

Giovanni Buzzega, Alessio Conte, Roberto Grossi, Giulia Punzi

Abstract

Analyzing and comparing sequences of symbols is among the most fundamental problems in computer science, possibly even more so in bioinformatics. Maximal Common Subsequences (MCSs), i.e., inclusion-maximal sequences of non-contiguous symbols common to two or more strings, have only recently received attention in this area, despite being a basic notion and a natural generalization of more common tools like Longest Common Substrings/Subsequences. In this paper we simplify and engineer recent advancements in MCSs into a practical tool called MCDAG, the first publicly available tool that can index MCSs of real genomic data, and show that its definition can be generalized to multiple strings. We demonstrate that our tool can index pairs of sequences exceeding 10,000 base pairs within minutes, utilizing only 4-7% more than the minimum required nodes. For three or more sequences, we observe experimentally that the minimum index may exhibit a significant increase in the number of nodes.

Journal article. Algorithms for Molecular Biology 20(1), 6. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12008955/pdf/
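The notion of an inclusion-maximal common subsequence can be made concrete with a brute-force check. This sketch does not reflect how the indexing tool works; the function names are hypothetical. It relies on the observation that a common subsequence is maximal exactly when no single-character insertion yields another common subsequence.

```python
def is_subsequence(w: str, s: str) -> bool:
    """True if w can be obtained from s by deleting characters."""
    it = iter(s)
    # `ch in it` advances the iterator, so matches are found left to right.
    return all(ch in it for ch in w)


def is_maximal_common_subsequence(w: str, strings: list[str]) -> bool:
    """Check whether w is an inclusion-maximal common subsequence of strings.

    Brute force: try every single-character insertion at every position.
    """
    if not all(is_subsequence(w, s) for s in strings):
        return False
    # Only characters present in every string can extend a common subsequence.
    alphabet = set.intersection(*(set(s) for s in strings))
    for pos in range(len(w) + 1):
        for ch in alphabet:
            extended = w[:pos] + ch + w[pos:]
            if all(is_subsequence(extended, s) for s in strings):
                return False
    return True
```

For the strings "ACB" and "BAC", both "AC" and "B" are maximal common subsequences even though only "AC" is a longest one, which shows why MCSs generalize longest common subsequences.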

Citations: 0

Unbiased anchors for reliable genome-wide synteny detection.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-04-05 DOI: 10.1186/s13015-025-00275-9

Karl K Käther, Andreas Remmel, Steffen Lemke, Peter F Stadler

Citations: 0

The open-closed mod-minimizer algorithm.

IF 1.5, Zone 4 (Biology)

Algorithms for Molecular Biology Pub Date : 2025-03-17 DOI: 10.1186/s13015-025-00270-0

Ragnar Groot Koerkamp, Daniel Liu, Giulio Ermanno Pibiri

Abstract

Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer, which selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w+1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach the optimal density 1/w as k → ∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024).

In this work, we study methods that achieve low density when k ≤ w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and more intuitive. Furthermore, we extend the mod-minimizer to improve the density of any scheme that works well for small k so that it also works well for large k > w. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.

Journal article. Algorithms for Molecular Biology 20(1), 4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11912762/pdf/
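The baseline that the abstract compares against, the random minimizer with density 2/(w+1), can be sketched as follows. This illustrates only the baseline scheme, not the open-closed mod-minimizer; the hash used to simulate a random order (BLAKE2b truncated to 64 bits), the leftmost tie-breaking rule, and the function names are assumptions.

```python
import hashlib
import random


def kmer_hash(kmer: str) -> int:
    """Deterministic 64-bit hash standing in for a random order on k-mers."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def random_minimizer_positions(seq: str, k: int, w: int) -> set[int]:
    """Select, in every window of w consecutive k-mers, the position of the
    k-mer with the smallest hash (leftmost on ties)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        offset = min(range(w), key=window.__getitem__)
        selected.add(start + offset)
    return selected


def density(seq: str, k: int, w: int) -> float:
    """Fraction of k-mer positions selected by the random minimizer."""
    n_kmers = len(seq) - k + 1
    return len(random_minimizer_positions(seq, k, w)) / n_kmers


if __name__ == "__main__":
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    w = 10
    print(f"measured density: {density(seq, 21, w):.4f}")
    print(f"2 / (w + 1):      {2 / (w + 1):.4f}")
```

On a long random sequence with k large enough that repeated k-mers are rare, the measured density should land close to 2/(w+1); schemes such as those in the abstract aim to push it toward the lower bound 1/w.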

Citations: 0