Anna Lindeberg, Guillaume E Scholz, Nicolas Wieseke, Marc Hellmuth
{"title":"Orthology and near-cographs in the context of phylogenetic networks.","authors":"Anna Lindeberg, Guillaume E Scholz, Nicolas Wieseke, Marc Hellmuth","doi":"10.1186/s13015-025-00285-7","DOIUrl":"10.1186/s13015-025-00285-7","url":null,"abstract":"<p><p>Orthologous genes, which arise through speciation, play a key role in comparative genomics and functional inference. In particular, graph-based methods allow for the inference of orthology estimates without prior knowledge of the underlying gene or species trees. This results in orthology graphs, where each vertex represents a gene, and an edge exists between two vertices if the corresponding genes are estimated to be orthologs. Orthology graphs inferred under a tree-like evolutionary model must be cographs. However, real-world data often deviate from this property, either due to noise in the data, errors in inference methods or, simply, because evolution follows a network-like rather than a tree-like process. The latter, in particular, raises the question of whether and how orthology graphs can be derived from or, equivalently, are explained by phylogenetic networks. In this work, we study the constraints imposed on orthology graphs when the underlying evolutionary history follows a phylogenetic network instead of a tree. We show that any orthology graph can be represented by a sufficiently complex level-k network. However, such networks lack biologically meaningful constraints. In contrast, level-1 networks provide a simpler explanation, and we establish characterizations for level-1 explainable orthology graphs, i.e., those derived from level-1 evolutionary histories. To this end, we employ modular decomposition, a classical technique for studying graph structures. Specifically, an arbitrary graph is level-1 explainable if and only if each primitive subgraph is a near-cograph (a graph in which the removal of a single vertex results in a cograph). Additionally, we present a linear-time algorithm to recognize level-1 explainable orthology graphs and to construct a level-1 network that explains them, if such a network exists. Finally, we demonstrate the close relationship of level-1 explainable orthology graphs to the substitution operation, weakly chordal and perfect graphs, as well as graphs with twin-width at most 2.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"19"},"PeriodicalIF":1.7,"publicationDate":"2025-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490074/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145214285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding maximum common contractions between phylogenetic networks.","authors":"Bertrand Marchand, Nadia Tahiri, Shohreh Golpaigani Fard, Olivier Tremblay-Savard, Manuel Lafond","doi":"10.1186/s13015-025-00283-9","DOIUrl":"10.1186/s13015-025-00283-9","url":null,"abstract":"<p><p>In this paper, we lay the groundwork on the comparison of phylogenetic networks based on edge contractions and expansions as edit operations, as originally proposed by Robinson and Foulds to compare trees. We prove that these operations connect the space of all phylogenetic networks on the same set of leaves, even if we forbid contractions that create cycles. This allows to define an operational distance on this space, as the minimum number of contractions and expansions required to transform one network into another. We highlight the difference between this distance and the computation of the maximum common contraction between two networks. Given its ability to outline a common structure between them, which can provide valuable biological insights, we study the algorithmic aspects of the latter. We first prove that computing a maximum common contraction between two networks is NP-hard, even when the maximum degree, the size of the common contraction, or the number of leaves is bounded. We also provide lower bounds to the problem based on the Exponential-Time Hypothesis. Nonetheless, we do provide a polynomial-time algorithm for weakly galled trees, a generalization of galled trees.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"18"},"PeriodicalIF":1.7,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12490124/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145208207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exclusive functional signatures for gene annotation with vast OpenOrd layout.","authors":"Gleb Buzanov, Vsevolod Makeev","doi":"10.1186/s13015-025-00290-w","DOIUrl":"10.1186/s13015-025-00290-w","url":null,"abstract":"<p><p>A biological study can produce a limited number of marker genes, not large enough to be used in gene set enrichment analysis. Here we suggest VOL-Gene, a graph-based algorithm that partitions all genes into non-overlapping classes of functionally related genes, thus assigning a single function to each gene. To this end, many functional signatures are combined into a single weighted graph, which is partitioned into cliques. For a poorly annotated marker gene, our approach fetches a number of genes that belong to the same class, some of which can be well annotated and are likely to take part in the same biological process.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"17"},"PeriodicalIF":1.7,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12482410/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145193964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alitzel López Sánchez, José Antonio Ramírez-Rafael, Alejandro Flores-Lamas, Maribel Hernández-Rosales, Manuel Lafond
{"title":"The path-label reconciliation (PLR) dissimilarity measure for gene trees.","authors":"Alitzel López Sánchez, José Antonio Ramírez-Rafael, Alejandro Flores-Lamas, Maribel Hernández-Rosales, Manuel Lafond","doi":"10.1186/s13015-025-00284-8","DOIUrl":"10.1186/s13015-025-00284-8","url":null,"abstract":"<p><strong>Background: </strong>In this study, we investigate the problem of comparing gene trees reconciled with the same species tree using a novel semi-metric, called the Path-Label Reconciliation (PLR) dissimilarity measure. This approach not only quantifies differences in the topology of reconciled gene trees, but also considers discrepancies in predicted ancestral gene-species maps and speciation/duplication events, offering a refinement of existing metrics such as Robinson-Foulds (RF) and their labeled extensions LRF and ELRF. A tunable parameter <math><mi>α</mi></math> also allows users to adjust the balance between its species map and event labeling components.</p><p><strong>Our contributions: </strong>We show that PLR can be computed in linear time and that it is a semi-metric. We also discuss the diameters of reconciled gene tree measures, which are important in practice for normalization, and provide initial bounds on PLR, LRF, and ELRF. To validate PLR, we simulate reconciliations and perform comparisons with LRF and ELRF. The results show that PLR provides a more evenly distributed range of distances, making it less susceptible to overestimating differences in the presence of small topological changes, while at the same time being computationally efficient. We also apply our measure to evaluate the set of possible rootings of gene trees against a gold standard, and demonstrate that our measure is better at distinguishing one best gene tree among multiple candidates. Furthermore, our findings suggest that the theoretical diameter is rarely reached in practice. The PLR measure advances phylogenetic reconciliation by combining theoretical rigor with practical applicability. Future research will refine its mathematical properties, explore its performance on different types of trees, and integrate it with existing bioinformatics tools for large-scale evolutionary analyses. The implementation of the PLR distance is available in the open-source PyPI package parle: https://pypi.org/project/parle/ .</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"16"},"PeriodicalIF":1.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12366074/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier
{"title":"b-move: faster lossless approximate pattern matching in a run-length compressed index.","authors":"Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, Jan Fostier","doi":"10.1186/s13015-025-00281-x","DOIUrl":"10.1186/s13015-025-00281-x","url":null,"abstract":"<p><strong>Background: </strong>Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns.</p><p><strong>Results: </strong>We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient, lossless approximate pattern matching in run-length compressed space. It achieves bidirectional character extensions up to 7 times faster than the br-index, closing the performance gap with FM-index-based alternatives. For locating occurrences, b-move performs <math><mi>ϕ</mi></math> and <math><msup><mi>ϕ</mi> <mrow><mo>-</mo> <mn>1</mn></mrow> </msup> </math> operations up to 7 times faster than the br-index. At the same time, it maintains the favorable memory characteristics of the br-index, for example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop.</p><p><strong>Conclusions: </strong>b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"15"},"PeriodicalIF":1.7,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12345024/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144838492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Elizabeth S Allman, Hector Baños, John A Rhodes, Kristina Wicke
{"title":"NANUQ<sup>+</sup>: A divide-and-conquer approach to network estimation.","authors":"Elizabeth S Allman, Hector Baños, John A Rhodes, Kristina Wicke","doi":"10.1186/s13015-025-00274-w","DOIUrl":"10.1186/s13015-025-00274-w","url":null,"abstract":"<p><p>Inference of a species network from genomic data remains a difficult problem, with recent progress mostly limited to the level-1 case. However, inference of the Tree of Blobs of a network, showing only the network's cut edges, can be performed for any network by TINNiK, suggesting a divide-and-conquer approach to network inference where the tree's multifurcations are individually resolved to give more detailed structure. Here we develop a method, <math><msup><mtext>NANUQ</mtext> <mo>+</mo></msup> </math> , to quickly perform such a level-1 resolution. Viewed as part of the NANUQ pipeline for fast level-1 inference, this gives tools for both understanding when the level-1 assumption is likely to be met and for exploring all highly-supported resolutions to cycles.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"14"},"PeriodicalIF":1.7,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12297685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144719080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Swiftly identifying strongly unique k-mers.","authors":"Jens Zentgraf, Sven Rahmann","doi":"10.1186/s13015-025-00286-6","DOIUrl":"10.1186/s13015-025-00286-6","url":null,"abstract":"<p><strong>Motivation: </strong>Short DNA sequences of length k that appear in a single location (e.g., at a single genomic position, in a single species from a larger set of species, etc.) are called unique k-mers. They are useful for placing sequenced DNA fragments at the correct location without computing alignments and without ambiguity. However, they are not necessarily robust: A single basepair change may turn a unique k-mer into a different one that may in fact be present at one or more different locations, which may give confusing or contradictory information when attempting to place a read by its k-mer content. A more robust concept are strongly unique k-mers, i.e., unique k-mers for which no Hamming-distance-1 neighbor with conflicting information exists in all of the considered sequences. Given a set of k-mers, it is therefore of interest to have an efficient method that can distinguish k-mers with a Hamming-distance-1 neighbor in the collection from those that do not.</p><p><strong>Results: </strong>We present engineered algorithms to identify and mark within a set K of (canonical) k-mers all elements that have a Hamming-distance-1 neighbor in the same set. One algorithm is based on recursively running a 4-way comparison on sub-intervals of the sorted set. The other algorithm is based on bucketing and running a pairwise bit-parallel Hamming distance test on small buckets of the sorted set. Both methods consider canonical k-mers (i.e., taking reverse complements into account) and allow for efficient parallelization. The methods have been implemented and applied in practice to sets consisting of several billions of k-mers. An optimized combined approach running with 16 threads on a 16-core workstation yields wall times below 20 seconds on the 2.5 billion distinct 31-mers of the human telomere-to-telomere reference genome.</p><p><strong>Availability: </strong>An implementation can be found at https://gitlab.com/rahmannlab/strong-k-mers .</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"13"},"PeriodicalIF":1.5,"publicationDate":"2025-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12257829/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144627683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao
{"title":"Anchorage accurately assembles anchor-flanked synthetic long reads.","authors":"Xiaofei Carl Zang, Xiang Li, Kyle Metcalfe, Tuval Ben-Yehezkel, Ryan Kelley, Mingfu Shao","doi":"10.1186/s13015-025-00288-4","DOIUrl":"10.1186/s13015-025-00288-4","url":null,"abstract":"<p><p>Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technology is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol. LoopSeq Solo also achieves ultra-high sequencing depth and high purity of short reads covering the entire captured molecule. Despite the availability of many assembly methods, constructing full-length sequence from these anchor-enabled, ultra-high coverage sequencing data remains challenging due to the complexity of the underlying assembly graphs and the lack of specific algorithms leveraging anchors. We present Anchorage, a novel assembler that performs anchor-guided assembly for ultra-high-depth sequencing data. Anchorage starts with a kmer-based approach for precise estimation of molecule lengths. It then formulates the assembly problem as finding an optimal path that connects the two nodes determined by anchors in the underlying compact de Bruijn graph. The optimality is defined as maximizing the weight of the smallest node while matching the estimated sequence length. Anchorage uses a modified dynamic programming algorithm to efficiently find the optimal path. Through both simulations and real data, we show that Anchorage outperforms existing assembly methods, particularly in the presence of sequencing artifacts. Anchorage fills the gap in assembling anchor-enabled data. We anticipate its broad use as anchor-enabled sequencing technologies become prevalent. Anchorage is freely available at https://github.com/Shao-Group/anchorage ; the scripts and documents that can reproduce all experiments in this manuscript are available at https://github.com/Shao-Group/anchorage-test .</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"12"},"PeriodicalIF":1.5,"publicationDate":"2025-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12232771/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144576879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster computation of left-bounded shortest unique substrings.","authors":"Larissa L M Aguiar, Felipe A Louza","doi":"10.1186/s13015-025-00287-5","DOIUrl":"10.1186/s13015-025-00287-5","url":null,"abstract":"<p><p>Finding shortest unique substrings (SUS) is a fundamental problem in string processing with applications in bioinformatics. In this paper, we present an algorithm for solving a variant of the SUS problem, the left-bounded shortest unique substrings (LSUS). This variant is particularly important in applications such as PCR primer design. Our algorithm runs in O(n) time using 2n memory words plus n bytes for an input string of length n. Experimental results with real and artificial datasets show that our algorithm is the fastest alternative in practice, being two times faster (on the average) than related works, while using a similar peak memory footprint.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"11"},"PeriodicalIF":1.5,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12181909/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144337195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconstructing rearrangement phylogenies of natural genomes.","authors":"Leonard Bohnenkämper, Jens Stoye, Daniel Doerr","doi":"10.1186/s13015-025-00279-5","DOIUrl":"10.1186/s13015-025-00279-5","url":null,"abstract":"<p><strong>Background: </strong>We study the classical problem of inferring ancestral genomes from a set of extant genomes under a given phylogeny, known as the Small Parsimony Problem (SPP). Genomes are represented as sequences of oriented markers, organized in one or more linear or circular chromosomes. Any marker may appear in several copies, without restriction on orientation or genomic location, known as the natural genomes model. Evolutionary events along the branches of the phylogeny encompass large scale rearrangements, including segmental inversions, translocations, gain and loss (DCJ-indel model). Even under simpler rearrangement models, such as the classical breakpoint model without duplicates, the SPP is computationally intractable. Nevertheless, the SPP for natural genomes under the DCJ-indel model has been studied recently, with limited success.</p><p><strong>Methods: </strong>Building on prior work, we present a highly optimized ILP that is able to solve the SPP for sufficiently small phylogenies and gene families. A notable improvement w.r.t. the previous result is an optimized way of handling both circular and linear chromosomes. This is especially relevant to the SPP, since the chromosomal structure of ancestral genomes is unknown and the solution space for this chromosomal structure is typically large.</p><p><strong>Results: </strong>We benchmark our method on simulated and real data. On simulated phylogenies we observe a considerable performance improvement on problems that include linear chromosomes. And even when the ground truth contains only one circular chromosome per genome, our method outperforms its predecessor due to its optimized handling of the solution space. The practical advantage becomes also visible in an analysis of seven Anopheles taxa.</p>","PeriodicalId":50823,"journal":{"name":"Algorithms for Molecular Biology","volume":"20 1","pages":"10"},"PeriodicalIF":1.5,"publicationDate":"2025-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12144824/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144250682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}