{"title":"Estimating Ancestral States of Complex Characters: a Case Study on the Evolution of Feathers.","authors":"Pierre Cockx, Michael J Benton, Joseph N Keating","doi":"10.1093/sysbio/syaf063","DOIUrl":"https://doi.org/10.1093/sysbio/syaf063","url":null,"abstract":"<p><p>Feathers are a key novelty underpinning the evolutionary success of birds, yet the origin of feathers remains poorly understood. Debates about feather evolution hinge upon whether filamentous integument has evolved once or multiple times independently on the lineage leading to modern birds. These contradictory results stem from methodological differences in statistical ancestral state estimates. Here we conduct a comprehensive comparison of ancestral state estimation methodologies applied to stem-group birds, testing the role of outgroup inclusion, tree time scaling method, model choice and character coding strategy. Models are compared based on their Akaike Information Criteria (AIC), mutual information, as well as the uncertainty of marginal ancestral state estimates. Our results demonstrate that ancestral state estimates of stem-bird integument are strongly influenced by tree time scaling method, outgroup selection and model choice, while character coding strategy seems to have less effect on the ancestral estimates produced. We identify the best fitting and most generalizable models using AIC scores and leave-one-out cross-validation (LOOCV) respectively. Our analyses broadly support the independent origin of filamentous integument in dinosaurs and pterosaurs and support a younger evolutionary origin of feathers than has been suggested previously. In terms of model selection, we observe little correlation between AIC/AICc and LOOCV error, suggesting that, for our dataset, model fit does not reliably predict generalizability. However, both approaches favor models that infer a similar pattern of feather evolution. 
More globally, our study highlights that special care must be taken in selecting the outgroup, tree and model when conducting ASE analyses.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145091583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating waiting distances between genealogy changes under a Multi-Species Extension of the Sequentially Markov Coalescent.","authors":"Patrick F McKenzie, Deren A R Eaton","doi":"10.1093/sysbio/syaf059","DOIUrl":"https://doi.org/10.1093/sysbio/syaf059","url":null,"abstract":"<p><p>Genomes are composed of a mosaic of segments inherited from different ancestors, each separated by past recombination events. Consequently, genealogical relationships among multiple genomes vary spatially across different genomic regions. Genealogical variation among unlinked (uncorrelated) genomic regions is well described for either a single population (coalescent) or multiple structured populations (multispecies coalescent). However, the expected similarity among genealogies at linked regions of a genome is less well characterized. Recently, an analytical solution was derived for the distribution of the waiting distance for a change in the genealogical tree spatially across a genome for a single population with constant effective population size. Here we describe a generalization of this result, in terms of the distribution of waiting distances between changes in genealogical trees and topologies for multiple structured populations with branch-specific effective population sizes (i.e., under the multispecies coalescent). We implemented our model in the Python package ipcoal and validated its accuracy against stochastic coalescent simulations. Using a novel likelihood framework we show that tree and topology-change waiting distances in an ARG can be used to fit species tree model parameters, demonstrating an application of our model for developing new methods for phylogenetic inference. 
The Multi-Species Sequentially Markov Coalescent (MS-SMC) model presented here represents a major advance for linking local ancestry inference to hierarchical demographic models.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145024236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The comparative analysis of lineage-pair traits.","authors":"Sean A S Anderson, Sachin Kaushik, Daniel R Matute","doi":"10.1093/sysbio/syaf061","DOIUrl":"10.1093/sysbio/syaf061","url":null,"abstract":"<p><p>For many questions in ecology and evolution, the most relevant data to consider are attributes of lineage pairs. Comparative tests for causal relationships among traits like 'diet niche overlap', 'divergence time', and 'strength of reproductive isolation (RI)' - measured for pairwise combinations of related species or populations - have led to several groundbreaking insights, but the correct statistical approach for these analyses has never been clear. Lineage-pair traits are non-independent, but unlike the expected covariance among species' traits, which is captured by a phylogenetic covariance matrix arising from a given model, the expected covariance among lineage-pair traits has not been explicitly formulated. Analyses of pairwise-defined data have thus employed untested workarounds for non-independence rather than direct models of lineage-pair covariance, with consequences that are unexplored. Here, we consider how evolutionary relatedness among taxa translates into non-independence among taxonomic pairs. We develop models by which phylogenetic signal in an underlying character generates covariance among pairs in a lineage-pair trait. We incorporate the resulting lineage-pair covariance matrices into modified versions of phylogenetic generalized least squares and a new phylogenetic beta regression for bounded response variables. Both outperform previous approaches in simulation tests. We find that a common heuristic method, node averaging, imparts a greater cost to model performance than does the non-independence it was designed to correct. 
We re-analyze two empirical datasets to find dramatic improvements in model fit and, in the case of avian hybridization data, an even stronger relationship between pair age and RI than is revealed from uncorrected analysis. We finally present a new tool, the R package phylopairs, that allows empiricists to test relationships among pairwise-defined variables in a way that is statistically robust and more straightforward to implement.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145001406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameter Estimation from Phylogenetic Trees Using Neural Networks and Ensemble Learning.","authors":"Tianjian Qin,Koen J van Benthem,Luis Valente,Rampal S Etienne","doi":"10.1093/sysbio/syaf060","DOIUrl":"https://doi.org/10.1093/sysbio/syaf060","url":null,"abstract":"Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation methods (MLE) for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods to estimate parameters of diversification models using deep learning have started to emerge, but how robust neural network methods are at handling the intricate nature of phylogenetic data remains an open question. Here we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times and their summary statistics. Our best-performing ensemble neural network (which adjusts the graph neural network result using a recurrent neural network) delivers estimates faster than MLE and shows less sensitivity to tree size for constant-rate and diversity-dependent speciation scenarios. It performs well compared to an existing convolutional network approach. However, like MLE, our approach still fails to recover parameters precisely under a protracted birth-death process. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of effects shaping it. 
In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. If detectable phylogenetic signals are present, our approach delivers results that are comparable to MLE but without inherent biases.","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":"13 1","pages":""},"PeriodicalIF":6.5,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144960285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parameter Estimation from Phylogenetic Trees Using Neural Networks and Ensemble Learning.","authors":"Tianjian Qin, Koen J van Benthem, Luis Valente, Rampal S Etienne","doi":"10.1093/sysbio/syaf060","DOIUrl":"10.1093/sysbio/syaf060","url":null,"abstract":"<p><p>Species diversification is characterized by speciation and extinction, the rates of which can, under some assumptions, be estimated from time-calibrated phylogenies. However, maximum likelihood estimation methods (MLE) for inferring rates are limited to simpler models and can show bias, particularly in small phylogenies. Likelihood-free methods to estimate parameters of diversification models using deep learning have started to emerge, but how robust neural network methods are at handling the intricate nature of phylogenetic data remains an open question. Here we present a new ensemble neural network approach to estimate diversification parameters from phylogenetic trees that leverages different classes of neural networks (dense neural network, graph neural network, and long short-term memory recurrent network) and simultaneously learns from graph representations of phylogenies, their branching times and their summary statistics. Our best-performing ensemble neural network (which adjusts the graph neural network result using a recurrent neural network) delivers estimates faster than MLE and shows less sensitivity to tree size for constant-rate and diversity-dependent speciation scenarios. It performs well compared to an existing convolutional network approach. However, like MLE, our approach still fails to recover parameters precisely under a protracted birth-death process. Our analysis suggests that the primary limitation to accurate parameter estimation is the amount of information contained within a phylogeny, as indicated by its size and the strength of effects shaping it. 
In cases where MLE is unavailable, our neural network method provides a promising alternative for estimating phylogenetic tree parameters. If detectable phylogenetic signals are present, our approach delivers results that are comparable to MLE but without inherent biases.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144970014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New perspectives in phylogenetic support assessment: using the new Relative Contradiction Index to investigate the phylogenetic controversies in Crocodylia","authors":"Paul Aubier, Valentin Rineau, Jorge Cubo, Stéphane Jouve","doi":"10.1093/sysbio/syaf058","DOIUrl":"https://doi.org/10.1093/sysbio/syaf058","url":null,"abstract":"Numerous tools have been developed since the advent of phylogenetic methods to assess tree robustness. Identifying the degree of contradiction in a phylogenetic matrix, as well as the specific contribution of each taxon and character, is essential for estimating its reliability. In parsimony-based phylogenetic inferences, classically used by paleontologists, a phylogeny results from the interaction of all the characters used in the analysis. Consequently, the support initially provided by the characters in the matrix may differ from that after optimization in the final tree, severing the link between the phylogenetic content of the matrix and that of the final tree. Thus, all methods aimed at measuring support only do so indirectly and the impact of individual characters or taxa can only be assessed after the analysis. Three-taxon analysis (3ta) is a phylogenetic method that can circumvent these issues by precisely measuring the support of targeted characters and/or taxa directly from the phylogenetic matrix. In 3ta, characters are coded as trees and decomposed into three-taxon statements (3ts). The analysis searches for the largest set of non-contradicting 3ts to compute the optimal phylogeny. Because the analysis is a compatibility procedure, not an optimization procedure, character supports on the tree are independent from one another. This enables direct assessment of support from the matrix, providing meaningful insights into the topology of the optimal trees. Moreover, the decomposition of characters into 3ts allows for precise quantification of the impact of the characters/taxa in the results. 
In this study, focusing on Crocodylia (a subject of ongoing debate over recent decades), we use 3ta to measure the support of specific characters and/or taxa in the recently published matrix of Rio and Mannion (2021). This conflict revolves around two competing hypotheses – Longirostres and Brevirostres – supporting a different placement of the Gavialoidea clade. We also introduce here the Relative Contradiction Index (RCI) to evaluate node support, a metric that reflects the degree of contradiction in a matrix between competing cladistic hypotheses, ranging from 0.5 (maximum contradiction) to 1 (no contradiction). We show that although the Longirostres hypothesis is the best-supported, it is strongly challenged by the Brevirostres hypothesis (RCI = 0.62). Furthermore, we find that Tomistominae provides 61% of the supporting evidence for the Longirostres hypothesis, such that, when removed, the matrix supports the Brevirostres hypothesis. Individual tomistomines’ contributions vary only from 2% to 7% of the total support to the Longirostres hypothesis. Finally, we show that characters correlated to longirostry only provide a fraction (22%) of the total support to the Longirostres hypothesis. Thus, our method can quantify the impact of specific characters or taxa on a phylogenetic result. This should prove very useful to phylogeneticists, especi","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":"3 1","pages":""},"PeriodicalIF":6.5,"publicationDate":"2025-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144906112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phylogenetic Resolution and Conflict in the Species-Rich Flowering Plant Family Leguminosae.","authors":"Rong Zhang, Gregory W Stull, Jian-Jun Jin, Yin-Huan Wang, Ying Guo, Zhi-Yun Yang, Hong-Tao Li, Kai-Lun An, Joseph L M Charboneau, Ryan A Folk, Domingos Cardoso, Luciano P de Queiroz, Anne Bruneau, Pamela S Soltis, Douglas E Soltis, Stephen A Smith, De-Zhu Li, Ting-Shuang Yi","doi":"10.1093/sysbio/syaf057","DOIUrl":"10.1093/sysbio/syaf057","url":null,"abstract":"<p><p>The Tree of Life is central to evolutionary biology, yet resolving deep, recalcitrant phylogenetic relationships remains challenging due to complex processes such as incomplete lineage sorting (ILS), hybridization, and polyploidization. Although previous phylogenetic studies have advanced our understanding of Leguminosae (Fabaceae), a species-rich and ecologically diverse family, many deep relationships at the tribal and higher levels remain unresolved. Incorporating newly generated genome skimming data for 231 species with previously issued plastid genomic, mitochondrial genomic and transcriptomic data, we reconstructed a phylogeny of the family using whole plastomes, 39 mitochondrial genes, and 1559 low-copy nuclear genes, achieving dense taxonomic sampling across almost all recognized tribes and major unplaced lineages. Our results supported the monophyly of the six subfamilies and 49 recognized tribes, identified ten clades worthy of recognition as new tribes in subfamily Papilionoideae, and clarified many contentious relationships. However, nuclear-nuclear and cytonuclear conflicts persist at multiple nodes among trees inferred from different datasets and analytical methods. We proposed the most probable resolution for 22 contentious nodes by applying nuclear gene-tree quartet analysis with corroboration from support of nuclear Maximum Likelihood (ML) and ASTRAL trees. 
Our results indicate ILS significantly contributes to observed phylogenetic conflicts, while gene flow represents an additional and previously underappreciated factor that mainly contributes to cytonuclear conflicts, particularly along the branches of the Angylocalyceae + Dipterygeae + Amburaneae (ADA) clade and Wisterieae. These processes likely underlie recalcitrant phylogenetic relationships, such as those within the 50-kb inversion clade of Papilionoideae. Our study uses multiple data partitions and analytical methods to resolve contentious phylogenetic relationships in Leguminosae, resulting in a robust phylogenomic framework to guide further investigations in this economically important and exceptionally diverse family.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144875351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Social environment and the evolution of delayed reproduction in birds.","authors":"Liam U Taylor, Josef C Uyeda, Richard O Prum","doi":"10.1093/sysbio/syaf056","DOIUrl":"10.1093/sysbio/syaf056","url":null,"abstract":"<p><p>One puzzling feature of avian life histories is that individuals in many different lineages delay reproduction for several years after they finish growing. Intraspecific field studies suggest that various complex social environments-such as cooperative breeding groups, nesting colonies, and display leks-result in delayed reproduction because they require forms of sociosexual development that extend beyond physical maturation. Here, we formally propose this hypothesis and use a full suite of phylogenetic comparative methods to test it, analyzing the evolution of age at first reproduction (AFR) in females and males across 963 species of birds. Phylogenetic regressions support increased AFR in colonial females and males, cooperatively breeding males, and lekking males. Continuous Ornstein-Uhlenbeck models support distinct evolutionary regimes with increased AFR for all of cooperative, colonial, and lekking lineages. Discrete hidden state Markov models suggest a net increase in delayed reproduction for social lineages, even when accounting for hidden state heterogeneity and the potential reverse influence of AFR on sociality. Our results support the hypothesis that the evolution of sociality reshapes the dynamics of life history evolution in birds. 
Comparative analyses of even the most broadly generalizable characters, such as AFR, must reckon with unique, heterogeneous, historical events in the evolution of individual lineages.</p>","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":" ","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144837778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ConvexML: Fast and accurate branch length estimation under irreversible mutation models, illustrated through applications to CRISPR/Cas9-based lineage tracing","authors":"Sebastian Prillo, Akshay Ravoor, Nir Yosef, Yun S Song","doi":"10.1093/sysbio/syaf054","DOIUrl":"https://doi.org/10.1093/sysbio/syaf054","url":null,"abstract":"Branch length estimation is a fundamental problem in Statistical Phylogenetics and a core component of tree reconstruction algorithms. Traditionally, general time-reversible mutation models are employed, and many software tools exist for this scenario. With the advent of CRISPR/Cas9 lineage tracing technologies, there has been significant interest in the study of branch length estimation under irreversible mutation models. Under the CRISPR/Cas9 mutation model, irreversible mutations – in the form of DNA insertions or deletions – are accrued during the experiment, which are then read out at the single-cell level to reconstruct the cell lineage tree. However, most of the analyses of CRISPR/Cas9 lineage tracing data have so far been limited to the reconstruction of single-cell tree topologies, which depict lineage relationships between cells, but not the amount of time that has passed between ancestral cell states and the present. Time-resolved trees, known as chronograms, would allow one to study the evolutionary dynamics of cell populations at an unprecedented level of resolution. Indeed, time-resolved trees would reveal the timing of events on the tree, the relative fitness of subclones, and the dynamics underlying phenotypic changes in the cell population – among other important applications. In this work, we introduce the first scalable and accurate method to refine any given single-cell tree topology into a single-cell chronogram by estimating its branch lengths. 
To do this, we perform regularized maximum likelihood estimation under a general irreversible mutation model, paired with a conservative version of maximum parsimony that reconstructs only the ancestral states that we are confident about. To deal with the particularities of CRISPR/Cas9 lineage tracing data – such as double-resection events which affect runs of consecutive sites – we avoid making our model more complex and instead opt for using a simple but effective data encoding scheme. Similarly, we avoid explicitly modeling the missing data mechanisms – such as heritable missing data – by instead assuming that they are missing completely at random. We stabilize estimates in low-information regimes by using a simple penalized version of maximum likelihood estimation (MLE) using a minimum branch length constraint and pseudocounts. All this leads to a convex MLE problem that can be readily solved in seconds with off-the-shelf convex optimization solvers. We benchmark our method using both simulations and real lineage tracing data, and show that it performs well on several tasks, matching or outperforming competing methods such as TiDeTree and LAML in terms of accuracy, while being 10 ∼ 100 × faster. Notably, our statistical model is simpler and more general, as we do not explicitly model the intricacies of CRISPR/Cas9 lineage tracing data. 
In this sense, our contribution is twofold: (1) a fast and robust method for branch length estimation under a general irreversible mutation model, ","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":"1 1","pages":""},"PeriodicalIF":6.5,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144825256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selecting a Window Size for Phylogenomic Analyses of Whole Genome Alignments using AIC","authors":"Jeremias Ivan, Paul Frandsen, Robert Lanfear","doi":"10.1093/sysbio/syaf053","DOIUrl":"https://doi.org/10.1093/sysbio/syaf053","url":null,"abstract":"Gene tree discordance along a set of aligned genomes presents a challenge for phylogenomic methods to identify the non-recombining regions and reconstruct the phylogenetic tree for each region. To address this problem, many studies used the non-overlapping window approach, often with an arbitrary selection of fixed window sizes that potentially include intra-window recombination events. In this study, we propose an information theoretic approach to select a window size that best reflects the underlying histories of the alignment. First, we simulated chromosome alignments that reflected the key characteristics of an empirical dataset and found that the AIC is a good predictor of window size accuracy in correctly recovering the tree topologies of the alignment. To address the issue of missing data in empirical datasets, we designed a stepwise non-overlapping window approach that compares the AIC of two window sizes at a time, retaining only genomic regions that can be analysed using both window sizes. We then applied this method to the genomes of Heliconius butterflies and great apes. We found that the best window sizes for the butterflies’ chromosomes ranged from &lt;125bp to 250bp, which are much shorter than those used in a previous study even though this difference in window size did not significantly change the most common topologies across the genome. On the other hand, the best window sizes for great apes’ chromosomes ranged from 500bp to 1kb with the proportion of the major topology (grouping human and chimpanzee) falling between 60% and 87%, consistent with previous findings. 
Additionally, we observed a notable impact of gene tree estimation error and concatenation when using small and large windows, respectively. For instance, the proportion of the major topology for great apes was 50% when using 250bp windows, but reached almost 100% for 64kb windows. In conclusion, our study highlights the challenges associated with selecting a fixed window size in non-overlapping window analyses and proposes the AIC as a less arbitrary way to select the optimal window size when running non-overlapping method on whole genome alignments.","PeriodicalId":22120,"journal":{"name":"Systematic Biology","volume":"69 1","pages":""},"PeriodicalIF":6.5,"publicationDate":"2025-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144825108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}