Authors: Jeremias Ivan, Paul Frandsen, Robert Lanfear
Journal: Systematic Biology (Journal Article), published 2025-08-12
DOI: 10.1093/sysbio/syaf053 (https://doi.org/10.1093/sysbio/syaf053)
Selecting a Window Size for Phylogenomic Analyses of Whole Genome Alignments using AIC
Gene tree discordance along a set of aligned genomes presents a challenge for phylogenomic methods to identify the non-recombining regions and reconstruct the phylogenetic tree for each region. To address this problem, many studies used the non-overlapping window approach, often with an arbitrary selection of fixed window sizes that potentially include intra-window recombination events. In this study, we propose an information theoretic approach to select a window size that best reflects the underlying histories of the alignment. First, we simulated chromosome alignments that reflected the key characteristics of an empirical dataset and found that the AIC is a good predictor of window size accuracy in correctly recovering the tree topologies of the alignment. To address the issue of missing data in empirical datasets, we designed a stepwise non-overlapping window approach that compares the AIC of two window sizes at a time, retaining only genomic regions that can be analysed using both window sizes. We then applied this method to the genomes of Heliconius butterflies and great apes. We found that the best window sizes for the butterflies’ chromosomes ranged from <125bp to 250bp, which are much shorter than those used in a previous study even though this difference in window size did not significantly change the most common topologies across the genome. On the other hand, the best window sizes for great apes’ chromosomes ranged from 500bp to 1kb with the proportion of the major topology (grouping human and chimpanzee) falling between 60% and 87%, consistent with previous findings. Additionally, we observed a notable impact of gene tree estimation error and concatenation when using small and large windows, respectively. For instance, the proportion of the major topology for great apes was 50% when using 250bp windows, but reached almost 100% for 64kb windows. 
In conclusion, our study highlights the challenges associated with selecting a fixed window size in non-overlapping window analyses and proposes the AIC as a less arbitrary way to select the optimal window size when running the non-overlapping window method on whole-genome alignments.
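The pairwise comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each window has already been fitted independently (so log-likelihoods and free-parameter counts add across windows), uses the standard definition AIC = 2k - 2 ln L, and treats a large window as "analysable by both sizes" only when nested small windows tile it completely. The names `WindowFit`, `restrict_to_shared`, and `compare_sizes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowFit:
    """Hypothetical record of one fitted window (names are illustrative)."""
    start: int             # 0-based start coordinate
    end: int               # exclusive end coordinate
    log_likelihood: float  # maximised ln L of the window's tree and model
    n_params: int          # free parameters estimated for this window


def aic(log_likelihood: float, n_params: int) -> float:
    """Standard Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def total_aic(fits):
    """Assumption: windows are independent, so ln L and k sum across them."""
    lnl = sum(f.log_likelihood for f in fits)
    k = sum(f.n_params for f in fits)
    return aic(lnl, k)


def restrict_to_shared(large, small):
    """Keep only regions that both window sizes could analyse.

    A large window is retained when the small windows nested inside it
    cover its full length; those small windows are retained with it.
    """
    kept_large, kept_small = [], []
    for big in large:
        nested = [w for w in small if big.start <= w.start and w.end <= big.end]
        covered = sum(w.end - w.start for w in nested)
        if covered == big.end - big.start:
            kept_large.append(big)
            kept_small.extend(nested)
    return kept_large, kept_small


def compare_sizes(large, small):
    """Return (AIC of large windows, AIC of small windows) on shared regions."""
    kept_large, kept_small = restrict_to_shared(large, small)
    return total_aic(kept_large), total_aic(kept_small)
```

With made-up numbers, a 200bp window with ln L = -500 and 10 parameters scores an AIC of 1020, whereas two nested 100bp windows with ln L = -240 and -245 and 10 parameters each score 1010, so the smaller size would be preferred for that region. A second large window whose right half has no fitted small window (for example because of missing data) is excluded from both totals.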
About the journal:
Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.