Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling
Marzieh Khodaei, Scott V Edwards, Peter Beerli
Systematic Biology, published 2025-05-05
DOI: https://doi.org/10.1093/sysbio/syaf015
Abstract
Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation or LDA) to extract ‘topic’ frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as an input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous PacBio sequences from 12 bird species. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure. Our empirical results and simulated data suggest that our method is efficient and statistically robust.
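The k-mer and topic-frequency steps described in the abstract can be illustrated with a minimal sketch. This is not the TopicContml implementation: the function names (`kmers`, `build_documents`, `lda_gibbs`), the toy sequences, the rule of skipping k-mers that contain gaps, and the hyperparameters are all assumptions made for illustration, and a small collapsed Gibbs sampler stands in for whatever LDA routine the package actually uses.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return overlapping k-mers of seq, skipping any with non-ACGT symbols (assumed gap rule)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")]


def build_documents(loci, k):
    """Pool k-mers over loci so that each label (species or population) is one 'document'."""
    docs = {}
    for locus in loci:
        for label, seq in locus.items():
            docs.setdefault(label, []).extend(kmers(seq, k))
    return docs


def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns {label: [topic frequencies]}."""
    rng = random.Random(seed)
    vocab_size = len({w for words in docs.values() for w in words})
    doc_topic = {d: [0] * n_topics for d in docs}
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = {}

    # Random initial topic assignment for every k-mer occurrence.
    for d, words in docs.items():
        assign[d] = []
        for w in words:
            z = rng.randrange(n_topics)
            assign[d].append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(n_iter):
        for d, words in docs.items():
            for i, w in enumerate(words):
                # Remove the current assignment, then resample it from the conditional.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic frequencies; each list sums to 1.
    freqs = {}
    for d, words in docs.items():
        total = len(words) + n_topics * alpha
        freqs[d] = [(doc_topic[d][t] + alpha) / total for t in range(n_topics)]
    return freqs


# Hypothetical toy data: two loci, three populations, one gapped sequence.
loci = [
    {"popA": "ACGTACGTAC", "popB": "ACGTAC-TAC", "popC": "TTGGCCAATT"},
    {"popA": "GGGTTTAAAC", "popB": "GGGTTTAAAC", "popC": "CCCAAATTTG"},
]
docs = build_documents(loci, k=3)
freqs = lda_gibbs(docs, n_topics=2)
for label, f in freqs.items():
    print(label, [round(x, 3) for x in f])
```

In the workflow the abstract describes, per-label frequency vectors of this kind are what would be passed on to Contml as input; writing the PHYLIP input file is left out of the sketch.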
About the journal:
Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.