Estimating Genome-wide Phylogenies Using Probabilistic Topic Modeling

Impact factor 6.1 · Zone 1 (Biology) · Q1 Evolutionary Biology
Marzieh Khodaei, Scott V Edwards, Peter Beerli
Published in Systematic Biology, 2025-05-05.
DOI: 10.1093/sysbio/syaf015 (https://doi.org/10.1093/sysbio/syaf015)

Abstract

Methods for rapidly inferring the evolutionary history of species or populations with genome-wide data are progressing, but computational constraints still limit our abilities in this area. We developed an alignment-free method to infer genome-wide phylogenies and implemented it in the Python package TopicContml. The method uses probabilistic topic modeling (specifically, Latent Dirichlet Allocation, or LDA) to extract ‘topic’ frequencies from k-mers, which are derived from multilocus DNA sequences. These extracted frequencies then serve as input for the program Contml in the PHYLIP package, which is used to generate a species tree. We evaluated the performance of TopicContml on simulated datasets with gaps and three biological datasets: (1) 14 DNA sequence loci from two Australian bird species distributed across nine populations, (2) 5162 loci from 80 mammal species, and (3) raw, unaligned, non-orthologous PacBio sequences from 12 bird species. We also assessed the uncertainty of the estimated relationships among clades using a bootstrap procedure. Our empirical results and simulated data suggest that our method is efficient and statistically robust.
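The pipeline the abstract describes (k-mers, then LDA topic frequencies, then a Contml input file) can be illustrated with a minimal sketch. This is not the TopicContml implementation. The toy sequences, the function names (kmers, lda_gibbs, contml_infile), the choice to pool all loci of a taxon into one document, the small collapsed Gibbs sampler standing in for LDA, and the continuous-characters layout used for the Contml input are all assumptions made for illustration.

```python
import random

def kmers(seq, k):
    """Overlapping k-mers of seq; k-mers containing gaps or ambiguity codes are skipped."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= set("ACGT")]

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Small collapsed Gibbs sampler for LDA (illustrative stand-in, not the
    package's LDA). Returns one topic-frequency row per document."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every k-mer occurrence.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][vocab[w]] += 1
            topic_total[t] += 1
        assign.append(row)
    # Resample each assignment conditional on all the others.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = assign[d][i], vocab[w]
                doc_topic[d][t] -= 1
                topic_word[t][wi] -= 1
                topic_total[t] -= 1
                weights = [(doc_topic[d][j] + alpha) * (topic_word[j][wi] + beta)
                           / (topic_total[j] + n_words * beta)
                           for j in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wi] += 1
                topic_total[t] += 1
    # Smoothed, normalized topic frequencies (each row sums to 1).
    return [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
            for row, doc in zip(doc_topic, docs)]

def contml_infile(names, rows):
    """Format frequencies as text in the assumed continuous-characters layout:
    a header with taxon and character counts, then 10-character padded names."""
    lines = [f"{len(names):5d} {len(rows[0]):5d}"]
    for name, row in zip(names, rows):
        lines.append(f"{name[:10]:<10}" + " ".join(f"{x:.5f}" for x in row))
    return "\n".join(lines) + "\n"

# Hypothetical toy data: unaligned sequences from two loci per taxon.
taxa = {
    "pop_A": ["ACGTACGTTGCA", "GGCATTACG-TA"],
    "pop_B": ["ACGTACGATGCA", "GGCATAACGTTA"],
    "pop_C": ["TTGACCGTAGGA", "CCGATTNGACTA"],
}
k, n_topics = 4, 3
# One "document" per taxon: the k-mers of all its loci pooled together.
docs = [[km for seq in seqs for km in kmers(seq, k)] for seqs in taxa.values()]
theta = lda_gibbs(docs, n_topics)
infile_text = contml_infile(list(taxa), theta)
print(infile_text)
```

Because the documents are bags of k-mers, no alignment is needed and sequences of different lengths can be mixed. A bootstrap in the spirit of the abstract could resample loci (or k-mers) with replacement and repeat these steps, although the resampling unit used by the authors is not stated here.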
Source journal: Systematic Biology (Biology, Evolutionary Biology)
CiteScore: 13.00 · Self-citation rate: 7.70% · Articles published: 70 · Review time: 6-12 weeks

About the journal: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.