Joseph Rusinko, Yu Cai, Allison Crysler, Katherine Thompson, Julien Boutte, Mark Fishbein, Shannon C K Straub
PickMe: Sample selection for species tree reconstruction using coalescent weighted quartets
Systematic Biology. Published 2025-05-05. DOI: 10.1093/sysbio/syaf017 (https://doi.org/10.1093/sysbio/syaf017)
Abstract
After collecting large data sets for phylogenomics studies, researchers must decide which genes or samples to include when reconstructing a species tree. Incomplete or unreliable data sets make the empiricist’s decision more difficult. Researchers rely on ad hoc strategies to maximize sampling while ensuring sufficient data for accurate inferences. An algorithm called PickMe formalizes the sample selection process, assuming that the samples evolved under the Tree Multispecies Coalescent model. We propose a Bayesian framework for selecting samples for species tree analysis. Given a collection of gene trees, we compute a posterior probability for each quartet, describing the likelihood that the species tree displays this topology. From this, we assign individual samples reliability scores computed as the average of a scaled version of the posterior probabilities. PickMe uses these weights to recommend which samples to include in a species tree analysis. Analysis of simulated data showed that including the samples suggested by PickMe produced species trees closer to the true species trees than both unfiltered data sets and data sets with ad hoc gene occupancy cut-offs applied. To further illustrate the efficacy of this tool, we apply PickMe to gene trees generated from target capture data from milkweeds. PickMe indicates more samples could have reliably been included in a previous milkweed phylogenomic analysis than the authors analyzed without access to a formal methodology for sample selection. Using simulated and empirical data, we also compare PickMe to existing sample selection methods. Inclusion of PickMe will enhance phylogenomics data analysis pipelines by providing a formal structure for sample selection.
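The abstract does not state PickMe's formulas, so the following is only an illustrative sketch of the general idea (quartet posteriors averaged into per-sample scores), not the authors' implementation. It assumes a uniform prior over the three quartet topologies, a fixed internal branch length `t` in coalescent units, the standard multispecies coalescent quartet probabilities (a gene tree matches the species quartet with probability 1 - (2/3)e^(-t)), and a linear rescaling of the largest posterior from [1/3, 1] to [0, 1]. The function names `quartet_posterior` and `sample_scores` and the input layout are hypothetical.

```python
from math import exp, log


def quartet_posterior(counts, t=1.0):
    """Posterior over the three topologies of one quartet.

    counts: numbers of gene trees displaying each of the three topologies.
    t: ASSUMED internal branch length (coalescent units); not from the paper.
    Uses a uniform prior and a multinomial likelihood.
    """
    p_match = 1.0 - (2.0 / 3.0) * exp(-t)  # gene tree agrees with species quartet
    p_other = (1.0 / 3.0) * exp(-t)        # each of the two discordant topologies
    log_liks = []
    for i in range(3):
        # Log-likelihood of the counts if topology i is the species quartet.
        ll = sum(n * log(p_match if j == i else p_other)
                 for j, n in enumerate(counts))
        log_liks.append(ll)
    # Normalize in log space for numerical stability.
    top = max(log_liks)
    weights = [exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


def sample_scores(quartet_counts, t=1.0):
    """Average scaled posterior over all quartets containing each sample.

    quartet_counts: dict mapping a 4-tuple of sample names to a 3-tuple of
    topology counts (hypothetical input format).
    """
    totals, seen = {}, {}
    for quartet, counts in quartet_counts.items():
        best = max(quartet_posterior(counts, t))
        # ASSUMED scaling: 1/3 (no signal) maps to 0, 1 (certain) maps to 1.
        scaled = (best - 1.0 / 3.0) / (2.0 / 3.0)
        for sample in quartet:
            totals[sample] = totals.get(sample, 0.0) + scaled
            seen[sample] = seen.get(sample, 0) + 1
    return {s: totals[s] / seen[s] for s in totals}


if __name__ == "__main__":
    counts = {
        ("A", "B", "C", "D"): (10, 0, 0),  # strongly resolved quartet
        ("A", "B", "C", "E"): (4, 3, 3),   # weakly resolved quartet
    }
    for sample, score in sorted(sample_scores(counts).items()):
        print(sample, round(score, 3))
```

In this toy input, sample E appears only in the weakly resolved quartet and so receives a lower score than D; a threshold on such scores is one way a recommendation step could be built, though the cut-off PickMe actually uses is not given in the abstract.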
About the journal:
Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.