Reference Genome Choice and Filtering Thresholds Jointly Influence Phylogenomic Analyses.

IF 5.7 | Zone 1 (Biology) | Q1 EVOLUTIONARY BIOLOGY

Systematic Biology Pub Date : 2024-05-27 DOI:10.1093/sysbio/syad065

Jessica A Rick, Chad D Brock, Alexander L Lewanski, Jimena Golcher-Benavides, Catherine E Wagner

Systematic Biology, pp. 76-101. Journal Article.

Citations: 0

Abstract

Molecular phylogenies are a cornerstone of modern comparative biology and are commonly employed to investigate a range of biological phenomena, such as diversification rates, patterns in trait evolution, biogeography, and community assembly. Recent work has demonstrated that significant biases may be introduced into downstream phylogenetic analyses from processing genomic data; however, it remains unclear whether there are interactions among bioinformatic parameters or biases introduced through the choice of reference genome for sequence alignment and variant calling. We address these knowledge gaps by employing a combination of simulated and empirical data sets to investigate the extent to which the choice of reference genome in upstream bioinformatic processing of genomic data influences phylogenetic inference, as well as the way that reference genome choice interacts with bioinformatic filtering choices and phylogenetic inference method. We demonstrate that more stringent minor allele filters bias inferred trees away from the true species tree topology, and that these biased trees tend to be more imbalanced and have a higher center of gravity than the true trees. We find the greatest topological accuracy when filtering sites for minor allele count (MAC) >3-4 in our 51-taxa data sets, while tree center of gravity was closest to the true value when filtering for sites with MAC >1-2. In contrast, filtering for missing data increased accuracy in the inferred topologies; however, this effect was small in comparison to the effect of minor allele filters and may be undesirable due to a subsequent mutation spectrum distortion. The bias introduced by these filters differs based on the reference genome used in short read alignment, providing further support that choosing a reference genome for alignment is an important bioinformatic decision with implications for downstream analyses. 
These results demonstrate that attributes of the study system and dataset (and their interaction) add important nuance for how best to assemble and filter short-read genomic data for phylogenetic inference.
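
To make the two filter types named in the abstract concrete, here is a minimal Python sketch of a minor allele count (MAC) filter combined with a missing-data filter. It is an illustration only, not the authors' pipeline: the genotype encoding (diploid, biallelic sites stored as alternate-allele counts 0, 1, or 2, with `None` for a missing call), the function names, and the `mac_greater_than` and `max_missing` parameters are all assumptions introduced for this example.

```python
from typing import List, Optional, Sequence

# Assumed encoding: 0, 1, or 2 copies of the alternate allele per diploid
# sample; None marks a missing genotype call.
Genotype = Optional[int]


def minor_allele_count(site: Sequence[Genotype]) -> int:
    """Count copies of the rarer allele among called genotypes at one site."""
    called = [g for g in site if g is not None]
    alt_copies = sum(called)
    total_copies = 2 * len(called)  # two allele copies per diploid sample
    return min(alt_copies, total_copies - alt_copies)


def missing_fraction(site: Sequence[Genotype]) -> float:
    """Fraction of samples with no genotype call at one site."""
    return sum(g is None for g in site) / len(site)


def filter_sites(
    sites: Sequence[Sequence[Genotype]],
    mac_greater_than: int,
    max_missing: float,
) -> List[int]:
    """Return indices of sites with MAC > mac_greater_than and
    missing fraction <= max_missing."""
    kept = []
    for index, site in enumerate(sites):
        if len(site) == 0:
            continue  # nothing to evaluate at an empty site
        if (minor_allele_count(site) > mac_greater_than
                and missing_fraction(site) <= max_missing):
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy data: four sites scored in six samples (hypothetical values).
    toy_sites = [
        [0, 0, 0, 1, 0, 0],           # MAC 1, no missing data
        [1, 1, 2, 0, 0, 0],           # MAC 4, no missing data
        [2, 2, 2, 2, 1, None],        # MAC 1, one missing call
        [0, 1, None, None, 2, 1],     # MAC 4, two missing calls
    ]
    print(filter_sites(toy_sites, mac_greater_than=3, max_missing=0.25))
```

The strict inequality mirrors the "MAC >" wording in the abstract. Raising `mac_greater_than` discards rare variants, while lowering `max_missing` discards poorly sampled sites, so the two thresholds trim different parts of the data set and can be varied independently.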


Source journal

Systematic Biology (Biology: Evolutionary Biology)

CiteScore

13.00

Self-citation rate

7.70%

Articles published

Review time

6-12 weeks

Journal description: Systematic Biology is the bimonthly journal of the Society of Systematic Biologists. Papers for the journal are original contributions to the theory, principles, and methods of systematics as well as phylogeny, evolution, morphology, biogeography, paleontology, genetics, and the classification of all living things. A Points of View section offers a forum for discussion, while book reviews and announcements of general interest are also featured.