Evaluation of ancient DNA imputation: a simulation study

Mariana Escobar-Rodríguez, K. Veeramah
Journal: Human Population Genetics and Genomics
Published: 2024-01-05
DOI: https://doi.org/10.47248/hpgg2404010002

Abstract

Ancient genomic data are becoming increasingly available thanks to recent advances in high-throughput sequencing technologies. Yet post-mortem degradation of endogenous ancient DNA often results in low depth of coverage and, consequently, high levels of genotype missingness and uncertainty. Genotype imputation is a potential strategy for increasing the information available in ancient DNA samples and thus improving the power of downstream population genetic analyses. However, the performance of genotype imputation on ancient genomes under different conditions has not yet been fully explored, with previous work primarily using an empirical approach of downsampling high-coverage paleogenomes. While these studies have provided invaluable insights into best practices for imputation, they rely on a fairly limited number of existing high-coverage samples with significant temporal and geographical biases. As an alternative, we used a coalescent simulation approach to generate genomes with characteristics of ancient DNA in order to evaluate more systematically the performance of two popular imputation programs, BEAGLE and GLIMPSE, under variable divergence times between the target sample and reference haplotypes, as well as different depths of coverage and reference sample sizes. Our results suggest that for genomes with coverage ≤0.1x, imputation performance is poor regardless of the strategy employed. Beyond 0.1x coverage, imputation generally improves as the size of the reference panel increases, and imputation accuracy decreases with increasing divergence between target and reference populations. It may thus be preferable to compile a smaller set of less diverged reference samples than a larger, more highly diverged dataset. In addition, imputation accuracy may plateau beyond some level of divergence between the reference and target populations. While accuracy at common variants is similar regardless of divergence time, rarer variants are better imputed in less diverged target samples. Furthermore, both imputation programs, but particularly GLIMPSE, overestimate high genotype probability calls, especially at low coverage. Our results provide insight into optimal strategies for ancient genotype imputation under a wide set of scenarios, complementing previous empirical studies based on imputing downsampled high-coverage ancient genomes.
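
The abstract does not state which accuracy or calibration metrics the authors used, so the following is only an illustrative sketch of two common choices when simulated truth is available: the squared correlation between true genotypes and imputed dosages, and the gap between the mean reported genotype probability and the observed accuracy among confident calls. The function names, the 0/1/2 genotype encoding, and the 0.9 threshold are assumptions made for this example and are not taken from the paper, BEAGLE, or GLIMPSE.

```python
from statistics import mean


def squared_correlation(truth, dosage):
    """Squared Pearson correlation between true genotypes (0/1/2)
    and imputed dosages (hypothetical helper, not from the paper)."""
    mt, md = mean(truth), mean(dosage)
    cov = sum((t - mt) * (d - md) for t, d in zip(truth, dosage))
    var_t = sum((t - mt) ** 2 for t in truth)
    var_d = sum((d - md) ** 2 for d in dosage)
    if var_t == 0 or var_d == 0:
        return float("nan")  # undefined when either vector is constant
    return cov * cov / (var_t * var_d)


def calibration_gap(genotype_probs, truth, threshold=0.9):
    """Mean maximum genotype probability minus observed accuracy, restricted
    to calls whose maximum probability is >= threshold.
    A positive value means the calls are overconfident.
    The threshold of 0.9 is an assumption for illustration."""
    confident = []
    for probs, t in zip(genotype_probs, truth):
        best = max(range(3), key=lambda g: probs[g])  # most probable genotype
        if probs[best] >= threshold:
            confident.append((probs[best], best == t))
    if not confident:
        return float("nan")
    mean_prob = mean(p for p, _ in confident)
    accuracy = mean(1.0 if ok else 0.0 for _, ok in confident)
    return mean_prob - accuracy


# Toy data: four sites for one simulated sample.
truth = [0, 1, 0, 1]
probs = [
    (0.95, 0.05, 0.00),  # confident and correct
    (0.00, 0.95, 0.05),  # confident and correct
    (0.05, 0.00, 0.95),  # confident but wrong
    (0.50, 0.50, 0.00),  # below threshold, ignored
]
gap = calibration_gap(probs, truth)
r2 = squared_correlation([0, 1, 2, 1], [0.0, 1.0, 2.0, 1.0])
```

In this toy example the confident calls report a mean probability of 0.95 but only two of three are correct, so the gap is positive, which is the kind of overconfidence the abstract describes at low coverage. In practice such metrics would typically be computed per minor allele frequency bin and per coverage level.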