Estimating allele frequencies, ancestry proportions and genotype likelihoods in the presence of mapping bias.

Impact factor 2.2; Zone 3 (Biology); JCR Q3 (Genetics & Heredity)
Torsten Günther, Amy Goldberg, Joshua G Schraiber
Journal: G3: Genes|Genomes|Genetics
Publication date: 2025-07-30
Publication type: Journal Article
DOI: 10.1093/g3journal/jkaf172 (https://doi.org/10.1093/g3journal/jkaf172)
Citations: 0

Abstract

Population genomic analyses rely on an accurate and unbiased characterization of the genetic composition of the studied population. For short-read, high-throughput sequencing data, mapping sequencing reads to a linear reference genome can bias population genetic inference due to mismatches in reads carrying non-reference alleles. In this study, we investigate the impact of mapping bias on allele frequency estimates from pseudohaploid data and genotype likelihoods, two approaches commonly used in ultra-low to medium coverage sequencing. To mitigate mapping bias, we propose an empirical adjustment to genotype likelihoods. Using data from the 1000 Genomes Project, we find that our new method improves allele frequency estimation. To test a downstream application, we simulate ancient DNA data with realistic post-mortem damage to compare widely used methods for estimating ancestry proportions under different scenarios, including reference genome selection, population divergence, and sequencing depth. Our findings reveal that mapping bias can lead to differences in estimated admixture proportion of up to 4% depending on the reference population. However, the choice of method has a much stronger impact, with some methods showing differences of 10%. qpAdm appears to perform best at estimating simulated ancestry proportions, but it is sensitive to mapping bias and its applicability may vary across species due to its requirement for additional populations beyond the sources and target population. Our adjusted genotype likelihood approach largely mitigates the effect of mapping bias on genome-wide ancestry estimates from genotype likelihood-based tools. However, it cannot account for the bias introduced by the method itself or the noise in individual site allele frequency estimates due to low sequencing depth. Overall, our study provides valuable insights for obtaining more precise estimates of allele frequencies and ancestry proportions in empirical studies.
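
The mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' method and does not reproduce their genotype likelihood adjustment. It only shows, under simplifying assumptions, how losing reads that carry the non-reference allele pulls a pseudohaploid allele frequency estimate towards the reference allele. The function name `pseudohaploid_frequency` and the parameter `alt_map_rate` (the probability that a read carrying the non-reference allele maps, with reference reads assumed to always map) are hypothetical and chosen for illustration.

```python
import random


def pseudohaploid_frequency(true_freq, n_individuals, depth, alt_map_rate, seed=0):
    """Estimate the non-reference allele frequency from pseudohaploid calls.

    Assumptions of this toy model (not taken from the paper):
    - genotypes follow Hardy-Weinberg proportions at a single biallelic site;
    - reads carrying the reference allele always map;
    - reads carrying the non-reference allele map with probability alt_map_rate;
    - there are no sequencing errors and no post-mortem damage.
    """
    rng = random.Random(seed)
    alt_calls = 0
    total_calls = 0
    for _ in range(n_individuals):
        # Number of non-reference copies in the diploid genotype (0, 1 or 2).
        n_alt = sum(rng.random() < true_freq for _ in range(2))
        mapped = []
        for _ in range(depth):
            is_alt = rng.random() < n_alt / 2
            if is_alt and rng.random() >= alt_map_rate:
                continue  # the read carrying the non-reference allele fails to map
            mapped.append(is_alt)
        if mapped:
            # Pseudohaploid call: draw one mapped read at random per individual.
            alt_calls += rng.choice(mapped)
            total_calls += 1
    return alt_calls / total_calls if total_calls else float("nan")


if __name__ == "__main__":
    for rate in (1.0, 0.9, 0.8):
        estimate = pseudohaploid_frequency(0.5, 20000, 2, rate, seed=1)
        print(f"alt_map_rate={rate:.1f}  estimated frequency={estimate:.3f}")
```

With `alt_map_rate` set to 1.0 the estimate should sit close to the true frequency, and lowering it should shift the estimate downwards. The size of that shift depends entirely on the assumed mapping rate and depth, so it should not be read as a prediction of the bias reported in the study.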

Source journal
G3: Genes|Genomes|Genetics (Genetics & Heredity)
CiteScore: 5.10
Self-citation rate: 3.80%
Articles published: 305
Review time: 3-8 weeks
Journal description: G3: Genes, Genomes, Genetics provides a forum for the publication of high-quality foundational research, particularly research that generates useful genetic and genomic information such as genome maps, single gene studies, genome-wide association and QTL studies, as well as genome reports, mutant screens, and advances in methods and technology. The Editorial Board of G3 believes that rapid dissemination of these data is the necessary foundation for analysis that leads to mechanistic insights. G3, published by the Genetics Society of America, meets the critical and growing need of the genetics community for rapid review and publication of important results in all areas of genetics. G3 offers the opportunity to publish the puzzling finding or to present unpublished results that may not have been submitted for review and publication due to a perceived lack of a potential high-impact finding. G3 has earned the DOAJ Seal, which is a mark of certification for open access journals, awarded by DOAJ to journals that achieve a high level of openness, adhere to Best Practice and high publishing standards.