A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics.

IF 3.5 · Zone 2, Biology · Q2, Biotechnology & Applied Microbiology
Susanne Zabel, Samira Breitling, Cosimo Posth, Kay Nieselt
BMC Genomics, vol. 26(1), p. 537. Published 2025-05-27. DOI: 10.1186/s12864-025-11728-1 (https://doi.org/10.1186/s12864-025-11728-1)
Citations: 0

Abstract

A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics.

Background: Principal Component Analysis (PCA) is widely used in population genetics to visualize genetic relationships and population structure. In ancient genomics, genotype information may remain partly unresolved because ancient DNA is scarce and degraded. While methods like SmartPCA allow the projection of ancient samples despite missing data, they do not quantify projection uncertainty. How reliable PCA projections are for ancient genotype samples, which are often very sparse, is not well understood. Ignoring this uncertainty may lead to overconfident conclusions about the observed genetic relationships and population structure.
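To make the idea of projecting a sample "despite missing data" concrete, here is a minimal sketch of a least-squares projection restricted to the observed loci. It is an illustration under stated assumptions, not SmartPCA's actual implementation: the toy genotype matrix, the NaN coding for missing loci, and the helper names `fit_pca` and `project_observed` are all hypothetical.

```python
import numpy as np

def fit_pca(X, k=2):
    """Return per-locus means and the top-k loadings (loci x k) of a complete reference matrix."""
    mean = X.mean(axis=0)
    # SVD of the centered matrix; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T

def project_observed(g, mean, V):
    """Least-squares projection of genotype vector g using only non-missing loci (NaN = missing)."""
    obs = ~np.isnan(g)
    coords, *_ = np.linalg.lstsq(V[obs], g[obs] - mean[obs], rcond=None)
    return coords

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 reference individuals x 200 SNPs, genotypes coded 0/1/2.
ref = rng.integers(0, 3, size=(50, 200)).astype(float)
mean, V = fit_pca(ref, k=2)

sample = ref[0].copy()
full = project_observed(sample, mean, V)      # all loci observed
sample[rng.random(200) < 0.5] = np.nan        # hide roughly half of the loci
sparse = project_observed(sample, mean, V)    # same individual, sparse genotype
print("complete:", full)
print("sparse:  ", sparse)
```

With no missing loci the least-squares solution coincides with the ordinary PCA projection, because the loadings are orthonormal. Once loci are hidden, the same individual lands at a different point, and a single point estimate gives no indication of how far off it might be.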

Results: This study systematically investigates the impact of missing loci on PCA projections using both simulated and real ancient human genotype data. Through extensive simulations with high-coverage ancient samples, we demonstrate that increasing levels of missing data can lead to less accurate SmartPCA projections, highlighting the importance of considering uncertainty when interpreting PCA results from ancient samples. To address this, we developed a probabilistic framework to quantify the uncertainty in PCA projections due to missing data. By applying our methodology to modern and ancient West Eurasian genotype samples from the Allen Ancient DNA Resource database, we show high concordance between our predicted and empirically derived projection distributions. Applying this framework to real-world data, we demonstrate its utility in predicting and visualizing embedding uncertainties for ancient samples of varying SNP coverage.
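The kind of simulation described above can be sketched empirically: take a fully genotyped sample, hide random subsets of loci at increasing missing rates, re-project each time, and summarize the spread. This is a Monte Carlo illustration on hypothetical toy data, not the authors' probabilistic model or the TrustPCA implementation; the function names and the chosen missing rates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical toy reference panel: 60 individuals x 300 SNPs, genotypes coded 0/1/2.
ref = rng.integers(0, 3, size=(60, 300)).astype(float)
mean = ref.mean(axis=0)
_, _, vt = np.linalg.svd(ref - mean, full_matrices=False)
V = vt[:2].T                                  # loadings for PC1 and PC2

def project(g, keep):
    """Least-squares projection using only the loci indexed by `keep`."""
    coords, *_ = np.linalg.lstsq(V[keep], g[keep] - mean[keep], rcond=None)
    return coords

def masked_projections(g, missing_rate, n_rep=200):
    """Project g repeatedly, each time keeping a fresh random subset of loci."""
    n_keep = max(2, round(g.size * (1 - missing_rate)))
    reps = [project(g, rng.choice(g.size, n_keep, replace=False)) for _ in range(n_rep)]
    return np.array(reps)

g = ref[0]
truth = project(g, np.arange(g.size))         # projection with every locus observed
rmse_by_rate = {}
cov_by_rate = {}
for rate in (0.5, 0.9, 0.99):
    pts = masked_projections(g, rate)
    rmse_by_rate[rate] = np.sqrt(((pts - truth) ** 2).sum(axis=1).mean())
    cov_by_rate[rate] = np.cov(pts.T)         # 2x2 empirical covariance of the projections
    print(f"missing={rate:.2f}  rmse={rmse_by_rate[rate]:.3f}")
```

The empirical covariance for each missing rate could be drawn as an ellipse around the point estimate on a PC1/PC2 plot. A model that predicts this spread directly would avoid the repeated masking, which in any case is only possible here because the complete genotype is known.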

Conclusion: Our results emphasize the importance of accounting for projection uncertainty in ancient population studies. We therefore make our probabilistic model available through TrustPCA, a user-friendly web tool that provides researchers with uncertainty estimates alongside PCA projections, facilitating data exploration in ancient human genomic studies and enhancing transparency in data quality reporting.

Source journal: BMC Genomics (Biology: Biotechnology & Applied Microbiology)
CiteScore: 7.40
Self-citation rate: 4.50%
Articles published: 769
Review time: 6.4 months
About the journal: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.