A probabilistic approach to visualize the effect of missing data on PCA in ancient human genomics
Susanne Zabel, Samira Breitling, Cosimo Posth, Kay Nieselt
BMC Genomics 26(1):537, published 2025-05-27. DOI: https://doi.org/10.1186/s12864-025-11728-1
Abstract
Background: Principal Component Analysis (PCA) is widely used in population genetics to visualize genetic relationships and population structure. In ancient genomics, genotype information may remain partly unresolved because of the low abundance and degraded quality of ancient DNA. Methods such as SmartPCA can project ancient samples despite missing data, but they do not quantify projection uncertainty, and the reliability of PCA projections for the often very sparse ancient genotype samples is not well understood. Ignoring this uncertainty may lead to overconfident conclusions about the observed genetic relationships and population structure.
Results: This study systematically investigates the impact of missing loci on PCA projections using both simulated and real ancient human genotype data. Through extensive simulations with high-coverage ancient samples, we demonstrate that increasing levels of missing data can lead to less accurate SmartPCA projections, which highlights the importance of considering uncertainty when interpreting PCA results from ancient samples. To address this, we developed a probabilistic framework that quantifies the uncertainty in PCA projections due to missing data. Applying our methodology to modern and ancient West Eurasian genotype samples from the Allen Ancient DNA Resource database, we show high concordance between our predicted projection distributions and empirically derived distributions. Applying this framework to real-world data, we demonstrate its utility in predicting and visualizing embedding uncertainties for ancient samples of varying SNP coverage.
Conclusion: Our results emphasize the importance of accounting for projection uncertainty in ancient population studies. We therefore make our probabilistic model available through TrustPCA, a user-friendly web tool that provides researchers with uncertainty estimates alongside PCA projections, facilitating data exploration in ancient human genomic studies and enhancing transparency in data quality reporting.
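The effect described in the Results can be illustrated with a small simulation. The sketch below is not the authors' probabilistic model, nor SmartPCA or TrustPCA; it is a toy example under stated assumptions: synthetic genotypes for two populations, a plain SVD-based PCA on the complete reference panel, and a least-squares projection restricted to the observed loci. All function names are hypothetical. It repeatedly masks random loci of one sample and reports how the spread of its projected coordinates changes with the missing rate.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Fit PCA on a complete reference genotype matrix (samples x loci)."""
    mean = reference.mean(axis=0)
    # SVD of the centred matrix; rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components].T  # loadings: loci x components

def project_with_missing(sample, observed, mean, loadings):
    """Least-squares projection of one sample using only its observed loci."""
    centred = sample[observed] - mean[observed]
    coords, *_ = np.linalg.lstsq(loadings[observed], centred, rcond=None)
    return coords

def simulate_projection_spread(sample, mean, loadings, missing_rate, n_draws, rng):
    """Project the same sample under many random missingness masks."""
    n_loci = sample.shape[0]
    # Keep at least as many loci as components so the fit is determined.
    n_keep = max(loadings.shape[1], int(round(n_loci * (1.0 - missing_rate))))
    draws = np.empty((n_draws, loadings.shape[1]))
    for i in range(n_draws):
        observed = np.zeros(n_loci, dtype=bool)
        observed[rng.choice(n_loci, size=n_keep, replace=False)] = True
        draws[i] = project_with_missing(sample, observed, mean, loadings)
    return draws

rng = np.random.default_rng(0)
# Assumed toy data: two populations with different allele frequencies.
freqs = rng.uniform(0.1, 0.9, size=(2, 2000))
reference = np.vstack(
    [rng.binomial(2, f, size=(50, 2000)) for f in freqs]
).astype(float)
mean, loadings = fit_pca(reference)
sample = rng.binomial(2, freqs[0]).astype(float)

# Projection with every locus observed, for comparison.
full = project_with_missing(sample, np.ones(2000, dtype=bool), mean, loadings)
print("complete-data projection:", full)
for rate in (0.5, 0.9, 0.99):
    draws = simulate_projection_spread(sample, mean, loadings, rate, 200, rng)
    print(f"missing={rate:.2f}  sd per PC={draws.std(axis=0)}")
```

In this toy setting the standard deviation of the projected coordinates grows as fewer loci remain, which is the kind of empirical distribution the abstract compares model predictions against. Real ancient-DNA missingness is not uniformly random, so the numbers here should not be read as estimates for any actual sample.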
About the journal:
BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics.
BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.