Cole M Williams, Jared O'Connell, Ethan Jewett, William A Freyman, Christopher R Gignoux, Sohini Ramachandran, Amy L Williams
Phasing millions of samples achieves near perfect accuracy, enabling parent-of-origin analyses.
HGG Advances, article 100479. Published 2025-07-22. DOI: 10.1016/j.xhgg.2025.100479 (https://doi.org/10.1016/j.xhgg.2025.100479)
Abstract
Haplotype phasing, the process of determining which genetic variants are physically located on the same chromosome, is crucial for genetic analyses. Here, we benchmark SHAPEIT and Beagle, two state-of-the-art phasing methods, on two large datasets: >8 million research-consented 23andMe, Inc. customers and the UK Biobank (UKB). Remarkably, both methods' median switch error rate (SER) (after excluding single SNP switches, which we call 'blips') is 0.00% across all tested 23andMe trio children and 0.026% in British samples from UKB. Across UKB samples, switch errors predominantly occur in regions lacking identity-by-descent (IBD) coverage. SHAPEIT and Beagle excel at intra-chromosomal phasing, but lack the ability to phase across chromosomes, motivating us to develop HAPTiC (HAPlotype Tiling and Clustering), an inter-chromosomal phasing method that assigns paternal and maternal variants genome-wide. Our approach uses IBD segments to phase blocks of variants on different chromosomes. HAPTiC represents the segments a focal individual shares with their relatives as nodes in a signed graph and performs spectral clustering. We test HAPTiC on 1022 UKB trios, yielding a median per-site phase error of 0.13% in regions covered by IBD segments (45.1% of sites). We also ran HAPTiC in the 23andMe database and found a median phase error rate of 0.49% in Europeans (100% of sites) and 0.16% in admixed Africans (99.8% of sites). HAPTiC enables analyses that require the parent-of-origin of variants, such as association studies and ancestry inference of untyped parents.
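The abstract says only that HAPTiC places IBD segments as nodes in a signed graph and applies spectral clustering; it gives no implementation details. As a rough illustration of that general idea, and not the authors' method, the following Python sketch bipartitions a small signed graph using the signed Laplacian (degrees computed from absolute edge weights) and the sign pattern of its lowest eigenvector. The function name `signed_spectral_bipartition`, the edge weights, and the interpretation of positive edges as "same parent" and negative edges as "opposite parents" are all assumptions made for this example.

```python
import numpy as np


def signed_spectral_bipartition(weights):
    """Split the nodes of a signed graph into two groups (hypothetical helper).

    weights: symmetric (n, n) array. weights[i, j] > 0 is taken as evidence
    that nodes i and j belong to the same group, < 0 as evidence that they
    belong to opposite groups, and 0 as no evidence.
    Returns an integer array of 0/1 labels, with node 0 always labelled 0.
    """
    w = np.array(weights, dtype=float)  # copy so the caller's array is untouched
    if w.ndim != 2 or w.shape[0] != w.shape[1] or not np.allclose(w, w.T):
        raise ValueError("weights must be a square symmetric matrix")
    np.fill_diagonal(w, 0.0)  # ignore self-loops

    # Signed Laplacian: the degree of a node sums absolute edge weights.
    degree = np.abs(w).sum(axis=1)
    laplacian = np.diag(degree) - w

    # eigh returns eigenvalues in ascending order, so column 0 is the
    # eigenvector of the smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    lowest = eigvecs[:, 0]

    labels = (lowest >= 0).astype(int)
    # An eigenvector's overall sign is arbitrary; pin node 0 to label 0.
    if labels[0] == 1:
        labels = 1 - labels
    return labels


# Toy example (made-up weights): nodes 0-2 stand for segments assumed to come
# from one parent, nodes 3-4 for segments assumed to come from the other.
example_weights = np.zeros((5, 5))
for i, j, weight in [(0, 1, 2.0), (1, 2, 1.0), (3, 4, 1.0),
                     (0, 3, -1.5), (2, 4, -1.0)]:
    example_weights[i, j] = example_weights[j, i] = weight

example_labels = signed_spectral_bipartition(example_weights)
print(example_labels.tolist())
```

In this toy graph every positive edge lies within a group and every negative edge crosses between groups, so the lowest eigenvector separates nodes 0-2 from nodes 3-4. The two labels are interchangeable: a clustering of this kind groups segments by parent but does not, by itself, say which group is paternal and which is maternal.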