Genetic Similarity Clustering Using the UK Biobank as a Reference Dataset.

IF 1 | Zone 4 (Medicine) | JCR Q4 (GENETICS & HEREDITY)
Ngoc-Quynh Le, Puya Gharahkhani, Stuart MacGregor
Twin Research and Human Genetics, pp. 1-8. Journal Article. Published 2025-04-28. DOI: 10.1017/thg.2025.15 (https://doi.org/10.1017/thg.2025.15)

Abstract

Incorporating genetic data from diverse populations is crucial for understanding genetic contributions to disease and for ensuring health equity in healthcare practice. However, existing reference panels either capture a limited number of populations or have small sample sizes. We examine the UK Biobank's performance as a reference for clustering genetically similar individuals. Leveraging data from participants of diverse origins, we aim to improve population representation and mitigate the bias caused by the limited number of populations in other reference panels. We combined the country-of-birth and ethnic-background data fields from the UK Biobank with genetic information to infer genetically similar population labels. A random forest model was then trained on genetic principal components to identify each individual's most genetically similar population. The model's performance was validated using the 1000 Genomes and CARTaGENE biobank data. We identified more diverse reference populations than are present in datasets such as 1000 Genomes, covering 19 populations worldwide. Our model achieved medium to high precision and recall for most labeled populations, although lower rates were observed in closely related groups. For instance, we identified 519 people in CARTaGENE most genetically similar to the Middle Eastern reference sample derived from the UK Biobank (there are no Middle Eastern samples in 1000 Genomes), yielding 81.1% precision and 97.0% recall compared to demographic-based information. This practical approach of clustering genetically similar individuals using existing biobank data may facilitate downstream analyses, such as genome-wide association studies or polygenic risk scores, for populations underrepresented in genetic studies.
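
The classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the data below are synthetic Gaussian clouds standing in for genetic principal components, the three group labels, the number of PCs, the forest size, and the helper `precision_recall` are all assumptions chosen for illustration, and scikit-learn's `RandomForestClassifier` is used as one common random forest implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for genetic principal components (an assumption, not
# real data): three hypothetical reference groups in a 10-dimensional PC space.
N_PCS = 10
N_PER_GROUP = 300
labels = ["group_A", "group_B", "group_C"]  # placeholder population labels

centers = np.zeros((len(labels), N_PCS))
centers[0, 0] = 6.0   # group_A is shifted along PC1
centers[1, 1] = 6.0   # group_B is shifted along PC2
centers[2, 0] = -6.0  # group_C is shifted the other way along PC1

X = np.vstack([c + rng.normal(size=(N_PER_GROUP, N_PCS)) for c in centers])
y = np.repeat(labels, N_PER_GROUP)

# Hold out a stratified test set to mimic validation against labeled samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Train the random forest on the PCs; n_estimators=200 is an arbitrary choice.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Each individual is assigned the label of the most similar reference group.
y_pred = model.predict(X_test)


def precision_recall(y_true, y_pred, label):
    """Per-label precision and recall (hypothetical helper)."""
    tp = np.sum((y_pred == label) & (y_true == label))
    predicted = np.sum(y_pred == label)
    actual = np.sum(y_true == label)
    precision = tp / predicted if predicted else float("nan")
    recall = tp / actual if actual else float("nan")
    return precision, recall


for label in labels:
    p, r = precision_recall(y_test, y_pred, label)
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
```

In this toy setting the groups are well separated, so precision and recall are high for every label. With real data, closely related groups overlap in PC space, which is consistent with the lower rates the abstract reports for such groups.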

Source journal
Twin Research and Human Genetics (Medicine: Obstetrics & Gynecology)
CiteScore: 1.50
Self-citation rate: 11.10%
Articles published: 37
Review time: 6-12 weeks
About the journal: Twin Research and Human Genetics is the official journal of the International Society for Twin Studies. Twin Research and Human Genetics covers all areas of human genetics with an emphasis on twin studies, genetic epidemiology, psychiatric and behavioral genetics, and research on multiple births in the fields of epidemiology, genetics, endocrinology, fetal pathology, obstetrics and pediatrics. Through Twin Research and Human Genetics the society aims to publish the latest research developments in twin studies throughout the world.