Genetic Similarity Clustering Using the UK Biobank as a Reference Dataset
Authors: Ngoc-Quynh Le, Puya Gharahkhani, Stuart MacGregor
Journal: Twin Research and Human Genetics, pp. 1-8
Published: 2025-04-28
DOI: https://doi.org/10.1017/thg.2025.15
Impact factor: 1.0 (JCR Q4, Genetics & Heredity)
Citations: 0
Abstract
Incorporating genetic data from diverse populations is crucial for understanding genetic contributions to disease and for ensuring equity in healthcare. However, existing reference panels either capture a limited number of populations or have small sample sizes. We examine the UK Biobank's performance as a reference for clustering genetically similar individuals. Leveraging data from participants of diverse origins, we aim to improve population representation and mitigate the bias caused by the limited number of populations in other reference panels. We combined the country-of-birth and ethnic-background data fields from the UK Biobank with genetic information to infer genetically similar population labels. A random forest model was then trained on genetic principal components to identify each individual's most genetically similar population. The model's performance was validated using the 1000 Genomes and CARTaGENE biobank data. We identified more diverse reference populations than are present in datasets such as 1000 Genomes, covering 19 populations worldwide. Our model achieved medium to high precision and recall for most labeled populations, although lower rates were observed in closely related groups. For instance, we identified 519 people in CARTaGENE as most genetically similar to the Middle Eastern reference sample derived from the UK Biobank (there are no Middle Eastern samples in 1000 Genomes), yielding 81.1% precision and 97.0% recall compared with demographic-based information. This practical approach to clustering genetically similar individuals using existing biobank data may facilitate downstream analyses, such as genome-wide association studies or polygenic risk scores, in populations underrepresented in genetic studies.
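The per-population precision and recall reported above can be illustrated with a minimal sketch. It assumes two paired lists of labels, one from a reference source (for example, demographic information) and one predicted by a classifier. The function name `per_label_precision_recall` and the toy labels are hypothetical and are not taken from the study; this is not the authors' evaluation code.

```python
from collections import Counter


def per_label_precision_recall(reference, predicted):
    """Return {label: (precision, recall)} for two paired label lists.

    precision = correct predictions of a label / all predictions of that label
    recall    = correct predictions of a label / all reference cases of that label
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    ref_count = Counter(reference)    # how often each label occurs in the reference
    pred_count = Counter(predicted)   # how often each label was predicted
    true_pos = Counter(r for r, p in zip(reference, predicted) if r == p)

    scores = {}
    for label in sorted(set(reference) | set(predicted)):
        tp = true_pos[label]
        # A label that was never predicted (or never present) has an undefined rate.
        precision = tp / pred_count[label] if pred_count[label] else float("nan")
        recall = tp / ref_count[label] if ref_count[label] else float("nan")
        scores[label] = (precision, recall)
    return scores


# Hypothetical toy labels for illustration only, NOT data from the study.
reference = ["pop_A", "pop_A", "pop_A", "pop_B", "pop_B", "pop_C"]
predicted = ["pop_A", "pop_A", "pop_B", "pop_B", "pop_B", "pop_A"]

scores = per_label_precision_recall(reference, predicted)
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case `pop_C` is never predicted, so its precision is undefined while its recall is zero. A small or closely related group can score poorly in the same way even when the overall accuracy looks acceptable.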
About the journal:
Twin Research and Human Genetics is the official journal of the International Society for Twin Studies. Twin Research and Human Genetics covers all areas of human genetics with an emphasis on twin studies, genetic epidemiology, psychiatric and behavioral genetics, and research on multiple births in the fields of epidemiology, genetics, endocrinology, fetal pathology, obstetrics and pediatrics.
Through Twin Research and Human Genetics the society aims to publish the latest research developments in twin studies throughout the world.