Accounting for population structure in deep learning models for genomic analysis

IF 4.5 · Region 2 (Medicine) · JCR Q2 · COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Journal of Biomedical Informatics · Pub Date: 2025-07-05 · DOI: 10.1016/j.jbi.2025.104873

Gabrielle Dagasso, Matthias Wilms, Raissa Souza, Nils D. Forkert

Journal of Biomedical Informatics, volume 169, Article 104873. Full text: https://www.sciencedirect.com/science/article/pii/S1532046425001029

Citations: 0

Abstract

Background

Deep learning methods have become increasingly popular for genotype analyses in recent years. In conventional genomic analyses, it is important to account for confounders to avoid biasing the results. Genetic relatedness is one of the most common confounders in conventional genomic analyses, and there is a general consensus that it should be considered in the analysis to prevent distant levels of common ancestry from affecting the identification of causal variants. In contrast, genetic relatedness is not considered in many of the recently published deep learning models.
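As a rough illustration of what "genetic relatedness" means in practice, the sketch below computes a genetic relatedness matrix from a tiny, made-up genotype table using the widely used standardized cross-product formula. The genotype values, the function names, and the choice of this particular estimator are assumptions for illustration only; the abstract does not say which relatedness measure the authors have in mind.

```python
# Toy genotype table (hypothetical data): rows are individuals, columns are SNPs,
# values are counts of the alternate allele (0, 1, or 2).
GENOTYPES = [
    [2, 2, 0, 0],  # individual A1
    [2, 1, 0, 1],  # individual A2 (same hypothetical population as A1)
    [0, 0, 2, 2],  # individual B1
    [0, 1, 2, 1],  # individual B2 (same hypothetical population as B1)
]


def allele_frequencies(genotypes):
    """Return the alternate-allele frequency of each SNP."""
    n = len(genotypes)
    m = len(genotypes[0])
    return [sum(row[i] for row in genotypes) / (2 * n) for i in range(m)]


def relatedness_matrix(genotypes):
    """Average, over polymorphic SNPs, of the product of standardized genotypes."""
    freqs = allele_frequencies(genotypes)
    # Monomorphic SNPs carry no information and would divide by zero, so skip them.
    used = [i for i, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs available")
    n = len(genotypes)
    grm = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            total = 0.0
            for i in used:
                p = freqs[i]
                total += (
                    (genotypes[j][i] - 2 * p)
                    * (genotypes[k][i] - 2 * p)
                    / (2 * p * (1 - p))
                )
            grm[j][k] = total / len(used)
    return grm


grm = relatedness_matrix(GENOTYPES)
for row in grm:
    print([round(value, 2) for value in row])
```

In this toy table, pairs from the same hypothetical population receive a higher relatedness value than pairs from different populations, which is the kind of structure a conventional analysis would adjust for.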

Objective

This study investigates whether the omission of genetic relatedness in deep learning models, common in recent literature, introduces confounding effects similar to those observed in conventional genomic analyses, particularly due to ancestry-related variants.

Methods

We developed a deep learning model that performs classification on single nucleotide polymorphism (SNP) data from simulated and real-world datasets, and used it to examine whether population structure confounds the model and potentially causes shortcut learning.
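To make the notion of a population-structure shortcut concrete, here is a minimal simulation sketch. It is not the authors' simulation or model: the two-SNP setup, the allele frequencies, the risk values, and the case-control mean difference used in place of a deep learning classifier are all assumptions chosen to keep the example small. It shows that a non-causal, ancestry-informative SNP becomes associated with the label once disease risk differs between populations.

```python
import random


def simulate(n, confounded, seed=0):
    """Simulate (causal_snp, ancestry_snp, label) for n individuals.

    causal_snp:   same allele frequency in both populations, raises disease risk.
    ancestry_snp: allele frequency differs by population, has no effect on risk.
    confounded:   if True, one population has a higher baseline risk, so
                  ancestry alone becomes predictive of the label.
    All numeric settings are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        population = rng.randrange(2)
        # Each genotype is the sum of two independent allele draws (0, 1, or 2).
        causal = sum(rng.random() < 0.3 for _ in range(2))
        ancestry_freq = 0.8 if population == 1 else 0.2
        ancestry = sum(rng.random() < ancestry_freq for _ in range(2))
        risk = 0.2 + 0.2 * causal
        if confounded:
            risk += 0.3 * population
        label = int(rng.random() < risk)
        rows.append((causal, ancestry, label))
    return rows


def mean_diff(rows, snp):
    """Mean genotype in cases minus mean genotype in controls for one SNP."""
    cases = [row[snp] for row in rows if row[2] == 1]
    controls = [row[snp] for row in rows if row[2] == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


confounded_rows = simulate(5000, confounded=True)
unconfounded_rows = simulate(5000, confounded=False)

for name, rows in [("confounded", confounded_rows), ("unconfounded", unconfounded_rows)]:
    print(
        name,
        "causal SNP:", round(mean_diff(rows, 0), 3),
        "ancestry SNP:", round(mean_diff(rows, 1), 3),
    )
```

Any classifier trained on the confounded dataset could reach good accuracy partly through the ancestry SNP, which is the shortcut the study looks for by comparing confounded and unconfounded models.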

Results

The results of this study suggest that population structure may not significantly affect the performance of the deep learning model. However, explainable AI analysis of SNP feature importance revealed notable differences in what the confounded and unconfounded models focus on.

Conclusion

While population structure may not heavily affect model performance, it is important when designing deep learning models for genomic datasets to limit the models' capacity for shortcut learning, in which they rely on ancestry-related variants rather than potentially relevant biomarkers of the disease or disorder in question. The code used to perform these analyses can be found at: https://github.com/notTrivial/populationStructure.

Source journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 8.90

Self-citation rate: 6.70%

Articles published: 243

Review time: 32 days

About the journal: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.