Interpreting supervised machine learning inferences in population genomics using haplotype matrix permutations.

IF 5.3 | Zone 1 (Biology) | JCR Q1, Biochemistry & Molecular Biology

Molecular Biology and Evolution | Pub Date: 2025-10-06 | DOI: 10.1093/molbev/msaf250

Linh N Tran, David Castellano, Ryan N Gutenkunst


Citations: 0

Abstract

Interpreting supervised machine learning inferences in population genomics using haplotype matrix permutations.

Supervised machine learning methods, such as convolutional neural networks (CNNs), that use haplotype matrices as input data have become powerful tools for population genomics inference. However, these methods often lack interpretability, making it difficult to understand which population genetics features drive their predictions, a critical limitation for method development and biological interpretation. Here we introduce a systematic permutation approach that progressively disrupts population genetics features within input test haplotype matrices, including linkage disequilibrium, haplotype structure, and allele frequencies. By measuring performance degradation after each permutation, the importance of each feature can be assessed. We applied our approach to three published CNNs for positive selection and demographic history inference. We found that the positive selection inference CNN ImaGene critically depends on haplotype structure and linkage disequilibrium patterns, while the demographic inference CNN relies primarily on allele frequency information. Surprisingly, another positive selection inference CNN, disc-pg-gan, achieved high accuracy using only simple allele count information, suggesting its training regime may not adequately challenge the model to learn complex population genetic signatures. Our approach provides a straightforward, model-agnostic, and biologically motivated framework for interpreting any haplotype matrix-based method, offering insights that can guide both method development and application in population genomics.
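The abstract does not spell out the individual permutations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a binary haplotype matrix with haplotypes as rows and sites as columns, and a hypothetical `predict` callable standing in for a trained model. The three shuffles and their names are illustrative choices: each one preserves less of the original signal than the one before, and the drop in accuracy after a shuffle suggests how much the model relied on what was destroyed.

```python
import numpy as np


def shuffle_site_order(hap, rng):
    """Permute the order of sites (columns).

    Keeps every haplotype's alleles and every site's allele count,
    but scrambles where along the region each site sits.
    """
    return hap[:, rng.permutation(hap.shape[1])]


def shuffle_within_sites(hap, rng):
    """Permute alleles independently within each site (column).

    Keeps each site's allele count (and so its allele frequency),
    but breaks associations between sites: linkage disequilibrium
    and haplotype structure.
    """
    return rng.permuted(hap, axis=0)


def shuffle_all_entries(hap, rng):
    """Permute every entry of the matrix.

    Keeps only the total number of derived alleles; per-site
    allele frequencies are destroyed as well.
    """
    return rng.permutation(hap.ravel()).reshape(hap.shape)


def accuracy(predict, mats, labels):
    """Fraction of matrices for which predict(matrix) equals the label."""
    preds = np.array([predict(m) for m in mats])
    return float(np.mean(preds == np.asarray(labels)))


def accuracy_drop(predict, mats, labels, permute, rng):
    """Accuracy on the original test set minus accuracy after permuting it."""
    permuted = [permute(m, rng) for m in mats]
    return accuracy(predict, mats, labels) - accuracy(predict, permuted, labels)
```

Under these assumptions, a model that only counts derived alleles would show no accuracy drop under any of the three shuffles, whereas a model that reads haplotype structure should degrade once alleles are shuffled within sites.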

Source journal

Molecular Biology and Evolution (Biology: Evolutionary Biology)

CiteScore: 19.70
Self-citation rate: 3.70%
Articles published: 257
Review time: 1 month

Journal description: Molecular Biology and Evolution publishes research at the interface of molecular (including genomics) and evolutionary biology. It considers manuscripts containing patterns, processes, and predictions at all levels of organization: population, taxonomic, functional, and phenotypic. It is interested in fundamental discoveries, new and improved methods, resources, technologies, and theories advancing evolutionary research. It also publishes balanced reviews of recent developments in genome evolution and forward-looking perspectives suggesting future directions in molecular evolution applications.