Decoding ancestry-specific genetic risk: interpretable deep feature selection reveals prostate cancer SNP disparities in diverse populations.

IF 6.1 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining Pub Date : 2025-09-29 DOI:10.1186/s13040-025-00470-9

Zhong Chen, Zichen Lao, You Lu, Wensheng Zhang, Andrea Edwards, Kun Zhang

{"title":"Decoding ancestry-specific genetic risk: interpretable deep feature selection reveals prostate cancer SNP disparities in diverse populations.","authors":"Zhong Chen, Zichen Lao, You Lu, Wensheng Zhang, Andrea Edwards, Kun Zhang","doi":"10.1186/s13040-025-00470-9","DOIUrl":null,"url":null,"abstract":"Background: The clinical potential of single nucleotide polymorphisms (SNPs) in prostate cancer (PCa) diagnosis has been extensively explored using conventional statistical and machine learning approaches. However, the predictive power and interpretability of these methods remain inadequate for clinical translation, primarily due to limited generalization across high-dimensional SNP datasets. This study addresses the contested diagnostic utility of SNPs by integrating interpretable feature selection with deep learning to enhance both classification performance and biological relevance.Methods: We propose an interpretable deep feature selection framework designed to enhance both the classification performance and biological relevance of SNP markers in distinguishing between benign and malignant prostate cancer samples. This study specifically investigates the debated diagnostic value of SNPs in PCa classification by integrating feature selection with deep learning to uncover actionable insights. Specifically, our framework comprises four key components: (1) Heuristic feature reduction, which eliminates irrelevant SNPs during gradient computation for training deep neural networks (DNNs); (2) Iterative SNP subset optimization, aiming at maximizing classification AUC during model training; (3) Gradient variance minimization, mitigating instability caused by limited sample sizes; and (4) Nonlinear interaction modeling, which extracts high-level SNP interactions through hierarchical representations.Results: Evaluated on the PLCO, BPC3, and MEC-AA datasets, our method achieved mean AUC scores of 0.747, 0.751, and 0.559, respectively, demonstrating statistically significant improvements (p < 0.05, a paired t-test) over existing approaches. Notably, the lower AUC for MEC-AA may reflect inherent population-specific complexities, as this dataset focuses on African American men, a group historically underrepresented in genomic studies. For interpretability, our framework identified 345, 373, and 437 consensus SNP markers across the PLCO, BPC3, and MEC-AA cohorts, respectively. Key SNPs were further validated against prior research on PCa racial disparities: rs10086908 and rs2273669 (PLCO); rs12284087, rs902774, rs9364554, and rs7611694 (BPC3); and rs3123078 and rs1447295 (MEC-AA) exhibited strong concordance with established loci linked to ethnic-specific risk profiles. For instance, rs1447295 on chromosome 8q24, recurrently associated with African ancestry, underscores the method's ability to recover population-relevant variants.Conclusion: By synergizing interpretable feature selection with deep learning, this work advances the translation of SNP-based biomarkers into clinically actionable tools while clarifying their contested diagnostic role in PCa.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"66"},"PeriodicalIF":6.1000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481780/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-025-00470-9","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Background: The clinical potential of single nucleotide polymorphisms (SNPs) in prostate cancer (PCa) diagnosis has been extensively explored using conventional statistical and machine learning approaches. However, the predictive power and interpretability of these methods remain inadequate for clinical translation, primarily due to limited generalization across high-dimensional SNP datasets. This study addresses the contested diagnostic utility of SNPs by integrating interpretable feature selection with deep learning to enhance both classification performance and biological relevance.

Methods: We propose an interpretable deep feature selection framework designed to enhance both the classification performance and biological relevance of SNP markers in distinguishing between benign and malignant prostate cancer samples. This study specifically investigates the debated diagnostic value of SNPs in PCa classification by integrating feature selection with deep learning to uncover actionable insights. Specifically, our framework comprises four key components: (1) Heuristic feature reduction, which eliminates irrelevant SNPs during gradient computation for training deep neural networks (DNNs); (2) Iterative SNP subset optimization, aiming at maximizing classification AUC during model training; (3) Gradient variance minimization, mitigating instability caused by limited sample sizes; and (4) Nonlinear interaction modeling, which extracts high-level SNP interactions through hierarchical representations.

Results: Evaluated on the PLCO, BPC3, and MEC-AA datasets, our method achieved mean AUC scores of 0.747, 0.751, and 0.559, respectively, demonstrating statistically significant improvements (p < 0.05, a paired t-test) over existing approaches. Notably, the lower AUC for MEC-AA may reflect inherent population-specific complexities, as this dataset focuses on African American men, a group historically underrepresented in genomic studies. For interpretability, our framework identified 345, 373, and 437 consensus SNP markers across the PLCO, BPC3, and MEC-AA cohorts, respectively. Key SNPs were further validated against prior research on PCa racial disparities: rs10086908 and rs2273669 (PLCO); rs12284087, rs902774, rs9364554, and rs7611694 (BPC3); and rs3123078 and rs1447295 (MEC-AA) exhibited strong concordance with established loci linked to ethnic-specific risk profiles. For instance, rs1447295 on chromosome 8q24, recurrently associated with African ancestry, underscores the method's ability to recover population-relevant variants.

Conclusion: By synergizing interpretable feature selection with deep learning, this work advances the translation of SNP-based biomarkers into clinically actionable tools while clarifying their contested diagnostic role in PCa.

查看原文本刊更多论文

解码祖先特异性遗传风险：可解释的深度特征选择揭示了前列腺癌SNP在不同人群中的差异。

背景：单核苷酸多态性（snp）在前列腺癌（PCa）诊断中的临床潜力已经通过传统的统计学和机器学习方法进行了广泛的探索。然而，这些方法的预测能力和可解释性仍然不足以用于临床翻译，主要是由于高维SNP数据集的泛化有限。本研究通过将可解释的特征选择与深度学习相结合来提高分类性能和生物学相关性，解决了snp有争议的诊断效用。方法：我们提出了一个可解释的深度特征选择框架，旨在提高SNP标记在区分良性和恶性前列腺癌样本中的分类性能和生物学相关性。本研究通过将特征选择与深度学习相结合来揭示可操作的见解，专门研究了snp在PCa分类中的诊断价值。具体来说，我们的框架包括四个关键部分：(1)启发式特征约简，它在训练深度神经网络（dnn）的梯度计算过程中消除不相关的snp；(2)迭代SNP子集优化，以模型训练时的分类AUC最大化为目标；(3)梯度方差最小化，减轻样本量有限造成的不稳定性；(4)非线性相互作用建模，通过分层表示提取高水平SNP相互作用。结果：在PLCO， BPC3和MEC-AA数据集上进行评估，我们的方法分别获得了0.747,0.751和0.559的平均AUC分数，显示出统计学上显著的改进(p结论：通过将可解释的特征选择与深度学习相结合，这项工作将基于snp的生物标志物转化为临床可操作的工具，同时澄清了它们在PCa中的争议性诊断作用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.