Zhong Chen, Zichen Lao, You Lu, Wensheng Zhang, Andrea Edwards, Kun Zhang
Title: Decoding ancestry-specific genetic risk: interpretable deep feature selection reveals prostate cancer SNP disparities in diverse populations.
Journal: BioData Mining, 18(1): 66 (Journal Article)
Published: 2025-09-29
DOI: 10.1186/s13040-025-00470-9 (https://doi.org/10.1186/s13040-025-00470-9)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12481780/pdf/
Impact factor: 6.1; JCR: Q1 (Mathematical & Computational Biology)
Citation count: 0
Abstract
Background: The clinical potential of single nucleotide polymorphisms (SNPs) in prostate cancer (PCa) diagnosis has been extensively explored using conventional statistical and machine learning approaches. However, the predictive power and interpretability of these methods remain inadequate for clinical translation, primarily due to limited generalization across high-dimensional SNP datasets. This study addresses the contested diagnostic utility of SNPs by integrating interpretable feature selection with deep learning to enhance both classification performance and biological relevance.
Methods: We propose an interpretable deep feature selection framework that aims to improve both the classification performance and the biological relevance of SNP markers for distinguishing between benign and malignant prostate samples. The framework comprises four key components: (1) heuristic feature reduction, which eliminates irrelevant SNPs during gradient computation when training deep neural networks (DNNs); (2) iterative SNP subset optimization, which aims to maximize classification AUC during model training; (3) gradient variance minimization, which mitigates the instability caused by limited sample sizes; and (4) nonlinear interaction modeling, which extracts high-level SNP interactions through hierarchical representations.
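To make component (2) more concrete, the sketch below computes AUC with the rank (Mann-Whitney) formulation and runs a greedy forward search for a SNP subset that maximizes it. This is an illustrative stand-in, not the authors' DNN-based procedure: the additive risk score, the function names `auc` and `greedy_select`, and the toy genotype matrix are all assumptions made for this example.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Over all (case, control) pairs, counts how often the case scores
    higher than the control; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def greedy_select(genotypes, labels, max_snps):
    """Greedy forward selection of SNP columns that maximizes AUC.

    genotypes: list of rows, each a list of minor-allele counts (0, 1, 2).
    The risk score is the plain sum of the selected columns. That is an
    assumption made for brevity; the paper trains a deep network instead.
    """
    n_snps = len(genotypes[0])
    selected, best = [], 0.5  # 0.5 is the AUC of a constant score
    while len(selected) < max_snps:
        candidates = []
        for j in range(n_snps):
            if j in selected:
                continue
            cols = selected + [j]
            scores = [sum(row[c] for c in cols) for row in genotypes]
            candidates.append((auc(scores, labels), j))
        if not candidates:
            break
        # Highest AUC wins; ties go to the lowest column index.
        top_auc, top_j = max(candidates, key=lambda c: (c[0], -c[1]))
        if top_auc <= best:  # stop when no remaining SNP improves the AUC
            break
        selected.append(top_j)
        best = top_auc
    return selected, best


# Toy data (hypothetical): 6 samples x 3 SNPs; label 1 = case, 0 = control.
X = [
    [2, 0, 1],
    [1, 1, 1],
    [2, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
]
y = [1, 1, 1, 0, 0, 0]
selected, best_auc = greedy_select(X, y, max_snps=2)
print(selected, best_auc)
```

A greedy search is used here only because it is short and deterministic. It evaluates AUC on the same samples it selects on, so in practice the selection would need to sit inside a cross-validation loop to avoid an optimistic estimate.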
Results: Evaluated on the PLCO, BPC3, and MEC-AA datasets, our method achieved mean AUC scores of 0.747, 0.751, and 0.559, respectively, a statistically significant improvement over existing approaches (p < 0.05, paired t-test). The lower AUC for MEC-AA may reflect population-specific complexities, as this dataset focuses on African American men, a group historically underrepresented in genomic studies. For interpretability, our framework identified 345, 373, and 437 consensus SNP markers in the PLCO, BPC3, and MEC-AA cohorts, respectively. Key SNPs were further checked against prior research on PCa racial disparities: rs10086908 and rs2273669 (PLCO); rs12284087, rs902774, rs9364554, and rs7611694 (BPC3); and rs3123078 and rs1447295 (MEC-AA) showed strong concordance with established loci linked to ethnicity-specific risk profiles. For instance, the recovery of rs1447295 on chromosome 8q24, a variant recurrently associated with African ancestry, underscores the method's ability to identify population-relevant variants.
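The two evaluation ideas in the results, a paired comparison of AUC scores and a consensus set of SNP markers, can be sketched as follows. The abstract does not define how consensus markers are derived, so the rule used here (a SNP counts as consensus if it is selected in at least `min_runs` runs) is an assumption, as are the function names, the AUC values, and the SNP identifiers, none of which come from the paper.

```python
import math
from collections import Counter


def paired_t_statistic(a, b):
    """Return the t statistic and degrees of freedom for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1


def consensus_snps(selections, min_runs):
    """Return the SNP IDs that were selected in at least min_runs runs."""
    counts = Counter(snp for run in selections for snp in set(run))
    return sorted(snp for snp, c in counts.items() if c >= min_runs)


# Hypothetical per-fold AUCs for two methods evaluated on the same folds.
proposed = [0.75, 0.74, 0.76, 0.73, 0.75]
baseline = [0.70, 0.71, 0.72, 0.70, 0.71]
t, df = paired_t_statistic(proposed, baseline)
print(f"t = {t:.3f}, df = {df}")

# Hypothetical SNP selections from three training runs.
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_a", "snp_c"],
    ["snp_a", "snp_b", "snp_d"],
]
print(consensus_snps(runs, min_runs=2))
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` performs the full paired test if SciPy is available.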
Conclusion: By combining interpretable feature selection with deep learning, this work advances the translation of SNP-based biomarkers into clinically actionable tools while clarifying their contested diagnostic role in PCa.
About the journal:
BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data.
Topical areas include, but are not limited to:
-Development, evaluation, and application of novel data mining and machine learning algorithms.
-Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
-Open-source software for the application of data mining and machine learning algorithms.
-Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
-Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.