{"title":"Optimizing Model Performance and Interpretability: Application to Biological Data Classification.","authors":"Zhenyu Huang, Xuechen Mu, Yangkun Cao, Qiufen Chen, Siyu Qiao, Bocheng Shi, Gangyi Xiao, Yan Wang, Ying Xu","doi":"10.3390/genes16030297","DOIUrl":null,"url":null,"abstract":"<p><p>This study introduces a novel framework that simultaneously addresses the challenges of performance accuracy and result interpretability in transcriptomic-data-based classification. <b>Background/objectives</b>: In biological data classification, it is challenging to achieve both high performance accuracy and interpretability at the same time. This study presents a framework to address both challenges in transcriptomic-data-based classification. The goal is to select features, models, and a meta-voting classifier that optimizes both classification performance and interpretability. <b>Methods</b>: The framework consists of a four-step feature selection process: (1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability; (2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix; (3) the selection of minimal sets of genes, whose collective discerning power covers 95% of the pathway-based discerning power; and (4) the introduction of adversarial samples to identify and filter genes sensitive to such samples. Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed based on the optimized model results. <b>Results</b>: The framework applied to two cancer classification problems showed that in the binary classification, the prediction performance was comparable to the full-gene model, with F1-score differences of between -5% and 5%. In the ternary classification, the performance was significantly better, with F1-score differences ranging from -2% to 12%, while also maintaining excellent interpretability of the selected feature genes. <b>Conclusions</b>: This framework effectively integrates feature selection, adversarial sample handling, and model optimization, offering a valuable tool for a wide range of biological data classification problems. Its ability to balance performance accuracy and high interpretability makes it highly applicable in the field of computational biology.</p>","PeriodicalId":12688,"journal":{"name":"Genes","volume":"16 3","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11942234/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Genes","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.3390/genes16030297","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"GENETICS & HEREDITY","Score":null,"Total":0}
引用次数: 0
Abstract
This study introduces a novel framework that simultaneously addresses the challenges of performance accuracy and result interpretability in transcriptomic-data-based classification. Background/objectives: In biological data classification, it is challenging to achieve both high performance accuracy and interpretability at the same time. This study presents a framework to address both challenges in transcriptomic-data-based classification. The goal is to select features, models, and a meta-voting classifier that optimizes both classification performance and interpretability. Methods: The framework consists of a four-step feature selection process: (1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability; (2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix; (3) the selection of minimal sets of genes, whose collective discerning power covers 95% of the pathway-based discerning power; and (4) the introduction of adversarial samples to identify and filter genes sensitive to such samples. Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed based on the optimized model results. Results: The framework applied to two cancer classification problems showed that in the binary classification, the prediction performance was comparable to the full-gene model, with F1-score differences of between -5% and 5%. In the ternary classification, the performance was significantly better, with F1-score differences ranging from -2% to 12%, while also maintaining excellent interpretability of the selected feature genes. Conclusions: This framework effectively integrates feature selection, adversarial sample handling, and model optimization, offering a valuable tool for a wide range of biological data classification problems. Its ability to balance performance accuracy and high interpretability makes it highly applicable in the field of computational biology.
期刊介绍:
Genes (ISSN 2073-4425) is an international, peer-reviewed open access journal which provides an advanced forum for studies related to genes, genetics and genomics. It publishes reviews, research articles, communications and technical notes. There is no restriction on the length of the papers and we encourage scientists to publish their results in as much detail as possible.