Optimizing Model Performance and Interpretability: Application to Biological Data Classification

IF 2.8 | Zone 3 (Biology) | JCR Q2, GENETICS & HEREDITY

Genes | Pub Date: 2025-02-28 | DOI: 10.3390/genes16030297

Zhenyu Huang, Xuechen Mu, Yangkun Cao, Qiufen Chen, Siyu Qiao, Bocheng Shi, Gangyi Xiao, Yan Wang, Ying Xu


Citations: 0

Abstract


Optimizing Model Performance and Interpretability: Application to Biological Data Classification.

This study introduces a novel framework that simultaneously addresses the challenges of performance accuracy and result interpretability in transcriptomic-data-based classification.

Background/Objectives: In biological data classification, it is challenging to achieve high accuracy and interpretability at the same time. This study presents a framework that addresses both challenges in transcriptomic-data-based classification. The goal is to select features, models, and a meta-voting classifier that together optimize both classification performance and interpretability.

Methods: The framework consists of a four-step feature selection process:
(1) the identification of metabolic pathways whose enzyme-gene expressions discriminate samples with different labels, aiding interpretability;
(2) the selection of pathways whose expression variance is largely captured by the first principal component of the gene expression matrix;
(3) the selection of minimal sets of genes whose collective discerning power covers 95% of the pathway-based discerning power; and
(4) the introduction of adversarial samples to identify and filter out genes that are sensitive to such samples.
Additionally, adversarial samples are used to select the optimal classification model, and a meta-voting classifier is constructed from the optimized model results.

Results: Applied to two cancer classification problems, the framework achieved prediction performance comparable to the full-gene model in the binary classification, with F1-score differences between -5% and 5%. In the ternary classification, the performance was significantly better, with F1-score differences ranging from -2% to 12%, while also maintaining excellent interpretability of the selected feature genes.

Conclusions: This framework effectively integrates feature selection, adversarial sample handling, and model optimization, offering a valuable tool for a wide range of biological data classification problems. Its ability to balance performance accuracy and high interpretability makes it highly applicable in the field of computational biology.
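Step (2) of the feature selection process described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the abstract does not state a cutoff or the exact procedure, so the function names `pc1_variance_ratio` and `select_pathways`, the 0.5 threshold, and the toy data are all assumptions made for illustration. The sketch computes, for each pathway, the fraction of expression variance captured by the first principal component and keeps the pathways above the threshold.

```python
import numpy as np


def pc1_variance_ratio(expr: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component.

    expr is a (samples x genes) expression matrix for the genes of one pathway.
    """
    # Center each gene (column) so the SVD corresponds to a PCA.
    centered = expr - expr.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to the variance of each component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    total = float(np.sum(singular_values ** 2))
    if total == 0.0:
        return 0.0  # constant matrix: no variance to explain
    return float(singular_values[0] ** 2 / total)


def select_pathways(pathways: dict, threshold: float = 0.5) -> list:
    """Keep pathways whose PC1 explains at least `threshold` of the variance.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return [
        name
        for name, expr in pathways.items()
        if pc1_variance_ratio(expr) >= threshold
    ]


# Toy data (hypothetical): one pathway driven by a shared factor, one pure noise.
rng = np.random.default_rng(0)
driver = rng.normal(size=(40, 1))
coherent = driver @ np.ones((1, 6)) + 0.1 * rng.normal(size=(40, 6))
noisy = rng.normal(size=(40, 6))

toy_pathways = {"coherent_pathway": coherent, "noisy_pathway": noisy}
selected = select_pathways(toy_pathways, threshold=0.5)
print(selected)
```

In this toy setting the pathway whose genes share a common driver passes the filter, while the pathway made of independent noise does not. A real analysis would need a threshold and a normalization scheme chosen to match the paper's full text.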


Source journal

Genes (GENETICS & HEREDITY)

CiteScore: 5.20
Self-citation rate: 5.70%
Articles published: 1975
Review time: 22.94 days

About the journal: Genes (ISSN 2073-4425) is an international, peer-reviewed, open access journal that provides an advanced forum for studies related to genes, genetics, and genomics. It publishes reviews, research articles, communications, and technical notes. There is no restriction on the length of papers, and scientists are encouraged to publish their results in as much detail as possible.