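Dual-stage optimizer for systematic overestimation adjustment applied to multi-objective genetic algorithms for biomarker selection.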

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics Pub Date : 2024-11-22 DOI:10.1093/bib/bbae674

Luca Cattelani, Vittorio Fortino

Citations: 0

Abstract

The selection of biomarker panels in omics data, challenged by numerous molecular features and limited samples, often requires machine learning methods paired with wrapper feature selection techniques such as genetic algorithms. These techniques test various feature sets (potential biomarker solutions) to fine-tune a machine learning model's performance on supervised tasks, such as classifying cancer subtypes. This optimization uses validation sets to evaluate and identify the most effective feature combinations. Evaluations carry a performance estimation error, measurable as the discrepancy between validation and test set performance, and when the selection involves many models, the best ones are almost certainly overestimated. This issue is also relevant in multi-objective feature selection, where several characteristics of the biomarker panels are optimized, such as predictive performance and feature set size. Methods have been proposed to reduce the overestimation after a model has already been selected in single-objective problems, but no algorithm existed that could reduce the overestimation during the optimization, improve model selection, or be applied in the more general multi-objective domain. We propose Dual-stage Optimizer for Systematic overestimation Adjustment in Multi-Objective problems (DOSA-MO), a novel multi-objective optimization wrapper algorithm that learns how the original estimation, its variance, and the feature set size of the solutions predict the overestimation. DOSA-MO adjusts the expectation of the performance during the optimization, improving the composition of the solution set. We verify that DOSA-MO improves the performance of a state-of-the-art genetic algorithm on left-out or external sample sets, when predicting cancer subtypes and/or patient overall survival, using three transcriptomics datasets for kidney and breast cancer.
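The overestimation described in the abstract can be illustrated with a small simulation. The sketch below is not the DOSA-MO implementation: it assumes hypothetical candidate models with true accuracies between 0.60 and 0.75, uses a single predictor (the validation estimate itself) in a plain least-squares fit, and omits the variance and feature set size predictors that the abstract mentions. All function names and parameters (`noisy_accuracy`, `fit_line`, `simulate_models`, `demo`) are illustrative assumptions.

```python
import random
import statistics


def noisy_accuracy(rng, true_acc, n_samples):
    """Accuracy observed on n_samples predictions, each correct with probability true_acc."""
    hits = sum(rng.random() < true_acc for _ in range(n_samples))
    return hits / n_samples


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def simulate_models(rng, n_models, n_val, n_test):
    """Hypothetical candidates as (validation score, held-out score) pairs."""
    pairs = []
    for _ in range(n_models):
        true_acc = rng.uniform(0.60, 0.75)  # assumed range, for illustration only
        pairs.append((noisy_accuracy(rng, true_acc, n_val),
                      noisy_accuracy(rng, true_acc, n_test)))
    return pairs


def demo(seed=0, n_models=300, n_val=40, n_test=2000):
    rng = random.Random(seed)

    # Step A (assumed setup): gather (validation, held-out) pairs and learn
    # how the validation estimate predicts the overestimation (val - held-out).
    train = simulate_models(rng, n_models, n_val, n_test)
    slope, intercept = fit_line([v for v, _ in train],
                                [v - t for v, t in train])

    # Step B: run a fresh search and select the best validation score,
    # which tends to pick up favourable noise from the small validation set.
    search = simulate_models(rng, n_models, n_val, n_test)
    best_val, best_test = max(search, key=lambda pair: pair[0])

    raw_gap = best_val - best_test
    adjusted_val = best_val - (slope * best_val + intercept)
    adjusted_gap = adjusted_val - best_test
    return raw_gap, adjusted_gap, slope


raw_gap, adjusted_gap, slope = demo()
print(f"overestimation of the selected model, raw:      {raw_gap:+.3f}")
print(f"overestimation of the selected model, adjusted: {adjusted_gap:+.3f}")
```

With a single predictor and a fitted slope below 1, the adjustment is a monotone transform of the validation score, so it corrects the estimate without changing which candidate is selected. Changing the ranking of candidates would require additional predictors, such as the variance and feature set size named in the abstract.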

Source journal

Briefings in bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 13.20

Self-citation rate: 13.70%

Articles published: 549

Review time: 6 months

About the journal: Briefings in Bioinformatics is an international journal serving as a platform for researchers and educators in the life sciences. It also appeals to mathematicians, statisticians, and computer scientists applying their expertise to biological challenges. The journal focuses on reviews tailored for users of databases and analytical tools in contemporary genetics, molecular and systems biology. It stands out by offering practical assistance and guidance to non-specialists in computerized methodologies. Covering a wide range from introductory concepts to specific protocols and analyses, the papers address bacterial, plant, fungal, animal, and human data. The journal's detailed subject areas include genetic studies of phenotypes and genotypes, mapping, DNA sequencing, expression profiling, gene expression studies, microarrays, alignment methods, protein profiles and HMMs, lipids, metabolic and signaling pathways, structure determination and function prediction, phylogenetic studies, and education and training.