Union With Recursive Feature Elimination: A Feature Selection Framework to Improve the Classification Performance of Multicategory Causes of Death in Colorectal Cancer

Impact factor 5.1 · Zone 2 (Medicine) · JCR Q1 (MEDICINE, RESEARCH & EXPERIMENTAL)
Fei Deng, Lin Zhao, Ning Yu, Yuxiang Lin, Lanjing Zhang
Journal: Laboratory Investigation
Publication date: 2023-12-28
Publication type: Journal Article
DOI: 10.1016/j.labinv.2023.100320
URL: https://www.sciencedirect.com/science/article/pii/S0023683723002635

Abstract

Despite the use of machine learning tools, it remains challenging to model cause-specific deaths in colorectal cancer (CRC) patients and to choose appropriate treatments. Here, we propose a feature selection framework, union with recursive feature elimination (U-RFE), to select the union feature sets that are crucial in CRC progression-specific mortality using The Cancer Genome Atlas (TCGA) dataset. Based on the union feature sets, we compared the performance of 5 classification algorithms, namely logistic regression (LR), support vector machines (SVM), random forest (RF), eXtreme gradient boosting (XGBoost), and Stacking, to identify the best model for classifying 4-category deaths. In the first stage of U-RFE, LR, SVM, and RF were each used as the base estimator to obtain subsets containing the same number of features but not exactly the same features. Union analysis of the subsets was then performed to determine the final union feature set, combining the advantages of the different algorithms. We found that the U-RFE framework could improve the performance of various models. Stacking outperformed LR, SVM, RF, and XGBoost in most scenarios. When the target feature number of the RFE was set to 50 and the union feature set contained 298 deterministic features, the Stacking model achieved F1_weighted, Recall_weighted, Precision_weighted, Accuracy, and Matthews correlation coefficient of 0.851, 0.864, 0.854, 0.864, and 0.717, respectively. The performance on the minority categories was also significantly improved. Therefore, this recursive feature elimination–based approach to feature selection improves the performance of classifying CRC deaths using clinical and omics data, or of classification using other data with high feature redundancy and imbalance.
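
The two-stage idea described in the abstract (run RFE once per base estimator, take the union of the selected subsets, then classify on the union) might be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it assumes scikit-learn, uses synthetic imbalanced data in place of the TCGA data, and the helper name u_rfe, the linear SVM kernel, the RFE step size, the target of 10 features, and the composition of the Stacking model are all assumptions that the abstract does not specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def u_rfe(X, y, n_features, step=5):
    """Hypothetical helper: union of the feature subsets chosen by three RFE runs."""
    base_estimators = [
        LogisticRegression(max_iter=1000),
        SVC(kernel="linear"),  # assumption: RFE needs coef_, so a linear kernel
        RandomForestClassifier(n_estimators=100, random_state=0),
    ]
    union = set()
    for estimator in base_estimators:
        rfe = RFE(estimator, n_features_to_select=n_features, step=step)
        rfe.fit(X, y)
        # support_ is a boolean mask over columns; keep the selected indices.
        union.update(np.flatnonzero(rfe.support_).tolist())
    return sorted(union)


# Synthetic, imbalanced 4-class data standing in for a clinical/omics matrix.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=10, n_redundant=20,
    n_classes=4, weights=[0.6, 0.2, 0.1, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Stage 1: select features on the training split only, to avoid leakage.
n_target = 10
selected = u_rfe(X_train, y_train, n_target)

# Stage 2: fit a Stacking model on the union feature set (composition assumed).
stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(kernel="linear")),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train[:, selected], y_train)
y_pred = stack.predict(X_test[:, selected])

f1_w = f1_score(y_test, y_pred, average="weighted")
mcc = matthews_corrcoef(y_test, y_pred)
print(f"union size: {len(selected)}, F1_weighted: {f1_w:.3f}, MCC: {mcc:.3f}")
```

In this sketch the union holds between n_target and 3 × n_target features, depending on how much the three base estimators agree. The weighted F1 score and the Matthews correlation coefficient are reported because both appear among the metrics in the abstract and both are informative under class imbalance.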

Source journal
Laboratory Investigation (Medicine: Pathology)
CiteScore: 8.30
Self-citation rate: 0.00%
Articles published: 125
Review time: 2 months
About the journal: Laboratory Investigation is an international journal owned by the United States and Canadian Academy of Pathology. Laboratory Investigation offers prompt publication of high-quality original research in all biomedical disciplines relating to the understanding of human disease and the application of new methods to the diagnosis of disease. Both human and experimental studies are welcome.