Multi forests: Variable importance for multi-class outcomes

Roman Hornung (Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany; Munich Center for Machine Learning), Alexander Hapfelmeier (Institute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany)
{"title":"Multi forests: Variable importance for multi-class outcomes","authors":"Roman HornungInstitute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, GermanyMunich Center for Machine Learning, Alexander HapfelmeierInstitute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany","doi":"arxiv-2409.08925","DOIUrl":null,"url":null,"abstract":"In prediction tasks with multi-class outcomes, identifying covariates\nspecifically associated with one or more outcome classes can be important.\nConventional variable importance measures (VIMs) from random forests (RFs),\nlike permutation and Gini importance, focus on overall predictive performance\nor node purity, without differentiating between the classes. Therefore, they\ncan be expected to fail to distinguish class-associated covariates from\ncovariates that only distinguish between groups of classes. We introduce a VIM\ncalled multi-class VIM, tailored for identifying exclusively class-associated\ncovariates, via a novel RF variant called multi forests (MuFs). The trees in\nMuFs use both multi-way and binary splitting. The multi-way splits generate\nchild nodes for each class, using a split criterion that evaluates how well\nthese nodes represent their respective classes. This setup forms the basis of\nthe multi-class VIM, which measures the discriminatory ability of the splits\nperformed in the respective covariates with regard to this split criterion.\nAlongside the multi-class VIM, we introduce a second VIM, the discriminatory\nVIM. This measure, based on the binary splits, assesses the strength of the\ngeneral influence of the covariates, irrespective of their\nclass-associatedness. Simulation studies demonstrate that the multi-class VIM\nspecifically ranks class-associated covariates highly, unlike conventional VIMs\nwhich also rank other types of covariates highly. 
Analyses of 121 datasets\nreveal that MuFs often have slightly lower predictive performance compared to\nconventional RFs. This is, however, not a limiting factor given the algorithm's\nprimary purpose of calculating the multi-class VIM.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"2 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Methodology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08925","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), such as permutation and Gini importance, focus on overall predictive performance or node purity without differentiating between the classes. They can therefore be expected to fail to distinguish class-associated covariates from covariates that only discriminate between groups of classes. We introduce a VIM, the multi-class VIM, tailored to identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate a child node for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability, with regard to this split criterion, of the splits performed in the respective covariates. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the covariates' general influence, irrespective of whether they are class-associated. Simulation studies demonstrate that, unlike conventional VIMs, the multi-class VIM specifically ranks class-associated covariates highly rather than also ranking other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance than conventional RFs. However, this is not a limiting factor given the algorithm's primary purpose of calculating the multi-class VIM.
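The idea of a multi-way split with one child node per class can be illustrated with a toy sketch. The snippet below is not the paper's algorithm: it assumes a simplified split criterion (the average, over classes, of the fraction of each class's samples falling into the child interval assigned to that class) and places cutpoints deterministically at midpoints between per-class covariate means; the names `multiway_split_score` and `toy_multiclass_vim` are illustrative, not from the paper.

```python
import numpy as np

def multiway_split_score(x, y, ordered_classes, cutpoints):
    """Score a multi-way split of covariate x with one child interval
    per class: the average, over classes, of the fraction of that
    class's samples landing in the interval assigned to it.

    A simplified stand-in for the paper's split criterion.
    """
    bins = np.digitize(x, cutpoints)  # interval index of each sample
    fractions = [np.mean(bins[y == c] == k)
                 for k, c in enumerate(ordered_classes)]
    return float(np.mean(fractions))

def toy_multiclass_vim(X, y):
    """Toy per-covariate importance: score one multi-way split per
    covariate, ordering classes by their covariate means and placing
    cutpoints midway between consecutive class means."""
    classes = np.unique(y)
    vim = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        x = X[:, j]
        means = np.array([x[y == c].mean() for c in classes])
        order = np.argsort(means)
        sorted_means = means[order]
        cutpoints = (sorted_means[:-1] + sorted_means[1:]) / 2.0
        vim[j] = multiway_split_score(x, y, classes[order], cutpoints)
    return vim
```

A covariate whose values cluster around a distinct center for every class scores near 1, while a covariate that separates only groups of classes, or none, scores lower. The actual MuF procedure aggregates such split scores over many randomized splits across the trees of the forest.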