Multi forests: Variable importance for multi-class outcomes

arXiv - STAT - Methodology | Pub Date: 2024-09-13 | DOI: arxiv-2409.08925

Roman Hornung (Institute for Medical Information Processing, Biometry and Epidemiology, LMU Munich, Munich, Germany; Munich Center for Machine Learning)
Alexander Hapfelmeier (Institute of AI and Informatics in Medicine, TUM School of Medicine and Health, Technical University of Munich, Munich, Germany)

Abstract

In prediction tasks with multi-class outcomes, identifying covariates specifically associated with one or more outcome classes can be important. Conventional variable importance measures (VIMs) from random forests (RFs), such as permutation and Gini importance, focus on overall predictive performance or node purity, without differentiating between the classes. Therefore, they can be expected to fail to distinguish class-associated covariates from covariates that only distinguish between groups of classes. We introduce a VIM called multi-class VIM, tailored for identifying exclusively class-associated covariates, via a novel RF variant called multi forests (MuFs). The trees in MuFs use both multi-way and binary splitting. The multi-way splits generate child nodes for each class, using a split criterion that evaluates how well these nodes represent their respective classes. This setup forms the basis of the multi-class VIM, which measures the discriminatory ability of the splits performed in the respective covariates with regard to this split criterion. Alongside the multi-class VIM, we introduce a second VIM, the discriminatory VIM. This measure, based on the binary splits, assesses the strength of the general influence of the covariates, irrespective of their class-associatedness. Simulation studies demonstrate that the multi-class VIM specifically ranks class-associated covariates highly, unlike conventional VIMs, which also rank other types of covariates highly. Analyses of 121 datasets reveal that MuFs often have slightly lower predictive performance than conventional RFs. This is, however, not a limiting factor given the algorithm's primary purpose of calculating the multi-class VIM.
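
The motivating claim, that node-purity criteria do not separate class-associated covariates from covariates that only distinguish groups of classes, can be illustrated with a small toy calculation. The sketch below is not the MuF algorithm or its split criterion, which the abstract does not specify. It only computes the standard Gini impurity decrease of a binary split, on hypothetical data with four balanced classes and two made-up splits: one isolating class A (class-associated) and one separating {A, B} from {C, D} (groups of classes).

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(labels, goes_left):
    """Impurity decrease of a binary split.

    goes_left[i] is True if observation i is sent to the left child.
    """
    left = [y for y, is_left in zip(labels, goes_left) if is_left]
    right = [y for y, is_left in zip(labels, goes_left) if not is_left]
    n = len(labels)
    weighted_child_impurity = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(labels) - weighted_child_impurity


# Hypothetical toy outcome: four classes with 25 observations each.
labels = [c for c in "ABCD" for _ in range(25)]

# Hypothetical split on a class-associated covariate: A versus {B, C, D}.
split_class_associated = [y == "A" for y in labels]

# Hypothetical split on a covariate that only separates groups: {A, B} versus {C, D}.
split_class_groups = [y in ("A", "B") for y in labels]

d1 = gini_decrease(labels, split_class_associated)
d2 = gini_decrease(labels, split_class_groups)
print(f"Gini decrease, A vs rest:       {d1:.4f}")
print(f"Gini decrease, {{A,B}} vs {{C,D}}: {d2:.4f}")
```

In this balanced toy setting both splits yield the same Gini decrease (0.25 up to floating-point error), so a purity-based importance alone gives no indication of which covariate is tied to a single class. This is only one constructed case, not evidence about the behavior of the measures in general.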
