Identification of representative trees in random forests based on a new tree-based distance measure

IF 1.4, Zone 4 (Computer Science), Q2 (Statistics & Probability)

Advances in Data Analysis and Classification Pub Date : 2022-08-19 DOI:10.1101/2022.05.15.492004

Björn-Hergen Laabs von Holt, A. Westenberger, I. König



Abstract

In the life sciences, random forests are often used to train predictive models. However, gaining explanatory insight into the mechanics leading to a specific outcome is difficult, which impedes the adoption of random forests in clinical practice. By simplifying a complex ensemble of decision trees to a single most representative tree, it is assumed to be possible to observe common tree structures, the importance of specific features, and variable interactions. Thus, representative trees could also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. We therefore developed a new tree-based distance measure that incorporates more of the underlying tree structure than other metrics. We compared our new method with the existing metrics in an extensive simulation study and applied it to predict age at onset based on a set of genetic risk factors in a clinical data set. The simulation study showed the advantages of our weighted splitting variable approach. The real-data application revealed that representative trees not only replicate the results of a recent genome-wide association study but can also give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
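The selection rule stated in the abstract (the representative tree is the one with minimal distance to all other trees) can be illustrated with a minimal Python sketch. It is not the timbR implementation and does not reproduce the weighted measure proposed in the paper. As a stand-in it assumes a simple unweighted distance: the fraction of features used as a splitting variable by exactly one of the two trees. The function names, the toy forest, and the summary of each tree as a set of feature indices are all hypothetical.

```python
from itertools import combinations


def split_variable_distance(vars_a, vars_b, n_features):
    """Fraction of features used as a splitting variable by exactly one tree.

    Assumption: a simple unweighted stand-in, not the paper's weighted measure.
    """
    return len(set(vars_a) ^ set(vars_b)) / n_features


def most_representative_tree(trees, n_features):
    """Return the index of the tree with the smallest summed distance to all others."""
    totals = [0.0] * len(trees)
    for i, j in combinations(range(len(trees)), 2):
        d = split_variable_distance(trees[i], trees[j], n_features)
        totals[i] += d
        totals[j] += d
    return min(range(len(trees)), key=totals.__getitem__)


# Hypothetical forest: each tree is summarised by the feature indices it splits on.
forest = [{0, 1, 2}, {0, 1}, {0, 1, 3}, {4, 5}]
best = most_representative_tree(forest, n_features=6)
print(best, sorted(forest[best]))
```

In this toy forest the second tree (index 1) is selected, because its splitting variables overlap most with those of the other trees. Any other pairwise tree distance could be substituted for `split_variable_distance` without changing the selection step.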


Source journal: Advances in Data Analysis and Classification (Statistics & Probability)

CiteScore: 3.40

Self-citation rate: 6.20%

Review time: >12 weeks

About the journal: The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high-standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on topics such as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data; and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline and illuminate the basic ideas and techniques of special approaches.