Identification of representative trees in random forests based on a new tree-based distance measure

Impact factor 1.4; Zone 4 (Computer Science); JCR Q2 (Statistics & Probability)
Björn-Hergen Laabs von Holt, A. Westenberger, I. König
Journal: Advances in Data Analysis and Classification, vol. 30(5), pp. 1-18. Published 2022-08-19. DOI: https://doi.org/10.1101/2022.05.15.492004

Abstract

In the life sciences, random forests are often used to train predictive models. However, gaining explanatory insight into the mechanisms leading to a specific outcome is difficult, which impedes the adoption of random forests in clinical practice. Simplifying a complex ensemble of decision trees to a single most representative tree is assumed to make it possible to observe common tree structures, the importance of specific features, and variable interactions. Representative trees could thus also help to understand interactions between genetic variants. Intuitively, representative trees are those with the minimal distance to all other trees, which requires a proper definition of the distance between two trees. We therefore developed a new tree-based distance measure that incorporates more of the underlying tree structure than other metrics. We compared our new method with existing metrics in an extensive simulation study and applied it to predict the age at onset based on a set of genetic risk factors in a clinical data set. The simulation study showed the advantages of our weighted splitting variable approach. The real data application revealed that representative trees not only replicate the results of a recent genome-wide association study but can also give additional explanations of the genetic mechanisms. Finally, we implemented all compared distance measures in R and made them publicly available in the R package timbR (https://github.com/imbs-hl/timbR).
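The selection rule stated in the abstract (the representative tree is the one with minimal distance to all other trees) can be illustrated with a minimal Python sketch. This is not the authors' weighted splitting variable measure and does not use the timbR API. As an assumption for illustration only, each tree is reduced to the set of feature indices it splits on, and the distance between two trees is the fraction of features used by exactly one of them. The function names and the toy forest are hypothetical.

```python
from itertools import combinations


def split_variable_distance(vars_a, vars_b, n_features):
    """Illustrative distance (an assumption, not the paper's measure):
    fraction of features used as a splitting variable in exactly one tree."""
    return len(set(vars_a) ^ set(vars_b)) / n_features


def most_representative_tree(trees, n_features):
    """Return the index of the tree with the smallest summed distance
    to all other trees in the forest."""
    totals = [0.0] * len(trees)
    for i, j in combinations(range(len(trees)), 2):
        d = split_variable_distance(trees[i], trees[j], n_features)
        totals[i] += d
        totals[j] += d
    return min(range(len(trees)), key=totals.__getitem__)


# Hypothetical toy forest: each tree is the set of feature indices it splits on.
forest = [{0, 1, 2}, {0, 1}, {0, 1, 3}, {4}]
print(most_representative_tree(forest, n_features=5))
```

Any other pairwise tree distance could be substituted for `split_variable_distance` without changing the selection step, which is why the choice of distance measure matters for which tree ends up being called representative.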
About the journal: The international journal Advances in Data Analysis and Classification (ADAC) is designed as a forum for high-standard publications on research and applications concerning the extraction of knowable aspects from many types of data. It publishes articles on such topics as structural, quantitative, or statistical approaches for the analysis of data; advances in classification, clustering, and pattern recognition methods; strategies for modeling complex data and mining large data sets; methods for the extraction of knowledge from data; and applications of advanced methods in specific domains of practice. Articles illustrate how new domain-specific knowledge can be made available from data by skillful use of data analysis methods. The journal also publishes survey papers that outline and illuminate the basic ideas and techniques of special approaches.