Random-Splitting Random Forest with Multiple Mixed-Data Covariates

Q4 Medicine

Journal of Biostatistics and Epidemiology, Pub Date: 2023-10-31, DOI: 10.18502/jbe.v9i1.13974

Mohammad Fayaz, Alireza Abadi, Soheila Khodakarim


Citations: 0

Abstract



Introduction: Bagging (BG) and random forest (RF) are well-known supervised statistical learning methods based on classification and regression trees. Both can handle different types of responses, such as categorical and continuous ones. Many statistical applications involve curves, time series, functional data, or other observations that are related to each other through their domain. The literature extends RF methods to several cases in which functional data serve as covariates or responses. One of these extensions, random splitting, summarizes functional data into multiple related summary statistics, such as the average.

Methods: This article extends the random-splitting method and introduces the mixed-data BG (MD-BG) and RF (MD-RF) algorithms for multiple functional and non-functional (mixed or hybrid) covariates. The algorithms also calculate the variable importance plot (VIP) for each covariate.

Results: The main difference between MD-BG and MD-RF lies in how covariates are chosen: in MD-BG all covariates remain in the model, whereas MD-RF uses a random sample of the covariates. MD-RF helps to unmask the most important parts of the functional covariates and the most important non-functional covariates.

Conclusion: We apply our methods to two datasets, DTI and Tecator, and compare their performance for continuous and categorical responses with the developed R package (“RSRF”) on GitHub.
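The random-splitting idea summarized in the abstract can be sketched in a few lines. The following is an illustrative Python sketch only, not the authors' RSRF code (which is an R package). It assumes each functional covariate is observed on a common grid, that the summary statistic is the interval mean (the abstract names the average as one option), and that the same random cut points are applied to every observation. All function names (`draw_cuts`, `interval_means`, `candidate_covariates`) and the `n_sampled` parameter are hypothetical. Tree fitting and the variable importance calculation are left out.

```python
import random
from statistics import mean


def draw_cuts(n_points, n_splits, rng):
    """Draw sorted random interior cut positions on a grid of n_points."""
    return sorted(rng.sample(range(1, n_points), n_splits))


def interval_means(curve, cuts):
    """Summarise one discretised curve by the mean of each interval
    delimited by the cut positions."""
    bounds = [0] + list(cuts) + [len(curve)]
    return [mean(curve[a:b]) for a, b in zip(bounds, bounds[1:])]


def candidate_covariates(names, method, rng, n_sampled=None):
    """Bagging-style (as in MD-BG): keep every covariate.
    Forest-style (as in MD-RF): keep a random sample of covariates.
    How and when the sample is drawn is an assumption of this sketch."""
    if method == "bagging":
        return list(names)
    return rng.sample(list(names), n_sampled)


# Toy data: two observations of one functional covariate plus one scalar.
rng = random.Random(0)
curves = [[float(t) for t in range(10)], [float(2 * t) for t in range(10)]]
scalars = [{"age": 41.0}, {"age": 58.0}]

cuts = draw_cuts(n_points=10, n_splits=3, rng=rng)
rows = []
for curve, extra in zip(curves, scalars):
    row = {f"curve_part{k}": m for k, m in enumerate(interval_means(curve, cuts))}
    row.update(extra)  # append the non-functional covariates
    rows.append(row)

names = list(rows[0])
print(candidate_covariates(names, "bagging", rng))
print(candidate_covariates(names, "forest", rng, n_sampled=2))
```

Each interval mean becomes an ordinary covariate sitting next to the non-functional ones, so any standard tree learner could be fitted on `rows`. An importance score attached to `curve_part2`, for example, would then point back to a specific part of the curve's domain.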


Source journal

Journal of Biostatistics and Epidemiology Medicine-Epidemiology

CiteScore

0.80

Self-citation rate

0.00%

Articles published

Review time

12 weeks