Transfer learning via random forests: A one-shot federated approach

Impact factor 1.5; Zone 3 (Mathematics); JCR Q3 (Computer Science, Interdisciplinary Applications)

Computational Statistics & Data Analysis, Pub Date: 2024-05-06, DOI: 10.1016/j.csda.2024.107975

Pengcheng Xiang, Ling Zhou, Lu Tang

Volume 197, Article 107975. Journal Article. https://www.sciencedirect.com/science/article/pii/S0167947324000598

Citations: 0

Abstract



A one-shot federated transfer learning method using random forests (FTRF) is developed to improve the prediction accuracy at a target data site by leveraging information from auxiliary sites. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as the model trained on the target data alone regardless of possible data heterogeneity, which includes imbalanced and non-IID data distributions across sites and model mis-specification. FTRF has the ability to evaluate the similarity between the target and auxiliary sites, enabling the target site to autonomously select more similar site information to enhance its predictive performance. To ensure communication efficiency, FTRF adopts the model averaging idea that requires a single round of communication between the target and the auxiliary sites. Only fitted models from auxiliary sites are sent to the target site. Unlike traditional model averaging, FTRF incorporates predicted outcomes from other sites and the original variables when estimating model averaging weights, resulting in a variable-dependent weighting to better utilize models from auxiliary sites to improve prediction. Five real-world data examples show that FTRF reduces the prediction error by 2-40% compared to methods not utilizing auxiliary information.
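The one-round workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' FTRF implementation: it is a stacking-style approximation that assumes scikit-learn's `RandomForestRegressor`, a hypothetical data generator `simulate_site`, and hypothetical helpers `fit_local_forest`, `augment`, and `fit_target`. Each auxiliary site fits a forest locally and sends only the fitted model; the target site then fits its own forest on the original variables together with the auxiliary models' predictions, so the way auxiliary predictions are combined can vary with the covariates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def simulate_site(n, shift, rng):
    """Hypothetical data generator: 3 covariates, site-specific shift in the signal."""
    X = rng.normal(size=(n, 3))
    y = np.sin(X[:, 0]) + (1.0 + shift) * X[:, 1] + rng.normal(scale=0.3, size=n)
    return X, y


def fit_local_forest(X, y, seed):
    """Runs at a single site; only the fitted model would leave the site."""
    return RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)


def augment(X, aux_models):
    """Append each auxiliary model's prediction to the original covariates."""
    aux_preds = [m.predict(X) for m in aux_models]
    return np.column_stack([X] + aux_preds)


def fit_target(X_target, y_target, aux_models, seed=0):
    """Fit a forest at the target site on [X, f_1(X), ..., f_K(X)]."""
    Z = augment(X_target, aux_models)
    return RandomForestRegressor(n_estimators=50, random_state=seed).fit(Z, y_target)


rng = np.random.default_rng(0)

# Assumed setup: two auxiliary sites similar to the target, one dissimilar.
aux_data = [simulate_site(300, s, rng) for s in (0.0, 0.1, 2.0)]
aux_models = [fit_local_forest(X, y, seed=k) for k, (X, y) in enumerate(aux_data)]

# Small target training sample and a larger held-out target test sample.
X_tr, y_tr = simulate_site(60, 0.0, rng)
X_te, y_te = simulate_site(500, 0.0, rng)

target_only = fit_local_forest(X_tr, y_tr, seed=0)
combined = fit_target(X_tr, y_tr, aux_models)

mse_target_only = float(np.mean((target_only.predict(X_te) - y_te) ** 2))
mse_combined = float(np.mean((combined.predict(augment(X_te, aux_models)) - y_te) ** 2))
print(f"target-only MSE: {mse_target_only:.3f}, with auxiliary models: {mse_combined:.3f}")
```

The printed errors depend on the simulated data, so no particular output is shown here. The sketch also carries none of the paper's theoretical guarantees; it only illustrates the single communication round and the use of auxiliary predictions alongside the original variables.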


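Source journal

Computational Statistics & Data Analysis (Mathematics; Computer Science, Interdisciplinary Applications)

CiteScore: 3.70

Self-citation rate: 5.60%

Articles published: 167

Review time: 60 days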

About the journal: Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:

I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.

II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures. [...]

III) Special Applications - [...]

IV) Annals of Statistical Data Science [...]