分布式异构迁移学习

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Research Pub Date : 2024-05-14 DOI:10.1016/j.bdr.2024.100456

Paolo Mignone , Gianvito Pio , Michelangelo Ceci

{"title":"分布式异构迁移学习","authors":"Paolo Mignone , Gianvito Pio , Michelangelo Ceci","doi":"10.1016/j.bdr.2024.100456","DOIUrl":null,"url":null,"abstract":"<div>Transfer learning has proved to be effective for building predictive models even in complex conditions with a low amount of available labeled data, by constructing a predictive model for a target domain also using the knowledge coming from a separate domain, called source domain. However, several existing transfer learning methods assume identical feature spaces between the source and the target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually i) make other assumptions on the features, such as requiring the same number of features, ii) are not generally able to distribute the workload over multiple computational nodes, iii) cannot work in the Positive-Unlabeled (PU) learning setting, which we also considered in this study, or iv) their applicability is limited to specific application domains, i.e., they are not general-purpose methods.In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. Moreover, our distributed approach allows us to process large source and target datasets.Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach is able to outperform 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performances in terms of scalability.</div>","PeriodicalId":56017,"journal":{"name":"Big Data Research","volume":"37 ","pages":"Article 100456"},"PeriodicalIF":4.2000,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2214579624000327/pdfft?md5=33cf99e10874514291bfc635b26d260f&pid=1-s2.0-S2214579624000327-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Distributed Heterogeneous Transfer Learning\",\"authors\":\"Paolo Mignone , Gianvito Pio , Michelangelo Ceci\",\"doi\":\"10.1016/j.bdr.2024.100456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>Transfer learning has proved to be effective for building predictive models even in complex conditions with a low amount of available labeled data, by constructing a predictive model for a target domain also using the knowledge coming from a separate domain, called source domain. However, several existing transfer learning methods assume identical feature spaces between the source and the target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually i) make other assumptions on the features, such as requiring the same number of features, ii) are not generally able to distribute the workload over multiple computational nodes, iii) cannot work in the Positive-Unlabeled (PU) learning setting, which we also considered in this study, or iv) their applicability is limited to specific application domains, i.e., they are not general-purpose methods.In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. Moreover, our distributed approach allows us to process large source and target datasets.Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach is able to outperform 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performances in terms of scalability.</div>\",\"PeriodicalId\":56017,\"journal\":{\"name\":\"Big Data Research\",\"volume\":\"37 \",\"pages\":\"Article 100456\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2024-05-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2214579624000327/pdfft?md5=33cf99e10874514291bfc635b26d260f&pid=1-s2.0-S2214579624000327-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data Research\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2214579624000327\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data Research","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2214579624000327","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

事实证明，迁移学习可以有效地构建预测模型，即使在可用标注数据较少的复杂条件下，也能利用来自另一个领域（称为源领域）的知识构建目标领域的预测模型。然而，现有的几种迁移学习方法都假设源域和目标域的特征空间完全相同。这一假设限制了此类方法在现实世界中的应用，因为两个独立的领域虽然相关，但可能由完全不同的特征空间来描述。异构迁移学习方法旨在克服这一限制，但它们通常 i) 对特征做出其他假设，如要求特征数量相同；ii) 通常无法在多个计算节点上分配工作量；iii) 无法在正向无标记（PU）学习环境中工作，我们在本研究中也考虑了这一点；或者 iv) 它们的适用性仅限于特定的应用领域，也就是说，它们不是通用方法、在本手稿中，我们介绍了一种在 Apache Spark 中实现的新型分布式异构迁移学习方法，它克服了上述所有局限。具体来说，它通过采用基于聚类的方法，也能在 PU 学习环境中工作，并能对齐完全异构的特征空间，而无需利用特定应用领域的特殊性。此外，我们的分布式方法允许我们处理大型源数据集和目标数据集。我们在三个不同的应用领域进行了实验评估，这些应用领域可以从迁移学习方法中获益，即人类基因调控网络的重建、医院病人脑中风的预测以及电网客户能源消耗的预测。结果表明，所提出的方法能够超越 4 种最先进的异构迁移学习方法和 3 种基线方法，并且在可扩展性方面表现理想。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Distributed Heterogeneous Transfer Learning

Transfer learning has proved to be effective for building predictive models even in complex conditions with a low amount of available labeled data, by constructing a predictive model for a target domain also using the knowledge coming from a separate domain, called source domain. However, several existing transfer learning methods assume identical feature spaces between the source and the target domains. This assumption limits the possible real-world applications of such methods, since two separate, although related, domains could be described by totally different feature spaces. Heterogeneous transfer learning methods aim to overcome this limitation, but they usually i) make other assumptions on the features, such as requiring the same number of features, ii) are not generally able to distribute the workload over multiple computational nodes, iii) cannot work in the Positive-Unlabeled (PU) learning setting, which we also considered in this study, or iv) their applicability is limited to specific application domains, i.e., they are not general-purpose methods.

In this manuscript, we present a novel distributed heterogeneous transfer learning method, implemented in Apache Spark, that overcomes all the above-mentioned limitations. Specifically, it is able to work also in the PU learning setting by resorting to a clustering-based approach, and can align totally heterogeneous feature spaces, without exploiting peculiarities of specific application domains. Moreover, our distributed approach allows us to process large source and target datasets.

Our experimental evaluation was performed in three different application domains that can benefit from transfer learning approaches, namely the reconstruction of the human gene regulatory network, the prediction of cerebral stroke in hospital patients, and the prediction of customer energy consumption in power grids. The results show that the proposed approach is able to outperform 4 state-of-the-art heterogeneous transfer learning approaches and 3 baselines, and exhibits ideal performances in terms of scalability.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Big Data Research Computer Science-Computer Science Applications

CiteScore

8.40

自引率

3.00%

发文量

期刊介绍： The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as well as papers on specific Platforms and Technologies used to deal with big data. To promote Data Science and interdisciplinary collaboration between fields, and to showcase the benefits of data driven research, papers demonstrating applications of big data in domains as diverse as Geoscience, Social Web, Finance, e-Commerce, Health Care, Environment and Climate, Physics and Astronomy, Chemistry, life sciences and drug discovery, digital libraries and scientific publications, security and government will also be considered. Occasionally the journal may publish whitepapers on policies, standards and best practices.