On the Generalization for Transfer Learning: An Information-Theoretic Analysis

Xuetong Wu; Jonathan H. Manton; Uwe Aickelin; Jingge Zhu

IEEE Transactions on Information Theory, vol. 70, no. 10, pp. 7089-7124
DOI: 10.1109/TIT.2024.3441574
Published: 14 August 2024
Transfer learning, or domain adaptation, is concerned with machine learning problems in which training and testing data come from possibly different probability distributions. In this work, we give an information-theoretic analysis of the generalization error and excess risk of transfer learning algorithms. Our results suggest, perhaps as expected, that the Kullback-Leibler (KL) divergence $D(\mu\|\mu')$ plays an important role in the characterizations, where $\mu$ and $\mu'$ denote the distributions of the training data and the testing data, respectively. Specifically, we provide generalization error and excess risk upper bounds for learning algorithms where data from both distributions are available in the training phase. Recognizing that the bounds could be sub-optimal in general, we provide improved excess risk upper bounds for a certain class of algorithms, including the empirical risk minimization (ERM) algorithm, by making stronger assumptions through the central condition. To demonstrate the usefulness of the bounds, we further extend the analysis to the Gibbs algorithm and the noisy stochastic gradient descent method. We then generalize the mutual information bound with other divergences such as the $\phi$-divergence and the Wasserstein distance, which may lead to tighter bounds and can handle the case when $\mu$ is not absolutely continuous with respect to $\mu'$. Several numerical results are provided to demonstrate our theoretical findings. Lastly, to address the problem that the bounds are often not directly applicable in practice due to the absence of distributional knowledge of the data, we develop an algorithm (called InfoBoost) that dynamically adjusts the importance weights for both source and target data based on certain information measures. The empirical results show the effectiveness of the proposed algorithm.
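The abstract's central message, that generalization under distribution shift is controlled jointly by a mutual-information term and a KL-divergence term, can be illustrated with a representative bound of this flavor. The display below is a hedged sketch in generic notation, not the paper's exact statement: $S=(Z_1,\dots,Z_n)$ is drawn i.i.d. from a source distribution $P_{\mathrm{src}}$, the hypothesis is $W=A(S)$, testing is under a target distribution $P_{\mathrm{tgt}}$, and the loss $\ell(w,Z)$ is assumed $\sigma$-sub-Gaussian under $P_{\mathrm{src}}$ for every $w$; the paper's actual theorems further cover mixed source/target training sets, excess risk, and the other divergences mentioned above.

% Schematic transfer-learning generalization bound (illustrative only)
\begin{align*}
\Bigl|\mathbb{E}\bigl[L_{P_{\mathrm{tgt}}}(W)-L_{S}(W)\bigr]\Bigr|
\;\le\;
\underbrace{\sqrt{\frac{2\sigma^{2}}{n}\,I(W;S)}}_{\text{in-distribution term}}
\;+\;
\underbrace{\sqrt{2\sigma^{2}\,D\bigl(P_{\mathrm{tgt}}\,\big\|\,P_{\mathrm{src}}\bigr)}}_{\text{distribution-shift term}}.
\end{align*}

The first term is the standard input-output mutual-information bound for matched training and testing distributions; the second is a change-of-measure penalty that vanishes when source and target coincide, and it is exactly this kind of KL term that the $\phi$-divergence and Wasserstein variants replace when the required absolute continuity fails.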
Journal Introduction:
The IEEE Transactions on Information Theory is a journal that publishes theoretical and experimental papers concerned with the transmission, processing, and utilization of information. The boundaries of acceptable subject matter are intentionally not sharply delimited. Rather, it is hoped that as the focus of research activity changes, a flexible policy will permit this Transactions to follow suit. Current appropriate topics are best reflected by recent Tables of Contents; they are summarized in the titles of editorial areas that appear on the inside front cover.