Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

Dongyue Li, Aneesh Sharma, Hongyang R. Zhang
{"title":"Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity","authors":"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang","doi":"arxiv-2409.06091","DOIUrl":null,"url":null,"abstract":"Multitask learning is a widely used paradigm for training models on diverse\ntasks, with applications ranging from graph neural networks to language model\nfine-tuning. Since tasks may interfere with each other, a key notion for\nmodeling their relationships is task affinity. This includes pairwise task\naffinity, computed among pairs of tasks, and higher-order affinity, computed\namong subsets of tasks. Naively computing either of them requires repeatedly\ntraining on data from various task combinations, which is computationally\nintensive. We present a new algorithm Grad-TAG that can estimate task\naffinities without this repeated training. The key idea of Grad-TAG is to train a \"base\" model for all tasks and then\nuse a linearization technique to estimate the loss of the model for a specific\ntask combination. The linearization works by computing a gradient-based\napproximation of the loss, using low-dimensional projections of gradients as\nfeatures in a logistic regression to predict labels for the task combination.\nWe show that the linearized model can provably approximate the loss when the\ngradient-based approximation is accurate, and also empirically verify that on\nseveral large models. Then, given the estimated task affinity, we design a\nsemi-definite program for clustering similar tasks by maximizing the average\ndensity of clusters. We evaluate Grad-TAG's performance across seven datasets, including\nmulti-label classification on graphs, and instruction fine-tuning of language\nmodels. Our task affinity estimates are within 2.7% distance to the true\naffinities while needing only 3% of FLOPs in full training. On our largest\ngraph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\nwithin 5% distance to the true affinities, using only 112 GPU hours. Our\nresults show that Grad-TAG achieves excellent performance and runtime tradeoffs\ncompared to existing approaches.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that estimates task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model provably approximates the loss when the gradient-based approximation is accurate, and we verify this empirically on several large models. Then, given the estimated task affinities, we design a semidefinite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% of the true affinities while requiring only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% of the true affinities using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
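
To make the linearization step concrete, the following is a minimal, hypothetical sketch in Python: per-example gradients from the shared base model are randomly projected to a low dimension and used as features in a logistic regression, whose log loss serves as a proxy for the loss of a model trained on that task combination. All names, shapes, and the projection dimension below are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of gradient-based affinity estimation (not the paper's code).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss

    rng = np.random.default_rng(0)

    def project_gradients(grads: np.ndarray, d: int = 100) -> np.ndarray:
        """Randomly project per-example gradients (n x p) down to d dimensions."""
        n, p = grads.shape
        proj = rng.normal(size=(p, d)) / np.sqrt(d)
        return grads @ proj

    def estimate_combination_loss(grads: np.ndarray, labels: np.ndarray,
                                  subset: np.ndarray) -> float:
        """Approximate the loss on a task combination without retraining:
        fit a logistic regression on projected gradients of the subset's
        examples (the linearization step) and report its log loss."""
        features = project_gradients(grads[subset])
        y = labels[subset]
        clf = LogisticRegression(max_iter=1000).fit(features, y)
        return log_loss(y, clf.predict_proba(features), labels=clf.classes_)

    # Toy usage: 200 examples with 1,000-dimensional gradients from a shared base model.
    grads = rng.normal(size=(200, 1000))
    labels = rng.integers(0, 2, size=200)
    subset = np.arange(0, 100)  # examples belonging to one task combination
    print(estimate_combination_loss(grads, labels, subset))

In this sketch, the estimated losses for different task combinations would play the role of the affinity scores that the subsequent clustering step consumes.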