Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

Dongyue Li, Aneesh Sharma, Hongyang R. Zhang
{"title":"Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity","authors":"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang","doi":"arxiv-2409.06091","DOIUrl":null,"url":null,"abstract":"Multitask learning is a widely used paradigm for training models on diverse\ntasks, with applications ranging from graph neural networks to language model\nfine-tuning. Since tasks may interfere with each other, a key notion for\nmodeling their relationships is task affinity. This includes pairwise task\naffinity, computed among pairs of tasks, and higher-order affinity, computed\namong subsets of tasks. Naively computing either of them requires repeatedly\ntraining on data from various task combinations, which is computationally\nintensive. We present a new algorithm Grad-TAG that can estimate task\naffinities without this repeated training. The key idea of Grad-TAG is to train a \"base\" model for all tasks and then\nuse a linearization technique to estimate the loss of the model for a specific\ntask combination. The linearization works by computing a gradient-based\napproximation of the loss, using low-dimensional projections of gradients as\nfeatures in a logistic regression to predict labels for the task combination.\nWe show that the linearized model can provably approximate the loss when the\ngradient-based approximation is accurate, and also empirically verify that on\nseveral large models. Then, given the estimated task affinity, we design a\nsemi-definite program for clustering similar tasks by maximizing the average\ndensity of clusters. We evaluate Grad-TAG's performance across seven datasets, including\nmulti-label classification on graphs, and instruction fine-tuning of language\nmodels. Our task affinity estimates are within 2.7% distance to the true\naffinities while needing only 3% of FLOPs in full training. On our largest\ngraph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\nwithin 5% distance to the true affinities, using only 112 GPU hours. Our\nresults show that Grad-TAG achieves excellent performance and runtime tradeoffs\ncompared to existing approaches.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm, Grad-TAG, that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and we empirically verify this on several large models. Then, given the estimated task affinities, we design a semidefinite program that clusters similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% of the true affinities while needing only 3% of the FLOPs of full training. On our largest graph, with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% of the true affinities using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
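
To make the estimation step more concrete, below is a minimal sketch of the linearization idea described in the abstract: per-example gradients from a shared "base" model are projected to a low dimension and used as features in a logistic regression, whose held-out loss serves as a cheap proxy for the loss of a given task combination. All names, shapes, and the random data are hypothetical assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical setup: gradients, labels, and task assignments are random
# placeholders standing in for quantities computed from a real base model.
rng = np.random.default_rng(0)
n_examples, grad_dim, proj_dim = 2000, 10_000, 100
per_example_grads = rng.normal(size=(n_examples, grad_dim))  # per-example gradients of the base model
labels = rng.integers(0, 2, size=n_examples)                 # binary label of each example
task_ids = rng.integers(0, 10, size=n_examples)              # task membership of each example

# Low-dimensional random projection of the gradients (Johnson-Lindenstrauss style),
# so the downstream logistic regression stays cheap.
projection = rng.normal(size=(grad_dim, proj_dim)) / np.sqrt(proj_dim)
features = per_example_grads @ projection


def estimate_combination_loss(task_subset, holdout_frac=0.2):
    """Estimate the loss for a subset of tasks without retraining, by fitting
    a logistic regression on projected gradients of that subset's examples."""
    mask = np.isin(task_ids, list(task_subset))
    X, y = features[mask], labels[mask]
    split = int(len(X) * (1 - holdout_frac))
    clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
    # Held-out log loss of the linearized model acts as the loss estimate
    # for this task combination.
    return log_loss(y[split:], clf.predict_proba(X[split:]))


# Example: estimated loss for the combination of (hypothetical) tasks {0, 3}.
print(estimate_combination_loss({0, 3}))
```

Repeating this estimate over many subsets would yield the pairwise or higher-order affinity scores that the clustering step then consumes; the semidefinite-program clustering itself is omitted from this sketch.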