Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Impact factor: 18.6

IEEE Transactions on Pattern Analysis and Machine Intelligence, Pub Date: 2025-03-18, DOI: 10.1109/TPAMI.2025.3552604

Yizhou Wang, Yixuan Wu, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang, Shixiang Tang

Volume 47, Issue 7, pp. 5672-5689. Journal Article. https://ieeexplore.ieee.org/document/10930828/


Abstract

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they do not explore 3D and vision-language tasks for human-centric perception, and they require task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of our proposed method, which achieves state-of-the-art performance on 11 of them.
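The two-head idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the `Task` class, the function names, the image size, and the assignment of tasks to heads are all hypothetical, chosen only to show how one discrete head and one continuous head could replace per-task heads.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

# Hypothetical toy vocabulary for the discrete head (not from the paper).
VOCAB = ["walking", "running", "sitting"]


def discrete_head(scores: Sequence[float]) -> str:
    """Return the vocabulary entry with the highest score (argmax)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return VOCAB[best]


def continuous_head(values: Sequence[float], width: int, height: int) -> List[float]:
    """Scale normalized values in [0, 1] to pixel coordinates.

    Values are assumed to alternate x, y, x, y, ...
    """
    return [v * (width if i % 2 == 0 else height) for i, v in enumerate(values)]


@dataclass
class Task:
    name: str
    output_kind: str  # "discrete" or "continuous" (illustrative labels)


def translate(task: Task, decoder_output: Sequence[float],
              width: int = 640, height: int = 480) -> Union[str, List[float]]:
    """Route a shared decoder output to one of the two general heads."""
    if task.output_kind == "discrete":
        return discrete_head(decoder_output)
    if task.output_kind == "continuous":
        return continuous_head(decoder_output, width, height)
    raise ValueError(f"unknown output kind: {task.output_kind}")


# Illustrative task-to-head assignment; the paper may route tasks differently.
action = translate(Task("skeleton-based action recognition", "discrete"),
                   [0.1, 0.7, 0.2])
box = translate(Task("pedestrian detection", "continuous"),
                [0.25, 0.5, 0.75, 1.0])
print(action)
print(box)
```

In this sketch, adding a new task means adding a `Task` entry rather than a new head, which is the property the abstract attributes to the uniform representation.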


Code is available at https://github.com/OpenGVLab/Hulk.
