RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

IF 9.7 | Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2026-01-01 Epub Date: 2026-01-12 DOI:10.1109/TMM.2026.3651042

Jie Huang;Ruibing Hou;Jiahe Zhao;Hong Chang;Shiguang Shan

Volume 28, pages 2445-2459. Full text: https://ieeexplore.ieee.org/document/11340752/

Citations: 0

Abstract

Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data—including images, text, coordinates, and parsing maps—into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM’s competitive and even superior performance across multiple human-centric referring tasks.
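The abstract says that coordinates are converted into tokens so that referring tasks can be posed as sequence-to-sequence problems, but it does not say how. The sketch below illustrates one common approach from sequence-to-sequence vision models, uniform quantization of normalized coordinates into a fixed number of bins. It is an assumption for illustration only, not RefHCM's actual tokenizer; the names `NUM_BINS`, `box_to_tokens`, `tokens_to_box`, `build_sequence`, and the `<bin_i>` and `<sep>` token formats are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical number of location bins; the abstract does not give a value.
NUM_BINS = 1000

Box = Tuple[float, float, float, float]


def box_to_tokens(box: Box, width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel-space box (x0, y0, x1, y1) into location tokens."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    tokens = []
    for value in normalized:
        clamped = min(max(value, 0.0), 1.0)
        # A value of exactly 1.0 falls into the last bin.
        index = min(int(clamped * num_bins), num_bins - 1)
        tokens.append(f"<bin_{index}>")
    return tokens


def tokens_to_box(tokens: List[str], width: int, height: int,
                  num_bins: int = NUM_BINS) -> Box:
    """Map location tokens back to pixel coordinates using bin centers."""
    indices = [int(token[len("<bin_"):-1]) for token in tokens]
    centers = [(index + 0.5) / num_bins for index in indices]
    return (centers[0] * width, centers[1] * height,
            centers[2] * width, centers[3] * height)


def build_sequence(prompt: str, box_tokens: List[str]) -> List[str]:
    """Join a referring prompt and location tokens into one token sequence."""
    # Whitespace splitting stands in for a real text tokenizer.
    return prompt.lower().split() + ["<sep>"] + box_tokens


if __name__ == "__main__":
    tokens = box_to_tokens((100, 50, 300, 400), width=640, height=480)
    print(build_sequence("the person on the left", tokens))
    print(tokens_to_box(tokens, width=640, height=480))
```

With every modality expressed in one discrete vocabulary like this, a single encoder-decoder can in principle emit text, boxes, or keypoints as ordinary output sequences. The cost is a quantization error of at most half a bin width per coordinate.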

Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore: 11.70
Self-citation rate: 11.00%
Articles published: 576
Review time: 5.5 months

Journal description: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors.