Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes

IF 3.1 | CAS Q4, Computer Science | JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS
Tiecheng Song, Yi Peng, Chun Liu, Anyong Qin, Yue Zhao, Feng Yang, Chenqiang Gao
{"title":"Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes","authors":"Tiecheng Song ,&nbsp;Yi Peng ,&nbsp;Chun Liu ,&nbsp;Anyong Qin ,&nbsp;Yue Zhao ,&nbsp;Feng Yang ,&nbsp;Chenqiang Gao","doi":"10.1016/j.jvcir.2025.104545","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-person pose estimation (MPPE) in crowded scenes is a challenging task due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and fail to adopt distinct processing strategies to describe different types of joints. (2) They only use simple joint names as text prompts, failing to mine other informative text hints to represent detailed joint situations. To address these two problems, in this paper we propose an occlusion-aware MPPE method by exploring keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework which contains a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy to learn global and local context features for two types of joints by considering their movement flexibility. In the teacher network, we introduce the vision-language model to represent the detailed joint situations and explore dual prompts, i.e., rough body part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge contained in the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of our KDG over state-of-the-art methods for MMPE in crowded and occluded scenes. The source codes are available at <span><span>https://github.com/stc-cqupt/KDG</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104545"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325001592","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Multi-person pose estimation (MPPE) in crowded scenes is a challenging task due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and fail to adopt distinct processing strategies for different types of joints. (2) They only use simple joint names as text prompts, failing to mine other informative text hints to represent detailed joint situations. To address these two problems, in this paper we propose an occlusion-aware MPPE method that explores keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework that contains a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy to learn global and local context features for two types of joints by considering their movement flexibility. In the teacher network, we introduce a vision-language model to represent the detailed joint situations and explore dual prompts, i.e., rough body part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge contained in the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of our KDG over state-of-the-art methods for MPPE in crowded and occluded scenes. The source code is available at https://github.com/stc-cqupt/KDG.
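For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal PyTorch-style sketch of its two core ideas: a flexibility-based keypoint grouping and a dual-prompt distillation loss. Everything here (the COCO-style joint split, the prompt wording, the loss form, and all identifiers) is an illustrative assumption, not the authors' implementation; the actual code is in the linked repository.

```python
# Illustrative sketch only: joint grouping by movement flexibility, plus a
# teacher-to-student distillation loss guided by dual text prompts. Names,
# the joint split, and the loss form are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

# Keypoint grouping (COCO indexing assumed): relatively stable head/torso
# joints vs. highly flexible limb joints, each group of which could then be
# processed by its own context-feature branch.
STABLE_JOINTS   = [0, 1, 2, 3, 4, 5, 6, 11, 12]   # nose, eyes, ears, shoulders, hips
FLEXIBLE_JOINTS = [7, 8, 9, 10, 13, 14, 15, 16]   # elbows, wrists, knees, ankles

# Dual prompts: rough body-part prompts and fine-grained per-joint prompts.
# The wording is hypothetical; a frozen CLIP-like text encoder would map
# each string to a D-dimensional embedding.
BODY_PART_PROMPTS = [
    "a photo of a person's upper body",
    "a photo of a person's lower body",
]
JOINT_PROMPTS = [
    f"a photo of a person's {name}"
    for name in ("nose", "left elbow", "right wrist")  # ... one per joint
]

def kdg_style_distillation_loss(student_feats: torch.Tensor,
                                teacher_feats: torch.Tensor,
                                prompt_feats: torch.Tensor,
                                tau: float = 0.07) -> torch.Tensor:
    """Transfer the teacher's text-aligned joint representations to the student.

    student_feats, teacher_feats: (N, D) per-joint visual features.
    prompt_feats: (K, D) embeddings of the dual prompts from the text encoder.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    z = F.normalize(prompt_feats, dim=-1)

    # Feature-level distillation: pull each student joint feature toward
    # the corresponding teacher feature (cosine distance).
    feat_loss = (1.0 - (s * t).sum(dim=-1)).mean()

    # Vision-language alignment: make the student's joint-to-prompt
    # similarity distribution match the teacher's (KL divergence).
    teacher_probs = F.softmax(t @ z.T / tau, dim=-1)
    student_logp  = F.log_softmax(s @ z.T / tau, dim=-1)
    align_loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")

    return feat_loss + align_loss
```

In a full training setup such a term would be added to the usual keypoint heatmap or regression losses, and the teacher (together with its text encoder) would be discarded at inference time, which is the usual payoff of a distillation framework.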
Source Journal

Journal of Visual Communication and Image Representation (Engineering & Technology, Computer Science: Software Engineering)
CiteScore: 5.40
Self-citation rate: 11.50%
Annual articles: 188
Review time: 9.9 months
Journal description: The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.