Tiecheng Song, Yi Peng, Chun Liu, Anyong Qin, Yue Zhao, Feng Yang, Chenqiang Gao
{"title":"拥挤场景中基于关键点分组和双提示引导的闭塞感知多人姿态估计","authors":"Tiecheng Song , Yi Peng , Chun Liu , Anyong Qin , Yue Zhao , Feng Yang , Chenqiang Gao","doi":"10.1016/j.jvcir.2025.104545","DOIUrl":null,"url":null,"abstract":"<div><div>Multi-person pose estimation (MPPE) in crowded scenes is a challenging task due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and fail to adopt distinct processing strategies to describe different types of joints. (2) They only use simple joint names as text prompts, failing to mine other informative text hints to represent detailed joint situations. To address these two problems, in this paper we propose an occlusion-aware MPPE method by exploring keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework which contains a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy to learn global and local context features for two types of joints by considering their movement flexibility. In the teacher network, we introduce the vision-language model to represent the detailed joint situations and explore dual prompts, i.e., rough body part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge contained in the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of our KDG over state-of-the-art methods for MMPE in crowded and occluded scenes. The source codes are available at <span><span>https://github.com/stc-cqupt/KDG</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"111 ","pages":"Article 104545"},"PeriodicalIF":3.1000,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes\",\"authors\":\"Tiecheng Song , Yi Peng , Chun Liu , Anyong Qin , Yue Zhao , Feng Yang , Chenqiang Gao\",\"doi\":\"10.1016/j.jvcir.2025.104545\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multi-person pose estimation (MPPE) in crowded scenes is a challenging task due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and fail to adopt distinct processing strategies to describe different types of joints. (2) They only use simple joint names as text prompts, failing to mine other informative text hints to represent detailed joint situations. To address these two problems, in this paper we propose an occlusion-aware MPPE method by exploring keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework which contains a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy to learn global and local context features for two types of joints by considering their movement flexibility. 
In the teacher network, we introduce the vision-language model to represent the detailed joint situations and explore dual prompts, i.e., rough body part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge contained in the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of our KDG over state-of-the-art methods for MMPE in crowded and occluded scenes. The source codes are available at <span><span>https://github.com/stc-cqupt/KDG</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":54755,\"journal\":{\"name\":\"Journal of Visual Communication and Image Representation\",\"volume\":\"111 \",\"pages\":\"Article 104545\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2025-07-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Visual Communication and Image Representation\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1047320325001592\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Visual Communication and Image Representation","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1047320325001592","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Occlusion-aware multi-person pose estimation with keypoint grouping and dual-prompt guidance in crowded scenes
Multi-person pose estimation (MPPE) in crowded scenes is a challenging task due to severe keypoint occlusions. Although great progress has been made in learning effective joint features for MPPE, existing methods still have two problems. (1) They seldom consider the movement characteristics of human joints and fail to adopt distinct processing strategies for different types of joints. (2) They use only simple joint names as text prompts, failing to mine other informative text hints that represent detailed joint situations. To address these two problems, in this paper we propose an occlusion-aware MPPE method that explores keypoint grouping and dual-prompt guidance (KDG). KDG adopts a distillation learning framework that contains a student network and a teacher network. In the student network, we perform instance decoupling and propose a keypoint grouping strategy to learn global and local context features for two types of joints according to their movement flexibility. In the teacher network, we introduce a vision-language model to represent detailed joint situations and explore dual prompts, i.e., rough body-part prompts and fine-grained joint prompts, to align text and visual features. Finally, we design loss functions to train the whole network and effectively transfer the rich vision-language knowledge contained in the teacher network to the student network. Experimental results on two benchmark datasets demonstrate the superiority of our KDG over state-of-the-art methods for MPPE in crowded and occluded scenes. The source code is available at https://github.com/stc-cqupt/KDG.
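The abstract outlines two core mechanisms: splitting joints into groups by movement flexibility, and aligning visual joint features with dual text prompts through a vision-language teacher. The sketch below is a minimal illustration of both ideas, not the authors' implementation (see the linked repository for that); the COCO keypoint split, the prompt strings, and the contrastive loss form are assumptions made for illustration only.

import torch
import torch.nn.functional as F

# Hypothetical grouping by movement flexibility (an assumption, not
# necessarily the paper's exact split): elbows, wrists, knees, and ankles
# move freely, while head, shoulder, and hip joints are comparatively stable.
# Indices follow the standard 17-joint COCO keypoint ordering.
FLEXIBLE = [7, 8, 9, 10, 13, 14, 15, 16]   # elbows, wrists, knees, ankles
STABLE   = [0, 1, 2, 3, 4, 5, 6, 11, 12]   # head points, shoulders, hips

# Dual prompts: rough body-part level and fine-grained joint level
# (example strings only; the paper's actual prompt templates may differ).
PART_PROMPTS  = ["a photo of a person's arm", "a photo of a person's leg"]
JOINT_PROMPTS = ["a visible left wrist", "an occluded left wrist"]

def group_features(joint_feats: torch.Tensor):
    """Split per-joint features of shape (B, 17, C) into the two movement
    groups, so each group can receive its own context modelling."""
    return joint_feats[:, FLEXIBLE], joint_feats[:, STABLE]

def distill_loss(student_feats: torch.Tensor,
                 teacher_text_feats: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style alignment loss: pull each student joint feature toward
    the teacher's text embedding of its matching prompt. Rows of the two
    tensors are assumed to be paired, shape (N, C) each."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_text_feats, dim=-1)
    logits = s @ t.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)

In the method itself, the teacher's text embeddings would come from a pretrained vision-language model, and this single contrastive term stands in for the paper's full set of training and distillation losses.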
About the journal:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.