合成多人和稀有姿态图像用于人体姿态估计

IF 9.7 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-07-04 DOI:10.1109/TMM.2025.3586122

Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun

{"title":"合成多人和稀有姿态图像用于人体姿态估计","authors":"Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun","doi":"10.1109/TMM.2025.3586122","DOIUrl":null,"url":null,"abstract":"Human pose estimation (HPE) models underperform in recognizing rare poses because they suffer from data imbalance problems (i.e., there are few image samples for rare poses) in their training datasets. From a data perspective, the most intuitive solution is to synthesize data for rare poses. Specifically, the rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the data constrains the model. An alternative method is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we aim to address the above two issues, i.e., rare poses and complex scenes, for person image generation. We propose a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is specifically trained on augmented pose data, and each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator. It is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of the diversity of poses and the complexity of scenes. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. We compare our method against traditional pose data augmentation and person image generation methods, and it demonstrates its superior performance both quantitatively and qualitatively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6568-6580"},"PeriodicalIF":9.7000,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Synthesizing Multi-Person and Rare Pose Images for Human Pose Estimation\",\"authors\":\"Liuqing Zhao;Zichen Tian;Peng Zou;Richang Hong;Qianru Sun\",\"doi\":\"10.1109/TMM.2025.3586122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human pose estimation (HPE) models underperform in recognizing rare poses because they suffer from data imbalance problems (i.e., there are few image samples for rare poses) in their training datasets. From a data perspective, the most intuitive solution is to synthesize data for rare poses. Specifically, the rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the data constrains the model. An alternative method is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we aim to address the above two issues, i.e., rare poses and complex scenes, for person image generation. We propose a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is specifically trained on augmented pose data, and each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator. It is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of the diversity of poses and the complexity of scenes. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. We compare our method against traditional pose data augmentation and person image generation methods, and it demonstrates its superior performance both quantitatively and qualitatively.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"6568-6580\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11071880/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11071880/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

人体姿态估计（HPE）模型在识别罕见姿态方面表现不佳，因为它们在训练数据集中存在数据不平衡问题（即罕见姿态的图像样本很少）。从数据的角度来看，最直观的解决方案是对罕见姿势进行数据合成。具体来说，基于规则的方法对现有数据应用手动操作（如Cutout和GridMask），因此数据的有限多样性限制了模型。另一种方法是通过深度生成模型（如ControlNet和HumanSD）学习底层数据分布，然后从分布中采样“新数据”。这对于在常见场景中生成频繁的姿势效果很好，但是当应用于罕见的姿势或复杂的场景时（例如多个肢体重叠的人）就会受到影响。在本文中，我们的目标是解决上述两个问题，即罕见的姿势和复杂的场景，为人物图像的生成。我们提出一个两阶段的方法。在第一阶段，我们设计了一个名为PoseFactory的可控姿态生成器来合成稀有姿态。这个生成器专门训练增强姿势数据，每个姿势都标有其难度和稀有程度。在第二阶段，我们介绍了一个名为MultipGenerator的多人图像生成器。它以多种人体姿势和复杂场景的文本描述为条件。就姿势的多样性和场景的复杂性而言，这两个阶段都是可控的。为了评估，我们在三个广泛使用的数据集上进行了广泛的实验：MS-COCO， HumanArt和ochhuman。将该方法与传统的姿态数据增强和人物图像生成方法进行了比较，结果表明该方法在定性和定量上都具有较好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Synthesizing Multi-Person and Rare Pose Images for Human Pose Estimation

Human pose estimation (HPE) models underperform in recognizing rare poses because they suffer from data imbalance problems (i.e., there are few image samples for rare poses) in their training datasets. From a data perspective, the most intuitive solution is to synthesize data for rare poses. Specifically, the rule-based methods apply manual manipulations (such as Cutout and GridMask) to the existing data, so the limited diversity of the data constrains the model. An alternative method is to learn the underlying data distribution via deep generative models (such as ControlNet and HumanSD) and then sample “new data” from the distribution. This works well for generating frequent poses in common scenes, but suffers when applied to rare poses or complex scenes (such as multiple persons with overlapping limbs). In this paper, we aim to address the above two issues, i.e., rare poses and complex scenes, for person image generation. We propose a two-stage method. In the first stage, we design a controllable pose generator named PoseFactory to synthesize rare poses. This generator is specifically trained on augmented pose data, and each pose is labelled with its level of difficulty and rarity. In the second stage, we introduce a multi-person image generator named MultipGenerator. It is conditioned on multiple human poses and textual descriptions of complex scenes. Both stages are controllable in terms of the diversity of poses and the complexity of scenes. For evaluation, we conduct extensive experiments on three widely used datasets: MS-COCO, HumanArt, and OCHuman. We compare our method against traditional pose data augmentation and person image generation methods, and it demonstrates its superior performance both quantitatively and qualitatively.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.