Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
{"title":"GUNet:用于生成稳定和多样化姿势的图卷积网络联合扩散模型","authors":"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang","doi":"arxiv-2409.11689","DOIUrl":null,"url":null,"abstract":"Pose skeleton images are an important reference in pose-controllable image\ngeneration. In order to enrich the source of skeleton images, recent works have\ninvestigated the generation of pose skeletons based on natural language. These\nmethods are based on GANs. However, it remains challenging to perform diverse,\nstructurally correct and aesthetically pleasing human pose skeleton generation\nwith various textual inputs. To address this problem, we propose a framework\nwith GUNet as the main model, PoseDiffusion. It is the first generative\nframework based on a diffusion model and also contains a series of variants\nfine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\nseveral desired properties that outperform existing methods. 1) Correct\nSkeletons. GUNet, a denoising model of PoseDiffusion, is designed to\nincorporate graphical convolutional neural networks. It is able to learn the\nspatial relationships of the human skeleton by introducing skeletal information\nduring the training process. 2) Diversity. We decouple the key points of the\nskeleton and characterise them separately, and use cross-attention to introduce\ntextual conditions. Experimental results show that PoseDiffusion outperforms\nexisting SoTA algorithms in terms of stability and diversity of text-driven\npose skeleton generation. Qualitative analyses further demonstrate its\nsuperiority for controllable generation in Stable Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation\",\"authors\":\"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang\",\"doi\":\"arxiv-2409.11689\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Pose skeleton images are an important reference in pose-controllable image\\ngeneration. In order to enrich the source of skeleton images, recent works have\\ninvestigated the generation of pose skeletons based on natural language. These\\nmethods are based on GANs. However, it remains challenging to perform diverse,\\nstructurally correct and aesthetically pleasing human pose skeleton generation\\nwith various textual inputs. To address this problem, we propose a framework\\nwith GUNet as the main model, PoseDiffusion. It is the first generative\\nframework based on a diffusion model and also contains a series of variants\\nfine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\\nseveral desired properties that outperform existing methods. 1) Correct\\nSkeletons. GUNet, a denoising model of PoseDiffusion, is designed to\\nincorporate graphical convolutional neural networks. It is able to learn the\\nspatial relationships of the human skeleton by introducing skeletal information\\nduring the training process. 2) Diversity. We decouple the key points of the\\nskeleton and characterise them separately, and use cross-attention to introduce\\ntextual conditions. 
Experimental results show that PoseDiffusion outperforms\\nexisting SoTA algorithms in terms of stability and diversity of text-driven\\npose skeleton generation. Qualitative analyses further demonstrate its\\nsuperiority for controllable generation in Stable Diffusion.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11689\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11689","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
Pose skeleton images are an important reference for pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated generating pose skeletons from natural language. These methods are based on GANs; however, it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework for this task based on a diffusion model, and it also contains a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desirable properties that outperform existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the keypoints of the skeleton, characterise them separately, and introduce textual conditions through cross-attention. Experimental results show that PoseDiffusion outperforms existing state-of-the-art (SoTA) algorithms in the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation with Stable Diffusion.
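To make the "correct skeletons" idea concrete, the following is a minimal PyTorch sketch of a graph-convolutional layer that injects skeletal structure through a fixed adjacency matrix over keypoints, the general mechanism the abstract describes. It is an illustrative guess, not the authors' released code: the names (SkeletonGCNLayer, normalized_adjacency), the toy 5-joint skeleton, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical toy skeleton with 5 keypoints; edges follow parent-child bones.
# A real model would likely use a full skeleton such as the 17-joint COCO layout.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]  # head-torso, torso-limbs
NUM_JOINTS = 5

def normalized_adjacency(num_joints, edges):
    """Build A_hat = D^{-1/2} (A + I) D^{-1/2}, the standard GCN propagation matrix."""
    a = torch.eye(num_joints)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    deg = a.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution step: each joint's feature is updated from its
    skeletal neighbours, so bone connectivity constrains the denoiser."""
    def __init__(self, in_dim, out_dim, num_joints=NUM_JOINTS, edges=EDGES):
        super().__init__()
        self.register_buffer("a_hat", normalized_adjacency(num_joints, edges))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):    # x: (batch, num_joints, in_dim)
        x = self.a_hat @ x   # mix each joint's features with its skeletal neighbours
        return torch.relu(self.linear(x))
```

In a diffusion setting, a stack of such layers would sit inside the denoiser, taking noisy keypoint features plus a timestep embedding and predicting the noise; that wiring is omitted here for brevity.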
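The "diversity" mechanism, decoupled keypoint tokens conditioned on text via cross-attention, can be sketched the same way. Again, this is a plausible reading of the abstract using standard cross-attention, not the paper's implementation; KeypointTextCrossAttention and all dimensions (e.g. the CLIP-like 768-wide text embeddings) are assumptions.

```python
class KeypointTextCrossAttention(nn.Module):
    """Each keypoint is kept as its own token (a decoupled representation) and
    queries the text embeddings, so textual conditions can steer joints individually."""
    def __init__(self, joint_dim, text_dim, num_heads=4):
        super().__init__()
        # joint_dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(
            embed_dim=joint_dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True)
        self.norm = nn.LayerNorm(joint_dim)

    def forward(self, joints, text):  # joints: (B, J, joint_dim); text: (B, T, text_dim)
        attended, _ = self.attn(query=joints, key=text, value=text)
        return self.norm(joints + attended)  # residual connection keeps joint identity

# Usage sketch: graph convolution over the skeleton, then text conditioning.
layer = SkeletonGCNLayer(in_dim=64, out_dim=64)
xattn = KeypointTextCrossAttention(joint_dim=64, text_dim=768)  # e.g. CLIP text width
joints = torch.randn(2, NUM_JOINTS, 64)   # noisy per-joint features
text = torch.randn(2, 77, 768)            # text-encoder output
out = xattn(layer(joints), text)          # (2, 5, 64)
```

Keeping the keypoints as separate tokens (rather than flattening the pose into one vector) is what lets cross-attention bind different words to different joints, which is a natural source of the diversity the abstract claims.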