Text-driven clothed human image synthesis with 3D human model estimation for assistance in shopping

Impact Factor: 3.0 · CAS Tier 4, Computer Science · JCR Q2, Computer Science, Information Systems
S. Karkuzhali, A. Syed Aasim, A. StalinRaj
{"title":"Text-driven clothed human image synthesis with 3D human model estimation for assistance in shopping","authors":"S. Karkuzhali, A. Syed Aasim, A. StalinRaj","doi":"10.1007/s11042-024-20187-x","DOIUrl":null,"url":null,"abstract":"<p>Online shopping has become an integral part of modern consumer culture. Yet, it is plagued by challenges in visualizing clothing items based on textual descriptions and estimating their fit on individual body types. In this work, we present an innovative solution to address these challenges through text-driven clothed human image synthesis with 3D human model estimation, leveraging the power of Vector Quantized Variational AutoEncoder (VQ-VAE). Creating diverse and high-quality human images is a crucial yet difficult undertaking in vision and graphics. With the wide variety of clothing designs and textures, existing generative models are often not sufficient for the end user. In this proposed work, we introduce a solution that is provided by various datasets passed through several models so the optimized solution can be provided along with high-quality images with a range of postures. We use two distinct procedures to create full-body 2D human photographs starting from a predetermined human posture. 1) The provided human pose is first converted to a human parsing map with some sentences that describe the shapes of clothing. 2) The model developed is then given further information about the textures of clothing as an input to produce the final human image. The model is split into two different sections the first one being a codebook at a coarse level that deals with overall results and a fine-level codebook that deals with minute detailing. As mentioned previously at fine level concentrates on the minutiae of textures, whereas the codebook at the coarse level covers the depictions of textures in structures. The decoder trained together with hierarchical codebooks converts the anticipated indices at various levels to human images. The created image can be dependent on the fine-grained text input thanks to the utilization of a blend of experts. The quality of clothing textures is refined by the forecast for finer-level indexes. Implementing these strategies can result in more diversified and high-quality human images than state-of-the-art procedures, according to numerous quantitative and qualitative evaluations. These generated photographs will be converted into a 3D model, resulting in several postures and outcomes, or you may just make a 3D model from a dataset that produces a variety of stances. The application of the PIFu method uses the Marching cube algorithm and Stacked Hourglass method to produce 3D models and realistic images respectively. This results in the generation of high-resolution images based on textual description and reconstruction of the generated images as 3D models. The inception score and Fréchet Intercept Distance, SSIM, and PSNR that was achieved was 1.64 ± 0.20 and 24.64527782349843, 0.642919520, and 32.87157744102002 respectively. The implemented method scores well in comparison with other techniques. 
This technology holds immense promise for reshaping the e-commerce landscape, offering a more immersive and informative means of exploring clothing options.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":null,"pages":null},"PeriodicalIF":3.0000,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Tools and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11042-024-20187-x","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Online shopping has become an integral part of modern consumer culture, yet it is plagued by challenges in visualizing clothing items from textual descriptions and estimating their fit on individual body types. In this work, we present a solution to these challenges through text-driven clothed human image synthesis with 3D human model estimation, leveraging the Vector Quantized Variational AutoEncoder (VQ-VAE). Creating diverse, high-quality human images is a crucial yet difficult undertaking in vision and graphics, and given the wide variety of clothing designs and textures, existing generative models are often insufficient for the end user. In the proposed work, several datasets are passed through several models so that an optimized solution can be provided, along with high-quality images covering a range of postures.

We use two distinct procedures to create full-body 2D human images starting from a predetermined human pose. 1) The provided pose is first converted into a human parsing map, guided by sentences that describe the shapes of clothing. 2) The model is then given further information about the clothing textures as input to produce the final human image. The model is split into two sections: a coarse-level codebook that captures the structural depiction of textures, and a fine-level codebook that captures the minutiae of textures. A decoder trained jointly with these hierarchical codebooks converts the predicted indices at each level into human images. Through a mixture of experts, the generated image can be conditioned on fine-grained text input, and the prediction of fine-level indices refines the quality of the clothing textures. According to numerous quantitative and qualitative evaluations, these strategies yield more diverse and higher-quality human images than state-of-the-art procedures.

The generated images are then converted into 3D models, yielding several postures and outcomes; alternatively, a 3D model can be built directly from a dataset that provides a variety of stances. The PIFu method is applied with the Marching Cubes algorithm and the Stacked Hourglass network to produce 3D models and realistic images, respectively. The result is the generation of high-resolution images from textual descriptions and the reconstruction of those images as 3D models. The achieved Inception Score, Fréchet Inception Distance (FID), SSIM, and PSNR were 1.64 ± 0.20, 24.645, 0.643, and 32.872, respectively; the implemented method compares well with other techniques. This technology holds immense promise for reshaping the e-commerce landscape, offering a more immersive and informative means of exploring clothing options.
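To make the two-level codebook design concrete, the following is a minimal sketch of VQ-VAE-style vector quantization in PyTorch. The class name, the codebook sizes, and the `coarse_vq`/`fine_vq` split are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup as used in VQ-VAE-style models."""
    def __init__(self, num_codes: int, code_dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (batch, code_dim, H, W) feature map from the encoder.
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)  # (B*H*W, d)
        # Squared Euclidean distance to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)  # discrete code indices
        quantized = self.codebook(indices).view(b, h, w, d).permute(0, 3, 1, 2)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(b, h, w)

# The hierarchy described above maps to two quantizers: a coarse codebook
# for texture structure and a fine codebook for texture detail
# (sizes here are assumptions, not the paper's values).
coarse_vq = VectorQuantizer(num_codes=512, code_dim=256)
fine_vq = VectorQuantizer(num_codes=1024, code_dim=256)
```

The decoder described in the abstract would consume the quantized features (or the discrete indices) from both levels to reconstruct the human image.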
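The 3D stage can be illustrated in the same spirit: a PIFu-style pipeline queries a learned occupancy function on a regular grid and extracts a triangle mesh with Marching Cubes. The sketch below uses `skimage.measure.marching_cubes` for the extraction step; `occupancy_fn` is a hypothetical stand-in for the trained implicit network, not the paper's code.

```python
import numpy as np
from skimage import measure

def reconstruct_mesh(occupancy_fn, resolution=128, threshold=0.5):
    """Sample an implicit occupancy field on a voxel grid and extract a
    mesh with Marching Cubes, as PIFu-style pipelines do."""
    # Regular grid over a unit cube centred on the subject.
    lin = np.linspace(-0.5, 0.5, resolution)
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    points = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    # occupancy_fn maps (N, 3) points to inside/outside scores in [0, 1].
    occ = occupancy_fn(points).reshape(resolution, resolution, resolution)
    # Marching Cubes turns the scalar field into a triangle mesh.
    verts, faces, normals, _ = measure.marching_cubes(occ, level=threshold)
    return verts, faces, normals

# Example with a dummy field (a sphere) standing in for the trained network:
verts, faces, normals = reconstruct_mesh(
    lambda p: (np.linalg.norm(p, axis=1) < 0.3).astype(float))
```

In the actual method, the Stacked Hourglass network supplies the pixel-aligned image features that condition the occupancy prediction at each query point.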
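Two of the reported metrics, SSIM and PSNR, can be computed directly with scikit-image; a minimal example follows, assuming uint8 RGB arrays of identical shape (the evaluation protocol itself is not specified in the abstract).

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def image_quality(generated: np.ndarray, reference: np.ndarray):
    """SSIM and PSNR between a generated image and a reference image,
    matching two of the metrics reported above."""
    ssim = structural_similarity(reference, generated,
                                 channel_axis=-1, data_range=255)
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    return ssim, psnr

# Usage with uint8 RGB arrays of identical shape:
# ssim, psnr = image_quality(gen_img, ref_img)
```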


Source Journal
Multimedia Tools and Applications (Engineering Technology · Engineering: Electronic & Electrical)
CiteScore: 7.20
Self-citation rate: 16.70%
Annual publications: 2439
Review time: 9.2 months
About the journal: Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools, as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists and engineers involved in multimedia system research, design and applications. All papers are peer reviewed. Specific areas of interest include multimedia tools, multimedia applications, and prototype multimedia systems and platforms.