MMTrans: MultiModal Transformer for realistic video virtual try-on
Xinrong Hu, Ziyi Zhang, Ruiqi Luo, Junjie Huang, Jinxing Liang, Jin Huang, Tao Peng, Hao Cai
{"title":"MMTrans:多模态变压器的现实视频虚拟试戴","authors":"Xinrong Hu, Ziyi Zhang, Ruiqi Luo, Junjie Huang, Jinxing Liang, Jin Huang, Tao Peng, Hao Cai","doi":"10.1145/3574131.3574431","DOIUrl":null,"url":null,"abstract":"Video virtual try-on methods aim to generate coherent, smooth, and realistic try-on videos, it matches the target clothing with the person in the video in a spatiotemporally consistent manner. Existing methods can match the human body with the clothing and then present it by the way of video, however it will cause the problem of excessive distortion of the grid and poor display effect at last. Given the problem, we found that was due to the neglect of the relationship between inputs lead to the loss of some features, while the conventional convolution operation is difficult to establish the remote information that is crucial in generating globally consistent results, restriction on clothing texture detail can lead to excessive deformation in the process of TPS fitting, make a lot of the try-on method in the final video rendering is not real. For the above problems, we reduce the excessive distortion of the garment during deformation by using a constraint function to regularize the TPS parameters; it also proposes a multimodal two-stage combinatorial Transformer: in the first stage, an interaction module is added, in which the long-distance relationship between people and clothing can be simulated, and then a better remote relationship can be obtained as well as contribute to the performance of TPS; in the second stage, an activation module is added, which can establish a global dependency, and this dependency can make the input important regions in the data are more prominent, which can provide more natural intermediate inputs for subsequent U-net networks. This paper’s method can bring better results for video virtual fitting, and experiments on the VVT dataset prove that the method outperforms previous methods in both quantitative and qualitative aspects.","PeriodicalId":111802,"journal":{"name":"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry","volume":"222 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"MMTrans: MultiModal Transformer for realistic video virtual try-on\",\"authors\":\"Xinrong Hu, Ziyi Zhang, Ruiqi Luo, Junjie Huang, Jinxing Liang, Jin Huang, Tao Peng, Hao Cai\",\"doi\":\"10.1145/3574131.3574431\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video virtual try-on methods aim to generate coherent, smooth, and realistic try-on videos, it matches the target clothing with the person in the video in a spatiotemporally consistent manner. Existing methods can match the human body with the clothing and then present it by the way of video, however it will cause the problem of excessive distortion of the grid and poor display effect at last. Given the problem, we found that was due to the neglect of the relationship between inputs lead to the loss of some features, while the conventional convolution operation is difficult to establish the remote information that is crucial in generating globally consistent results, restriction on clothing texture detail can lead to excessive deformation in the process of TPS fitting, make a lot of the try-on method in the final video rendering is not real. 
For the above problems, we reduce the excessive distortion of the garment during deformation by using a constraint function to regularize the TPS parameters; it also proposes a multimodal two-stage combinatorial Transformer: in the first stage, an interaction module is added, in which the long-distance relationship between people and clothing can be simulated, and then a better remote relationship can be obtained as well as contribute to the performance of TPS; in the second stage, an activation module is added, which can establish a global dependency, and this dependency can make the input important regions in the data are more prominent, which can provide more natural intermediate inputs for subsequent U-net networks. This paper’s method can bring better results for video virtual fitting, and experiments on the VVT dataset prove that the method outperforms previous methods in both quantitative and qualitative aspects.\",\"PeriodicalId\":111802,\"journal\":{\"name\":\"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry\",\"volume\":\"222 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3574131.3574431\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3574131.3574431","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Video virtual try-on methods aim to generate coherent, smooth, and realistic try-on videos by matching the target clothing to the person in the video in a spatiotemporally consistent manner. Existing methods can align the human body with the clothing and present the result as video, but they often suffer from excessive distortion of the warping grid and poor visual quality. We trace these failures to three causes: neglecting the relationships between inputs loses features; conventional convolution struggles to capture the long-range dependencies that are crucial for globally consistent results; and weak constraints on clothing texture detail allow excessive deformation during thin-plate-spline (TPS) fitting, so many try-on methods look unrealistic in the final rendered video. To address these problems, we reduce excessive garment distortion during deformation by using a constraint function to regularize the TPS parameters, and we propose a multimodal two-stage combinatorial Transformer. In the first stage, an interaction module models the long-range relationship between the person and the clothing, yielding better long-range correspondences and improving the TPS warping. In the second stage, an activation module establishes global dependencies that make the important regions of the input more prominent, providing more natural intermediate inputs for the subsequent U-Net. Our method produces better results for video virtual try-on, and experiments on the VVT dataset show that it outperforms previous methods both quantitatively and qualitatively.
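The abstract names three technical pieces: a constraint that regularizes the TPS warping parameters, a first-stage interaction module relating person and clothing features, and a second-stage activation module that emphasizes important regions before a U-Net generator. The sketch below is a minimal PyTorch illustration of those ideas under our own assumptions; the tensor shapes, module names, second-order smoothness penalty, and use of standard multi-head attention are our choices for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


def tps_smoothness_loss(offsets: torch.Tensor) -> torch.Tensor:
    # offsets: (B, H, W, 2) grid of predicted TPS control-point displacements.
    # Second-order differences penalize abrupt changes between neighboring
    # control points, discouraging excessive local grid distortion.
    ddx = offsets[:, :, 2:] - 2 * offsets[:, :, 1:-1] + offsets[:, :, :-2]
    ddy = offsets[:, 2:] - 2 * offsets[:, 1:-1] + offsets[:, :-2]
    return ddx.abs().mean() + ddy.abs().mean()


class InteractionModule(nn.Module):
    # Stage 1 (assumed form): person features attend to clothing features,
    # modelling the long-range person-clothing relationship.
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person: torch.Tensor, cloth: torch.Tensor) -> torch.Tensor:
        b, c, h, w = person.shape
        q = person.flatten(2).transpose(1, 2)   # (B, HW, C) queries from the person
        kv = cloth.flatten(2).transpose(1, 2)   # keys/values from the clothing
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)


class ActivationModule(nn.Module):
    # Stage 2 (assumed form): global self-attention whose output gates the
    # input, making important regions more prominent before the U-Net.
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, HW, C)
        out, _ = self.attn(seq, seq, seq)
        gate = torch.sigmoid(out)               # per-position importance weights
        return (seq * gate).transpose(1, 2).reshape(b, c, h, w)
```

As a usage sketch, `InteractionModule(64)` maps a `(B, 64, H, W)` person feature map and a clothing feature map of the same shape to fused features, and `tps_smoothness_loss` could be added to the warping objective with a small weight to penalize abrupt grid deformation.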