Talking-head video generation with long short-term contextual semantics

IF 3.5, Zone 2 (Computer Science), JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

Applied Intelligence, Pub Date: 2024-12-10, DOI: 10.1007/s10489-024-06010-y

Zhao Jing, Hongxia Bie, Jiali Wang, Zhisong Bie, Jinxin Li, Jianwei Ren, Yichen Zhi

Citations: 0

Abstract

One-shot talking-head video generation takes a source image that supplies the facial appearance and a series of motions extracted from driving frames, and produces a coherent video. Most existing methods rely on the source image alone to generate videos over long time intervals, which leads to detail loss and distorted images because of semantic mismatch. Short-term semantics extracted from previously generated frames, which are temporally consistent, can compensate for the mismatches of long-term semantics. In this paper, we propose a talking-head generation method that uses long short-term contextual semantics. First, the cross-entropy between the real frame and the generated frame under long short-term semantics is modeled mathematically. Then, a novel semi-autoregressive GAN is proposed to avoid semantic mismatch efficiently by combining complementary long-term semantics with autoregressively extracted short-term semantics. Moreover, a short-term semantics enhancement module is proposed to suppress noise in the autoregressive pipeline and to reinforce the fusion of long-term and short-term semantics. Extensive experiments demonstrate that our method generates detailed and refined frames and outperforms other methods, particularly under large motion changes.
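The data flow described in the abstract, where each frame is conditioned on the source image (long-term context) and on a previously generated frame (short-term context), can be illustrated with a toy loop. This is only a sketch under stated assumptions and not the authors' model: `toy_generator`, the blending weight `alpha`, and the representation of frames as lists of floats are all hypothetical stand-ins for a learned GAN generator, and the paper's enhancement module and its specific semi-autoregressive scheme are not modeled here.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image or feature map (assumption)


def toy_generator(source: Frame, previous: Frame, motion: float,
                  alpha: float = 0.5) -> Frame:
    """Hypothetical placeholder for a learned generator.

    Blends long-term context (the source frame) with short-term context
    (the previously generated frame), then applies the driving motion as a
    simple additive offset. A real model would be a neural network.
    """
    return [alpha * s + (1.0 - alpha) * p + motion
            for s, p in zip(source, previous)]


def generate_video(
    source: Frame,
    motions: Sequence[float],
    generator: Callable[[Frame, Frame, float], Frame] = toy_generator,
) -> List[Frame]:
    """Generate one frame per driving motion, feeding each output back in."""
    frames: List[Frame] = []
    previous = source  # no generated frame exists yet, so start from the source
    for motion in motions:
        frame = generator(source, previous, motion)
        frames.append(frame)
        previous = frame  # becomes the short-term context for the next step
    return frames


if __name__ == "__main__":
    video = generate_video(source=[0.0, 2.0], motions=[1.0, 1.0])
    print(video)
```

The point of the sketch is the feedback path: the source frame is passed to every step unchanged, while `previous` is updated after each step, so errors in one generated frame can propagate to the next. That propagation is the kind of noise the abstract says the short-term semantics enhancement module is meant to suppress.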

Source journal

Applied Intelligence (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 6.60

Self-citation rate: 20.80%

Articles published: 1361

Review time: 5.9 months

Journal description: With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.