A Hybrid Generator Architecture for Controllable Face Synthesis

Dann Mensah, N. Kim, M. Aittala, S. Laine, J. Lehtinen
{"title":"A Hybrid Generator Architecture for Controllable Face Synthesis","authors":"Dann Mensah, N. Kim, M. Aittala, S. Laine, J. Lehtinen","doi":"10.1145/3588432.3591563","DOIUrl":null,"url":null,"abstract":"Modern data-driven image generation models often surpass traditional graphics techniques in quality. However, while traditional modeling and animation tools allow precise control over the image generation process in terms of interpretable quantities — e.g., shapes and reflectances — endowing learned models with such controls is generally difficult. In the context of human faces, we seek a data-driven generator architecture that simultaneously retains the photorealistic quality of modern generative adversarial networks (GAN) and allows explicit, disentangled controls over head shapes, expressions, identity, background, and illumination. While our high-level goal is shared by a large body of previous work, we approach the problem with a different philosophy: We treat the problem as an unconditional synthesis task, and engineer interpretable inductive biases into the model that make it easy for the desired behavior to emerge. Concretely, our generator is a combination of learned neural networks and fixed-function blocks, such as a 3D morphable head model and texture-mapping rasterizer, and we leave it up to the training process to figure out how they should be used together. This greatly simplifies the training problem by removing the need for labeled training data; we learn the distributions of the independent variables that drive the model instead of requiring that their values are known for each training image. Furthermore, we need no contrastive or imitation learning for correct behavior. We show that our design successfully encourages the generative model to make use of the internal, interpretable representations in a semantically meaningful manner. This allows sampling of different aspects of the image independently, as well as precise control of the results by manipulating the internal state of the interpretable blocks within the generator. This enables, for instance, facial animation using traditional animation tools.","PeriodicalId":280036,"journal":{"name":"ACM SIGGRAPH 2023 Conference Proceedings","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM SIGGRAPH 2023 Conference Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3588432.3591563","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Modern data-driven image generation models often surpass traditional graphics techniques in quality. However, while traditional modeling and animation tools allow precise control over the image generation process in terms of interpretable quantities — e.g., shapes and reflectances — endowing learned models with such controls is generally difficult. In the context of human faces, we seek a data-driven generator architecture that simultaneously retains the photorealistic quality of modern generative adversarial networks (GAN) and allows explicit, disentangled controls over head shapes, expressions, identity, background, and illumination. While our high-level goal is shared by a large body of previous work, we approach the problem with a different philosophy: We treat the problem as an unconditional synthesis task, and engineer interpretable inductive biases into the model that make it easy for the desired behavior to emerge. Concretely, our generator is a combination of learned neural networks and fixed-function blocks, such as a 3D morphable head model and texture-mapping rasterizer, and we leave it up to the training process to figure out how they should be used together. This greatly simplifies the training problem by removing the need for labeled training data; we learn the distributions of the independent variables that drive the model instead of requiring that their values are known for each training image. Furthermore, we need no contrastive or imitation learning for correct behavior. We show that our design successfully encourages the generative model to make use of the internal, interpretable representations in a semantically meaningful manner. This allows sampling of different aspects of the image independently, as well as precise control of the results by manipulating the internal state of the interpretable blocks within the generator. This enables, for instance, facial animation using traditional animation tools.
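The abstract outlines the core architectural idea: independently sampled latent variables drive a generator built from learned networks interleaved with fixed-function graphics blocks, such as a 3D morphable head model and a texture-mapping rasterizer, with training left to discover how the pieces cooperate. The sketch below is a minimal, hypothetical illustration of that structure in PyTorch. The class names, dimensions, random bases, and the stubbed-out decoder are assumptions for exposition only; a real system would load an actual morphable model (e.g., FLAME or BFM) and rasterize textured geometry. This is not the paper's implementation.

```python
# Hypothetical sketch of a hybrid generator: learned mappings produce
# interpretable coefficients that a frozen, fixed-function block consumes.
import torch
import torch.nn as nn

class ToyMorphableHead(nn.Module):
    """Fixed-function stand-in for a 3D morphable head model:
    vertices = mean + shape_basis . shape + expr_basis . expression."""
    def __init__(self, n_verts=512, shape_dim=32, expr_dim=16):
        super().__init__()
        # Frozen, non-learned bases (random here; a real system would load
        # e.g. FLAME/BFM). Buffers are excluded from gradient-based training.
        self.register_buffer("mean", torch.randn(n_verts, 3))
        self.register_buffer("shape_basis", torch.randn(n_verts, 3, shape_dim))
        self.register_buffer("expr_basis", torch.randn(n_verts, 3, expr_dim))

    def forward(self, shape, expression):
        # Linear blend of bases, batched over the leading dimension.
        v = (self.mean
             + torch.einsum("vcd,bd->bvc", self.shape_basis, shape)
             + torch.einsum("vcd,bd->bvc", self.expr_basis, expression))
        return v  # (B, n_verts, 3)

class HybridGenerator(nn.Module):
    """Learned branches feed interpretable controls into fixed blocks."""
    def __init__(self, z_dim=64, shape_dim=32, expr_dim=16, n_verts=512):
        super().__init__()
        # Separate learned mappings per aspect, so each latent can be
        # sampled (or overridden) independently of the others.
        self.map_shape = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, shape_dim))
        self.map_expr = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, expr_dim))
        self.head = ToyMorphableHead(n_verts, shape_dim, expr_dim)
        # Learned decoder standing in for the texture-mapping rasterizer
        # plus neural rendering stages of a real pipeline.
        self.decoder = nn.Sequential(
            nn.Linear(n_verts * 3, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 64 * 64))

    def forward(self, z_shape, z_expr):
        shape = self.map_shape(z_shape)   # interpretable shape coefficients
        expr = self.map_expr(z_expr)      # interpretable expression coefficients
        verts = self.head(shape, expr)    # fixed-function 3DMM block
        img = self.decoder(verts.flatten(1)).view(-1, 3, 64, 64)
        return img, {"shape": shape, "expression": expr, "vertices": verts}

g = HybridGenerator()
img, state = g(torch.randn(2, 64), torch.randn(2, 64))
print(img.shape, state["vertices"].shape)  # (2, 3, 64, 64) and (2, 512, 3)
```

Because each aspect is driven by its own latent code and the morphable-model block exposes interpretable coefficients, shape and expression can be resampled or edited independently, mirroring the disentangled controls and animation workflow the abstract describes.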