{"title":"基于先验知识的文本到图像合成半参数方法","authors":"Jiadong Liang","doi":"10.1145/3579654.3579717","DOIUrl":null,"url":null,"abstract":"Text-to-image synthesis adopts only text descriptions as input to generate consistent images which should have high visual quality and be semantically aligned with the input text. Compared to images, the textual semantics is ambiguous and sparse, which makes it challenging to map features directly and accurately from text space to image space. To address this issue, the intuitive method is to construct an intermediate space connecting text and image. Using layout as a bridge between text and image not only mitigates the difficulty of the task, but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of synthesized images. In this paper, we build a two-stage framework for text-to-image synthesis, i.e., Layout Searching by Text Matching, and Layout-to-Image Synthesis with Fine-Grained Textual Semantic Injection. Specifically, we build the prior layout knowledge from the training dataset and propose a semi-parametric layout searching strategy to retrieve the layout that matches the input sentence by measuring the semantic distance between different textual descriptions. In the stage of layout-to-image synthesis, we construct the Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs) that are designed to guarantee the fine-grained alignment of the generated images with the input text and layout obtained in the first stage. Extensive experiments conducted on the COCO-stuff dataset manifest that our method can obtain more reasonable layouts and improve the performance of synthesized images significantly.","PeriodicalId":146783,"journal":{"name":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Semi-Parametric Method for Text-to-Image Synthesis from Prior Knowledge\",\"authors\":\"Jiadong Liang\",\"doi\":\"10.1145/3579654.3579717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Text-to-image synthesis adopts only text descriptions as input to generate consistent images which should have high visual quality and be semantically aligned with the input text. Compared to images, the textual semantics is ambiguous and sparse, which makes it challenging to map features directly and accurately from text space to image space. To address this issue, the intuitive method is to construct an intermediate space connecting text and image. Using layout as a bridge between text and image not only mitigates the difficulty of the task, but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of synthesized images. In this paper, we build a two-stage framework for text-to-image synthesis, i.e., Layout Searching by Text Matching, and Layout-to-Image Synthesis with Fine-Grained Textual Semantic Injection. Specifically, we build the prior layout knowledge from the training dataset and propose a semi-parametric layout searching strategy to retrieve the layout that matches the input sentence by measuring the semantic distance between different textual descriptions. 
In the stage of layout-to-image synthesis, we construct the Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs) that are designed to guarantee the fine-grained alignment of the generated images with the input text and layout obtained in the first stage. Extensive experiments conducted on the COCO-stuff dataset manifest that our method can obtain more reasonable layouts and improve the performance of synthesized images significantly.\",\"PeriodicalId\":146783,\"journal\":{\"name\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3579654.3579717\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3579654.3579717","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Semi-Parametric Method for Text-to-Image Synthesis from Prior Knowledge
Text-to-image synthesis takes only a text description as input and generates an image that should have high visual quality and be semantically aligned with the input text. Compared with images, textual semantics are ambiguous and sparse, which makes it challenging to map features directly and accurately from the text space to the image space. An intuitive way to address this issue is to construct an intermediate space connecting text and image. Using layout as a bridge between text and image not only reduces the difficulty of the task but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of the synthesized results. In this paper, we build a two-stage framework for text-to-image synthesis: Layout Searching by Text Matching, and Layout-to-Image Synthesis with Fine-Grained Textual Semantic Injection. Specifically, we build prior layout knowledge from the training dataset and propose a semi-parametric layout-searching strategy that retrieves the layout matching the input sentence by measuring the semantic distance between textual descriptions. In the layout-to-image stage, we construct Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs), which are designed to guarantee fine-grained alignment of the generated images with the input text and with the layout obtained in the first stage. Extensive experiments on the COCO-Stuff dataset show that our method obtains more reasonable layouts and significantly improves the quality of the synthesized images.
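To make the first stage concrete, below is a minimal sketch of a semi-parametric layout search of the kind the abstract describes: training captions are embedded with a sentence encoder, and the layout paired with the caption closest to the input sentence is returned. The encoder, the cosine distance metric, and the (class_label, bounding_box) layout format are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of retrieval-based (semi-parametric) layout searching.
# Assumptions: `encoder` is any callable mapping a list of sentences to vectors,
# and each training caption is paired with its ground-truth layout, e.g. a list
# of (class_label, bounding_box) tuples. The actual encoder and distance metric
# used in the paper are not specified in the abstract.
import numpy as np

def embed(sentences, encoder):
    """Encode sentences and L2-normalize so that dot product equals cosine similarity."""
    vecs = np.asarray(encoder(sentences), dtype=np.float32)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def build_layout_bank(train_captions, train_layouts, encoder):
    """Prior layout knowledge: caption embeddings paired with their layouts."""
    return embed(train_captions, encoder), list(train_layouts)

def retrieve_layout(query_caption, caption_vecs, layouts, encoder):
    """Return the layout whose training caption is semantically closest to the query."""
    q = embed([query_caption], encoder)[0]
    sims = caption_vecs @ q  # cosine similarities against all training captions
    return layouts[int(np.argmax(sims))]
```

The retrieved layout would then condition the second-stage generator, which the paper realizes with TSAGANs to enforce textual and spatial alignment.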