Image Synthesis from Locally Related Texts
Tianrui Niu, Fangxiang Feng, Lingxuan Li, Xiaojie Wang
Proceedings of the 2020 International Conference on Multimedia Retrieval, June 8, 2020
DOI: 10.1145/3372278.3390684
Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent works focus on generating images with complex scenes and multiple objects. However, the text inputs to these models are only image-level captions, which typically describe the most apparent object or feature of the image, while detailed information (e.g., visual attributes) about individual regions and objects is often missing. Quantitative evaluation of generation performance also remains an open problem, as traditional image classification- or retrieval-based metrics fail to evaluate complex images. To address these problems, we propose to generate images conditioned on locally related texts, i.e., descriptions of local image regions or objects rather than of the whole image. Specifically, we choose questions and answers (QAs) as the locally related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and better image-text consistency (both globally and locally) help a VQA model answer questions more correctly. We propose the VQA-GAN model with three key modules: a hierarchical QA encoder, a QA-conditional GAN, and an external VQA loss. These modules help the model leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.
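The external VQA loss and the VQA-accuracy metric share one intuition: a pretrained VQA model acts as a proxy judge of (local) image-text consistency. Below is a minimal PyTorch sketch of how such a loss can be combined with a standard adversarial term; the generator, discriminator, and VQA-model interfaces, the non-saturating GAN loss, and the loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an external VQA loss for a QA-conditioned generator.
# generator, discriminator, and vqa_model are assumed callables with the
# interfaces shown; names and weighting are hypothetical.
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, vqa_model,
                   q_tokens, answer_ids, noise, lambda_vqa=1.0):
    """One generator update: adversarial loss + external VQA loss."""
    # Generate images conditioned on the encoded question/answer inputs.
    fake_images = generator(noise, q_tokens)

    # Non-saturating adversarial term: try to fool the discriminator.
    d_logits = discriminator(fake_images, q_tokens)
    adv_loss = F.binary_cross_entropy_with_logits(
        d_logits, torch.ones_like(d_logits))

    # External VQA term: a pretrained VQA model (its parameters frozen via
    # requires_grad=False) should answer the question correctly when shown
    # the generated image; gradients flow back only to the generator.
    vqa_logits = vqa_model(fake_images, q_tokens)   # (batch, num_answers)
    vqa_loss = F.cross_entropy(vqa_logits, answer_ids)

    return adv_loss + lambda_vqa * vqa_loss
```

At evaluation time the same idea becomes the proposed metric: run a pretrained VQA model on the generated images and report the fraction of questions it answers correctly, so that better image quality and image-text consistency translate into higher VQA accuracy.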