Iterative Object Count Optimization for Text-to-image Diffusion Models
Oz Zafar, Lior Wolf, Idan Schwartz
arXiv - CS - Graphics, 2024-08-21, DOI: arxiv-2408.11721
We address a persistent challenge in text-to-image models: accurately
generating a specified number of objects. Current models, which learn from
image-text pairs, inherently struggle with counting, as training data cannot
depict every possible number of objects for any given object. To solve this, we
propose optimizing the generated image based on a counting loss derived from a
counting model that aggregates an object's potential. Employing an
out-of-the-box counting model is challenging for two reasons: first, the model
requires a scaling hyperparameter for the potential aggregation that varies
depending on the viewpoint of the objects, and second, classifier guidance
techniques require modified models that operate on noisy intermediate diffusion
steps. To address these challenges, we propose an iterated online training mode
that improves the accuracy of inferred images while altering the text
conditioning embedding and dynamically adjusting hyperparameters. Our method
offers three key advantages: (i) it can consider non-derivable counting
techniques based on detection models, (ii) it is a zero-shot plug-and-play
solution facilitating rapid changes to the counting techniques and image
generation methods, and (iii) the optimized counting token can be reused to
generate accurate images without additional optimization. We evaluate the
generation of various objects and show significant improvements in accuracy.
The project page is available at https://ozzafar.github.io/count_token.
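The core loop the abstract describes — iteratively adjusting a text-conditioning embedding ("counting token") by gradient steps on a counting loss — can be sketched with toy stand-ins. Everything below (the linear `generate`, the sigmoid-based `count_potential`, all dimensions and learning rates) is a hypothetical illustration, not the paper's implementation; the finite-difference gradient is one way to accommodate counting techniques that are not directly differentiable, as advantage (i) allows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (NOT the paper's models): a fixed linear "generator" maps a
# conditioning embedding to an image-like feature vector, and a "counter"
# aggregates per-feature object potentials into a scalar count estimate.
D_EMB, D_IMG = 8, 16
W_GEN = rng.normal(size=(D_IMG, D_EMB))  # stand-in generator weights

def generate(emb):
    # Stand-in for the diffusion pipeline conditioned on the embedding.
    return W_GEN @ emb

def count_potential(img, scale=1.0):
    # Soft aggregation of per-feature potentials; `scale` plays the role of
    # the viewpoint-dependent scaling hyperparameter mentioned in the abstract.
    return scale * np.sum(1.0 / (1.0 + np.exp(-img)))

def counting_loss(emb, target, scale):
    # Squared error between the aggregated count estimate and the target.
    return (count_potential(generate(emb), scale) - target) ** 2

def optimize_count_token(target, scale=1.0, lr=0.02, steps=500, eps=1e-4):
    """Iteratively adjust the conditioning embedding (the "counting token")
    to drive the estimated object count toward `target`."""
    emb = rng.normal(size=D_EMB)
    for _ in range(steps):
        # Central finite-difference gradient of the counting loss; this also
        # works when the counter is a non-differentiable detection model.
        grad = np.array([
            (counting_loss(emb + eps * e, target, scale)
             - counting_loss(emb - eps * e, target, scale)) / (2 * eps)
            for e in np.eye(D_EMB)
        ])
        emb -= lr * grad
    return emb

token = optimize_count_token(target=5.0)
print(count_potential(generate(token)))
```

In this sketch the optimized `token` could then be reused for further generations without re-optimization, mirroring advantage (iii); in the paper's setting the loop would wrap a real diffusion model and counting model rather than these linear surrogates.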