Shanshan Zhao, Teng Wang, Jinrui Zhang, Xiangchen Wang, Feng Zheng
{"title":"在图像字幕中实现细粒度多模态控制","authors":"Shanshan Zhao , Teng Wang , Jinrui Zhang , Xiangchen Wang , Feng Zheng","doi":"10.1016/j.patcog.2025.112381","DOIUrl":null,"url":null,"abstract":"<div><div>Controllable image captioning (CIC) models have traditionally focused on generating controlled descriptions using specific text styles. However, these approaches are limited as they rely solely on text control signals, which often fail to align with complex human intentions, such as selecting specific areas in images. To enhance multimodal interactivity, we propose to augment current CIC systems with diverse and joint visual-text controls. To achieve this, we first create a comprehensive Multimodal Controllable Image Captioning Corpus (MCoCa) dataset by leveraging language rewriting ability of GPT-3.5, containing 0.97M image-captions pairs along with 21 visual-text control signals. By training the visual and textual adapters equipped on the multimodal large language model with newly proposed instructional prompts on MCoCa, we observe emergent combinatory multimodal controllability and significant improvement in text controllability. We present exhaustive quantitative and qualitative results, benchmarking our trained model’s state-of-the-art zero-shot captioning performance on SentiCap and FlickrStyle10K in terms of both fidelity and controllability. For regional understanding ability of visual-controlled captioning, our method achieves obvious improvement compared with the baseline models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"172 ","pages":"Article 112381"},"PeriodicalIF":7.6000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MCoCa: Towards fine-grained multimodal control in image captioning\",\"authors\":\"Shanshan Zhao , Teng Wang , Jinrui Zhang , Xiangchen Wang , Feng Zheng\",\"doi\":\"10.1016/j.patcog.2025.112381\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Controllable image captioning (CIC) models have traditionally focused on generating controlled descriptions using specific text styles. However, these approaches are limited as they rely solely on text control signals, which often fail to align with complex human intentions, such as selecting specific areas in images. To enhance multimodal interactivity, we propose to augment current CIC systems with diverse and joint visual-text controls. To achieve this, we first create a comprehensive Multimodal Controllable Image Captioning Corpus (MCoCa) dataset by leveraging language rewriting ability of GPT-3.5, containing 0.97M image-captions pairs along with 21 visual-text control signals. By training the visual and textual adapters equipped on the multimodal large language model with newly proposed instructional prompts on MCoCa, we observe emergent combinatory multimodal controllability and significant improvement in text controllability. We present exhaustive quantitative and qualitative results, benchmarking our trained model’s state-of-the-art zero-shot captioning performance on SentiCap and FlickrStyle10K in terms of both fidelity and controllability. 
For regional understanding ability of visual-controlled captioning, our method achieves obvious improvement compared with the baseline models.</div></div>\",\"PeriodicalId\":49713,\"journal\":{\"name\":\"Pattern Recognition\",\"volume\":\"172 \",\"pages\":\"Article 112381\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0031320325010428\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320325010428","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MCoCa: Towards fine-grained multimodal control in image captioning
Controllable image captioning (CIC) models have traditionally focused on generating controlled descriptions in specific text styles. However, these approaches are limited because they rely solely on text control signals, which often fail to capture complex human intentions, such as selecting specific regions in an image. To enhance multimodal interactivity, we propose to augment current CIC systems with diverse, joint visual-text controls. To achieve this, we first create a comprehensive Multimodal Controllable Image Captioning Corpus (MCoCa) by leveraging the language rewriting capability of GPT-3.5; the dataset contains 0.97M image-caption pairs along with 21 visual-text control signals. By training visual and textual adapters attached to a multimodal large language model with newly proposed instructional prompts on MCoCa, we observe emergent combinatory multimodal controllability and significant improvements in text controllability. We present exhaustive quantitative and qualitative results, benchmarking our trained model's state-of-the-art zero-shot captioning performance on SentiCap and FlickrStyle10K in terms of both fidelity and controllability. In terms of the regional understanding required for visually controlled captioning, our method achieves clear improvements over the baseline models.
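The abstract describes the corpus-construction step only at a high level. As a rough illustration, the sketch below shows what one GPT-3.5 caption-rewriting call under a joint visual-text control might look like. Only the use of GPT-3.5 for caption rewriting comes from the abstract; the prompt wording, the rewrite_caption helper, and the specific control fields (style, region phrase) are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of the corpus-construction idea in the abstract: rewriting
# an existing caption under a joint visual-text control signal. Helper name,
# prompt wording, and control fields are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rewrite_caption(caption: str, style: str, region_phrase: str) -> str:
    """Ask GPT-3.5 to rewrite a factual caption so that it follows a text
    style control and focuses on a user-selected image region."""
    prompt = (
        f"Rewrite the caption '{caption}' in a {style} style, "
        f"focusing on the region described as '{region_phrase}'. "
        "Keep the description faithful to the original image content."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example: one of many (image, caption, control) triples that could
# populate a corpus like MCoCa.
print(rewrite_caption(
    caption="A dog runs across a grassy field.",
    style="humorous",
    region_phrase="the dog in the foreground",
))
```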
Journal Introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.