Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
{"title":"通过基于 CLIP 的直接优化重新审视图像字幕培训范式","authors":"Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara","doi":"arxiv-2408.14547","DOIUrl":null,"url":null,"abstract":"The conventional training approach for image captioning involves pre-training\na network using teacher forcing and subsequent fine-tuning with Self-Critical\nSequence Training to maximize hand-crafted captioning metrics. However, when\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\nPAC-Score, this training method often encounters instability and fails to\nacquire the genuine descriptive capabilities needed to produce fluent and\ninformative captions. In this paper, we propose a new training paradigm termed\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\noptimizes a reward model that is distilled from a learnable captioning\nevaluator with high human correlation. This is done by solving a weighted\nclassification problem directly inside the captioner. At the same time, DiCO\nprevents divergence from the original model, ensuring that fluency is\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\nthe generated captions but also aligns more closely with human preferences\ncompared to existing methods, especially in modern metrics. Additionally, it\nmaintains competitive performance in traditional metrics. Our source code and\ntrained models are publicly available at https://github.com/aimagelab/DiCO.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization\",\"authors\":\"Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara\",\"doi\":\"arxiv-2408.14547\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The conventional training approach for image captioning involves pre-training\\na network using teacher forcing and subsequent fine-tuning with Self-Critical\\nSequence Training to maximize hand-crafted captioning metrics. However, when\\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\\nPAC-Score, this training method often encounters instability and fails to\\nacquire the genuine descriptive capabilities needed to produce fluent and\\ninformative captions. In this paper, we propose a new training paradigm termed\\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\\noptimizes a reward model that is distilled from a learnable captioning\\nevaluator with high human correlation. This is done by solving a weighted\\nclassification problem directly inside the captioner. At the same time, DiCO\\nprevents divergence from the original model, ensuring that fluency is\\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\\nthe generated captions but also aligns more closely with human preferences\\ncompared to existing methods, especially in modern metrics. Additionally, it\\nmaintains competitive performance in traditional metrics. 
Our source code and\\ntrained models are publicly available at https://github.com/aimagelab/DiCO.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"67 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.14547\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.14547","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
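To make the weighted-classification idea more concrete, the following is a minimal sketch of a DiCO-style objective, written from the abstract alone rather than from the authors' released code: several captions sampled for an image are scored by a CLIP-based evaluator, the scores define a target distribution over the candidates, and the captioner is trained with a cross-entropy loss against that distribution while a frozen copy of the pre-trained captioner limits divergence. All names (dico_style_loss, beta, rewards) and the exact parameterization are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of a DiCO-style weighted classification loss for one image.
# Based only on the abstract; not the authors' code.
import torch
import torch.nn.functional as F

def dico_style_loss(logp_policy, logp_reference, rewards, beta=0.1):
    """Weighted classification over K sampled captions for a single image.

    logp_policy:    (K,) summed token log-probs under the captioner being trained
    logp_reference: (K,) summed token log-probs under the frozen pre-trained captioner
    rewards:        (K,) scores from a CLIP-based evaluator (e.g. CLIP-Score / PAC-Score)
    beta:           strength of the implicit constraint toward the reference model
    """
    # Implicit score of each caption: log-ratio to the reference model, scaled by beta,
    # so the captioner cannot drift arbitrarily far from its fluent starting point.
    implicit = beta * (logp_policy - logp_reference)

    # Target distribution over the K candidates derived from the evaluator's rewards:
    # better-scored captions receive larger classification weight.
    target = F.softmax(rewards, dim=-1)

    # Cross-entropy between the reward-induced targets and the model's distribution
    # over candidates: a weighted classification problem solved inside the captioner.
    return -(target * F.log_softmax(implicit, dim=-1)).sum()

# Usage example with dummy values for K = 4 candidate captions.
if __name__ == "__main__":
    K = 4
    logp_policy = torch.randn(K, requires_grad=True)
    logp_reference = torch.randn(K)
    rewards = torch.rand(K)  # e.g. CLIP-Scores of the sampled captions
    loss = dico_style_loss(logp_policy, logp_reference, rewards)
    loss.backward()
    print(float(loss))

Under these assumptions, the frozen reference term plays the role the abstract attributes to DiCO's divergence prevention, while the reward-weighted targets stand in for the distilled captioning evaluator.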