Pseudo-labeling with keyword refining for few-supervised video captioning

IF 7.5 | Zone 1, Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition | Pub Date: 2024-11-14 | DOI: 10.1016/j.patcog.2024.111176

Ping Li, Tao Wang, Xinkui Zhao, Xianghua Xu, Mingli Song

Pattern Recognition, Volume 159, Article 111176. https://www.sciencedirect.com/science/article/pii/S0031320324009270

Citations: 0

Abstract

Video captioning generates a sentence that describes the video content. Existing methods typically require multiple captions (e.g., 10 or 20) per video to train the model, which is costly to annotate. In this work, we explore the possibility of using only one or very few ground-truth sentences, and introduce a new task named few-supervised video captioning. Specifically, we propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module. Unlike the random sampling used in natural language processing, which may produce invalid modifications (i.e., word edits), the former module guides the model to edit words through a set of actions (e.g., copy, replace, insert, and delete) predicted by a pretrained token-level classifier, and then fine-tunes the candidate sentences with a pretrained language model. The former module also employs repetition-penalized sampling to encourage the model to yield concise pseudo-labeled sentences with less repetition, and selects the most relevant sentences using a pretrained video-text model. Moreover, to keep the pseudo-labeled sentences semantically consistent with the video content, we develop a transformer-based keyword refiner with a video-keyword gated fusion strategy that places more emphasis on relevant words. Extensive experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
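
The abstract mentions repetition-penalized sampling but does not give its formula. As a rough illustration only, the sketch below uses one common form of repetition penalty from the text-generation literature (dividing positive logits and multiplying negative logits of already-generated tokens by a penalty factor); this is an assumption, not necessarily the paper's formulation. The function name `repetition_penalized_sample` and the default `penalty=1.2` are hypothetical.

```python
import math
import random


def repetition_penalized_sample(logits, generated_ids, penalty=1.2, rng=None):
    """Sample a token id after down-weighting tokens that were already generated.

    Assumed penalty form (not taken from the paper): positive logits are
    divided by `penalty`, negative logits are multiplied by it.
    """
    rng = rng or random.Random()
    adjusted = list(logits)
    for tok in set(generated_ids):
        # Either branch lowers the logit, so a repeated token
        # always becomes less likely than before.
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty

    # Numerically stable softmax over the adjusted logits.
    peak = max(adjusted)
    weights = [math.exp(x - peak) for x in adjusted]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Draw one token id according to the penalized distribution.
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs
```

Calling `repetition_penalized_sample(logits, generated_ids)` at each decoding step, with `generated_ids` holding the tokens emitted so far, lowers the probability of every token that has already appeared, which is one way to obtain the "less repetition" behavior the abstract describes.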

Source journal

Pattern Recognition (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.