ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model

IF 9.7 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-02-05 DOI:10.1109/TMM.2024.3521758

Guozhang Li;Xinpeng Ding;De Cheng;Jie Li;Nannan Wang;Xinbo Gao

{"title":"ETC: Temporal Boundary Expand Then Clarify for Weakly Supervised Video Grounding With Multimodal Large Language Model","authors":"Guozhang Li;Xinpeng Ding;De Cheng;Jie Li;Nannan Wang;Xinbo Gao","doi":"10.1109/TMM.2024.3521758","DOIUrl":null,"url":null,"abstract":"Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose <bold>ETC</b> (<bold>E</b>xpand <bold>t</b>hen <bold>C</b>larify), first using the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refining these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise in expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1772-1782"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10874219/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotations, explicit supervision methods (i.e., generating pseudo-temporal boundaries for training) have achieved great success. However, data augmentation in these methods might disrupt critical temporal information, yielding poor pseudo-temporal boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose ETC (Expand then Clarify), first using the additional information to expand the initial incomplete pseudo-temporal boundaries, and subsequently refining these expanded ones to achieve precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multi-modal large language models (MLLMs) to annotate each frame within the initial pseudo-temporal boundaries, yielding more comprehensive descriptions for expanded boundaries. To further clarify the noise in expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective to use a learnable approach to harmonize a balance between incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries for more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.

查看原文本刊更多论文

基于多模态大语言模型的弱监督视频时间边界扩展与澄清

早期的弱监督视频接地（WSVG）方法由于缺乏时间边界标注，常常存在边界检测不完整的问题。为了弥补视频级和边界级标注之间的差距，显式监督方法（即为训练生成伪时间边界）取得了巨大的成功。然而，这些方法中的数据扩充可能会破坏关键的时间信息，产生较差的伪时间边界。在本文中，我们提出了一个新的视角，即在保持原始时间内容完整性的同时，引入更多有价值的信息来扩展不完全边界。为此，我们提出了ETC (Expand then Clarify)，首先使用额外的信息来扩展初始的不完整伪时间边界，然后对这些扩展的伪时间边界进行细化以获得精确的边界。受视频连续性（即相邻帧之间的视觉相似性）的驱动，我们使用强大的多模态大语言模型（mllm）在初始伪时间边界内对每个帧进行注释，从而对扩展的边界产生更全面的描述。为了进一步澄清扩展边界中的噪声，我们将相互学习与定制的提案级对比目标相结合，使用可学习的方法来协调不完整但干净（初始）和全面但有噪声（扩展）边界之间的平衡，以获得更精确的边界。实验证明了该方法在两个具有挑战性的WSVG数据集上的优越性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.