Query as Supervision: Toward Low-Cost and Robust Video Moment and Highlight Retrieval

IF 8.3 | CAS Tier 1 (Engineering & Technology) | JCR Q1 (Engineering, Electrical & Electronic)
Xun Jiang;Liqing Zhu;Xing Xu;Fumin Shen;Yang Yang;Heng Tao Shen
{"title":"Query as Supervision: Toward Low-Cost and Robust Video Moment and Highlight Retrieval","authors":"Xun Jiang;Liqing Zhu;Xing Xu;Fumin Shen;Yang Yang;Heng Tao Shen","doi":"10.1109/TCSVT.2024.3510950","DOIUrl":null,"url":null,"abstract":"Video Moment and Highlight Retrieval (VMHR) aims at retrieving video events with a text query in a long untrimmed video and selecting the most related video highlights by assigning the worthiness scores. However, we observed existing methods mostly have two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-cost and subjective, thus it is very hard and expensive to gather qualified annotated training data. 2) The previous VMHR methods would fit the temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm on model robustness towards biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles the annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method completely learns multimodal alignments within semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it only requires low-cost annotations and also provides much better robustness towards Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA and their biased training version. Extensive experiments demonstrate that the QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and reveals better robustness against biased training data. Our code is available at <uri>https://github.com/CFM-MSG/Code_QaS</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3955-3968"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10778247/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Video Moment and Highlight Retrieval (VMHR) aims to retrieve video events matching a text query in a long untrimmed video and to select the most relevant video highlights by assigning worthiness scores. However, we observe that most existing methods suffer from two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-intensive and subjective, so gathering qualified annotated training data is hard and expensive. 2) Previous VMHR methods tend to fit temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm in terms of model robustness toward biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method learns multimodal alignments entirely within a semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it requires only low-cost annotations and also provides much better robustness toward Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA, as well as their biased training versions. Extensive experiments demonstrate that QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and shows better robustness against biased training data. Our code is available at https://github.com/CFM-MSG/Code_QaS.
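
The abstract describes scoring video clips against a text query in a shared semantic space and using the query itself, rather than per-clip saliency annotations, as the supervisory signal through a ranking objective. The minimal PyTorch sketch below illustrates that general idea only: the module names, feature dimensions, negative-sampling strategy, and the margin-ranking loss are illustrative assumptions, not the paper's actual Hybrid Ranking Learning formulation.

# Minimal, illustrative sketch of query-supervised ranking for VMHR.
# Assumptions: pre-extracted clip/query features, a simple linear projection
# into a shared space, and a margin ranking loss over clips from the described
# video (positives) vs. clips from an unrelated video (negatives).
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryClipScorer(nn.Module):
    """Projects clip and query features into a shared space and scores clips."""

    def __init__(self, clip_dim=512, query_dim=512, embed_dim=256):
        super().__init__()
        self.clip_proj = nn.Linear(clip_dim, embed_dim)
        self.query_proj = nn.Linear(query_dim, embed_dim)

    def forward(self, clip_feats, query_feat):
        # clip_feats: (num_clips, clip_dim), query_feat: (query_dim,)
        clips = F.normalize(self.clip_proj(clip_feats), dim=-1)
        query = F.normalize(self.query_proj(query_feat), dim=-1)
        # Cosine similarity per clip serves as a relevance / worthiness score.
        return clips @ query


def query_supervised_ranking_loss(pos_scores, neg_scores, margin=0.2):
    """Margin ranking loss: clips from the video the query describes should
    outscore clips drawn from unrelated videos. Only the query-video pairing
    is used as supervision, not per-clip highlight annotations."""
    # Compare every positive clip score against every negative clip score.
    diff = margin - pos_scores.unsqueeze(1) + neg_scores.unsqueeze(0)
    return F.relu(diff).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = QueryClipScorer()
    # Toy features standing in for pre-extracted clip/query embeddings.
    pos_clips = torch.randn(8, 512)   # clips from the video the query describes
    neg_clips = torch.randn(8, 512)   # clips from an unrelated video
    query = torch.randn(512)

    pos_scores = model(pos_clips, query)
    neg_scores = model(neg_clips, query)
    loss = query_supervised_ranking_loss(pos_scores, neg_scores)
    print(f"ranking loss: {loss.item():.4f}")
    # At inference, per-clip scores can rank highlights directly, and runs of
    # high-scoring clips indicate candidate moments for the query.

Because the supervisory signal in such a setup is only the pairing between a query and its video, no per-clip saliency labels are required, which is what makes the annotation low-cost in the sense the abstract describes.
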
Source journal
CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months
Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.