Xun Jiang, Liqing Zhu, Xing Xu, Fumin Shen, Yang Yang, Heng Tao Shen
{"title":"作为监督的查询:面向低成本、鲁棒的视频时刻和高光检索","authors":"Xun Jiang;Liqing Zhu;Xing Xu;Fumin Shen;Yang Yang;Heng Tao Shen","doi":"10.1109/TCSVT.2024.3510950","DOIUrl":null,"url":null,"abstract":"Video Moment and Highlight Retrieval (VMHR) aims at retrieving video events with a text query in a long untrimmed video and selecting the most related video highlights by assigning the worthiness scores. However, we observed existing methods mostly have two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-cost and subjective, thus it is very hard and expensive to gather qualified annotated training data. 2) The previous VMHR methods would fit the temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm on model robustness towards biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles the annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method completely learns multimodal alignments within semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it only requires low-cost annotations and also provides much better robustness towards Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA and their biased training version. Extensive experiments demonstrate that the QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and reveals better robustness against biased training data. Our code is available at <uri>https://github.com/CFM-MSG/Code_QaS</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3955-3968"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Query as Supervision: Toward Low-Cost and Robust Video Moment and Highlight Retrieval\",\"authors\":\"Xun Jiang;Liqing Zhu;Xing Xu;Fumin Shen;Yang Yang;Heng Tao Shen\",\"doi\":\"10.1109/TCSVT.2024.3510950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video Moment and Highlight Retrieval (VMHR) aims at retrieving video events with a text query in a long untrimmed video and selecting the most related video highlights by assigning the worthiness scores. However, we observed existing methods mostly have two unavoidable defects: 1) The temporal annotations of highlight scores are extremely labor-cost and subjective, thus it is very hard and expensive to gather qualified annotated training data. 2) The previous VMHR methods would fit the temporal distributions instead of learning vision-language relevance, which reveals the limitations of the conventional paradigm on model robustness towards biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles the annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method completely learns multimodal alignments within semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. 
In this way, it only requires low-cost annotations and also provides much better robustness towards Out-Of-Distribution test samples. We evaluate our proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA and their biased training version. Extensive experiments demonstrate that the QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and reveals better robustness against biased training data. Our code is available at <uri>https://github.com/CFM-MSG/Code_QaS</uri>.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"3955-3968\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10778247/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10778247/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Query as Supervision: Toward Low-Cost and Robust Video Moment and Highlight Retrieval
Video Moment and Highlight Retrieval (VMHR) aims to retrieve video events matching a text query in a long, untrimmed video and to select the most relevant video highlights by assigning worthiness scores. However, we observe that most existing methods suffer from two unavoidable defects: 1) Temporal annotations of highlight scores are extremely labor-intensive and subjective, making it hard and expensive to gather high-quality annotated training data. 2) Previous VMHR methods tend to fit the temporal distributions of annotations rather than learn vision-language relevance, which exposes the conventional paradigm's limited robustness to biased training data from open-world scenarios. In this paper, we propose a novel method termed Query as Supervision (QaS), which jointly tackles annotation cost and model robustness in the VMHR task. Specifically, instead of learning from the distributions of temporal annotations, our QaS method learns multimodal alignments entirely within the semantic space via our proposed Hybrid Ranking Learning scheme for retrieving moments and highlights. In this way, it requires only low-cost annotations and provides much better robustness to out-of-distribution test samples. We evaluate the proposed QaS method on three benchmark datasets, i.e., QVHighlights, BLiSS, and Charades-STA, as well as their biased training versions. Extensive experiments demonstrate that QaS outperforms existing state-of-the-art methods under the same low-cost annotation settings and shows better robustness against biased training data. Our code is available at https://github.com/CFM-MSG/Code_QaS.
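To make the "query as supervision" idea concrete, the following is a minimal, hedged sketch of what ranking clips directly against the text query in a shared semantic space might look like. It is not the paper's actual Hybrid Ranking Learning scheme; the function name, the margin value, and the negative-sampling strategy are illustrative assumptions only.

```python
# Illustrative sketch only: rank clips from the query's own video above clips
# from unrelated videos, using the query embedding itself as the supervision
# signal instead of annotated highlight scores. Names and hyperparameters
# (query_ranking_loss, margin=0.2) are assumptions, not the paper's API.
import torch
import torch.nn.functional as F

def query_ranking_loss(clip_feats, query_feat, neg_clip_feats, margin=0.2):
    """Hinge-style ranking loss in a shared vision-language embedding space.

    clip_feats:     (N, D) clip embeddings from the video paired with the query
    query_feat:     (D,)   embedding of the text query
    neg_clip_feats: (M, D) clip embeddings sampled from other videos
    """
    q = F.normalize(query_feat, dim=-1)
    pos = F.normalize(clip_feats, dim=-1) @ q        # (N,) query-clip similarities
    neg = F.normalize(neg_clip_feats, dim=-1) @ q    # (M,)
    # Every positive clip should beat every negative clip by at least `margin`.
    return F.relu(margin - pos.unsqueeze(1) + neg.unsqueeze(0)).mean()

# Usage with random features (D = 256):
clips = torch.randn(8, 256)
query = torch.randn(256)
negatives = torch.randn(16, 256)
print(query_ranking_loss(clips, query, negatives))
```

At inference, the same query-clip similarities could be reused both as highlight worthiness scores and for localizing the queried moment, which is what allows this style of training to avoid per-clip score annotations.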
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.