Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, Youyao Jia, Sidan Du
Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems
International Journal on Semantic Web and Information Systems (Impact Factor 4.1; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2023-10-25
DOI: 10.4018/ijswis.332768 (https://doi.org/10.4018/ijswis.332768)
Citation count: 0
Abstract
With the surge in online video content, finding highlights and key video segments has garnered widespread attention. Given a textual query, video highlight detection (HD) and temporal grounding (TG) aim to predict frame-wise saliency scores from a video while concurrently locating all relevant spans. Despite recent progress in DETR-based works, these methods crudely fuse different inputs in the encoder, which limits effective cross-modal interaction. To address this limitation, the authors design QD-Net (query-guided refinement and dynamic spans network), tailored for HD&TG. Specifically, they propose a query-guided refinement module to decouple the feature encoding from the interaction process. Furthermore, they present a dynamic span decoder that leverages learnable 2D spans as decoder queries, which accelerates training convergence for TG. On the QVHighlights dataset, the proposed QD-Net achieves 61.87 HD-HIT@1 and 61.88 TG-mAP@0.5, improvements of +1.88 and +8.05, respectively, over the state-of-the-art method.
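To make the span and metric terminology concrete, the following is a minimal sketch and not the authors' implementation. It assumes a 2D span is parameterized as (center, width), which the abstract does not state, and it shows the temporal IoU test that underlies a threshold such as the 0.5 in TG-mAP@0.5. The function names are hypothetical.

```python
# Illustrative sketch only; the (center, width) parameterization is an
# assumption, and all function names are hypothetical.

def span_cw_to_se(center, width):
    """Convert an assumed (center, width) span to (start, end)."""
    half = width / 2.0
    return center - half, center + half


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans on the time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_at_threshold(pred, gt, threshold=0.5):
    """True when the predicted span overlaps the ground truth enough."""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    pred = span_cw_to_se(10.0, 4.0)   # spans 8.0 to 12.0 seconds
    gt = (10.0, 14.0)
    print(temporal_iou(pred, gt), matches_at_threshold(pred, gt))
```

A full mAP computation additionally ranks predictions by confidence and averages precision over queries; that part is omitted here.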
About the journal:
The International Journal on Semantic Web and Information Systems (IJSWIS) promotes a knowledge transfer channel where academics, practitioners, and researchers can discuss, analyze, criticize, synthesize, communicate, elaborate, and simplify the more-than-promising technology of the semantic Web in the context of information systems. The journal aims to establish value-adding knowledge transfer and personal development channels in three distinctive areas: academia, industry, and government.