Title: Multi-Level interaction network for temporal sentence grounding in videos
Authors: Guangli Wu, Zhijun Yang, Jing Zhang
Journal: Journal of Intelligent & Fuzzy Systems
Publication date: 2024-03-10
DOI: 10.3233/jifs-234800 (https://doi.org/10.3233/jifs-234800)
Citation count: 0
Abstract
Temporal sentence grounding in videos (TSGV) aims to retrieve, from an untrimmed video, the video segments that semantically match a given query. Most previous methods learn either local or global query features before performing cross-modal interaction, and thus ignore the complementarity between local and global features. In this paper, we propose a novel Multi-Level Interaction Network for Temporal Sentence Grounding in Videos. The network explores query semantics at both the phrase and sentence levels: phrase-level features interact with video features to highlight the video segments relevant to each query phrase, while sentence-level features interact with video features to capture global localization information. We also design a stacked fusion gate module that captures the temporal relationships and semantic information among video segments. Its gating mechanism lets the model adaptively regulate the degree of fusion between video features and query features, further improving the accuracy of target-segment prediction. Extensive experiments on the ActivityNet Captions and Charades-STA benchmark datasets demonstrate that the proposed method outperforms state-of-the-art methods.
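The abstract does not state the equations behind the gating mechanism, so the following is only a minimal sketch of one common form of gated fusion, not the authors' stacked fusion gate module. It assumes a per-dimension sigmoid gate that blends a video segment feature with a query feature. The names `gated_fusion` and `fuse_segments`, and the parameters `w_v`, `w_q`, and `bias`, are hypothetical and chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(video, query, w_v, w_q, bias):
    """Blend one video segment feature with one query feature.

    Assumed (hypothetical) form, applied per dimension i:
        gate_i  = sigmoid(w_v[i] * video[i] + w_q[i] * query[i] + bias[i])
        fused_i = gate_i * video[i] + (1 - gate_i) * query[i]
    A gate near 1 keeps the video feature; a gate near 0 keeps the query feature.
    """
    fused = []
    for v, q, a, b, c in zip(video, query, w_v, w_q, bias):
        gate = sigmoid(a * v + b * q + c)
        fused.append(gate * v + (1.0 - gate) * q)
    return fused


def fuse_segments(video_segments, query, w_v, w_q, bias):
    """Apply the same gate parameters to every segment of a video."""
    return [gated_fusion(seg, query, w_v, w_q, bias) for seg in video_segments]


if __name__ == "__main__":
    segments = [[1.0, 2.0], [0.5, -1.0]]   # two segments, two feature dimensions
    query = [3.0, 4.0]                     # sentence- or phrase-level query feature
    zeros = [0.0, 0.0]
    # With zero weights and bias the gate is 0.5, so each output is a plain average.
    print(fuse_segments(segments, query, zeros, zeros, zeros))
```

In a trained model the gate parameters would be learned, so the degree of fusion would vary per segment and per feature dimension instead of being fixed as in this toy example.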