Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings

Xun Jiang; Xing Xu; Zailei Zhou; Yang Yang; Fumin Shen; Heng Tao Shen

IEEE Transactions on Multimedia, vol. 26, pp. 9657-9670. DOI: 10.1109/TMM.2024.3396272. Published 19 July 2024.
Given an untrimmed video and a text query, Video Moment Retrieval (VMR) aims at retrieving a specific moment where the video content is semantically related to the text query. Conventional VMR methods rely on video-text paired data or specific temporal annotations for each target event. However, the subjectivity and time-consuming nature of the labeling process limit their practicality in multimedia applications. To address this issue, researchers recently proposed a Zero-Shot Learning setting for VMR (ZS-VMR) that trains VMR models without manual supervision signals, thereby reducing the data cost. In this paper, we tackle the challenging ZS-VMR problem with Angular Reconstructive Text Embeddings (ART), generalizing the image-text matching pre-trained model CLIP to the VMR task. Specifically, assuming that visual embeddings are close to their semantically related text embeddings in angular space, our ART method generates pseudo-text embeddings of video event proposals through the hypersphere of CLIP. Moreover, to address the temporal nature of videos, we also design local multimodal fusion learning to narrow the gap between image-text matching and video-text matching. Our experimental results on two widely used VMR benchmarks, Charades-STA and ActivityNet-Captions, show that our method outperforms current state-of-the-art ZS-VMR methods and achieves competitive performance compared to recent weakly-supervised VMR methods.
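The abstract only outlines the angular-reconstruction step, so the snippet below is a minimal PyTorch sketch of the general idea rather than the authors' implementation: a proposal's pooled CLIP visual embedding is projected onto the unit hypersphere, and a pseudo-text embedding is sampled within a small angular radius of it, reflecting the assumption that semantically related text embeddings lie angularly close by. The mean-pooling of frames, the `max_angle` bound, and the function name `pseudo_text_embedding` are all assumptions introduced for illustration; such pseudo-text embeddings would presumably stand in for manually written queries during training.

```python
import torch
import torch.nn.functional as F


def pseudo_text_embedding(frame_feats: torch.Tensor, max_angle: float = 0.2) -> torch.Tensor:
    """Illustrative sketch (not the paper's code): sample a pseudo-text embedding
    for one event proposal on CLIP's unit hypersphere.

    frame_feats: (T, D) tensor of CLIP image features for the proposal's frames.
    max_angle:   assumed upper bound (radians) on the angle between the pooled
                 visual embedding and the sampled pseudo-text embedding.
    """
    # Pool the proposal's frames and project onto the unit hypersphere.
    v = F.normalize(frame_feats.mean(dim=0), dim=-1)                 # (D,)
    # Sample a random direction in the tangent space at v (orthogonal to v).
    noise = torch.randn_like(v)
    tangent = F.normalize(noise - (noise @ v) * v, dim=-1)
    # Rotate v by a small random angle toward that direction, so the result
    # stays on the hypersphere and angularly close to the visual embedding.
    theta = torch.rand(()) * max_angle
    return torch.cos(theta) * v + torch.sin(theta) * tangent         # unit norm


if __name__ == "__main__":
    # Dummy features standing in for real CLIP outputs (16 frames, 512-d space).
    proposal_frames = F.normalize(torch.randn(16, 512), dim=-1)
    pseudo_text = pseudo_text_embedding(proposal_frames)
    visual = F.normalize(proposal_frames.mean(dim=0), dim=-1)
    print("cosine(visual, pseudo-text):", float(visual @ pseudo_text))
```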
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.