Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

Proceedings of the 4th on Person in Context Workshop. Pub Date: 2022-07-09. DOI: 10.1145/3552455.3555815

Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu

Citations: 0

Abstract

In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution builds on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the start time, the end time, and the tube of the target person. MMN detects persons in images, links them into tubes, extracts features of the person tubes and the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Finally, our solution refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieved third place.
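The abstract does not spell out how the two outputs are fused, so the following is only a minimal sketch of one plausible reading: pick the person tube whose feature best matches the text, then keep only the boxes that fall inside the temporal segment predicted by the other model. The names `PersonTube`, `select_tube`, and `combine` are hypothetical, plain cosine similarity stands in for MMN's learned similarity, and the toy features and segment are made up for illustration.

```python
import math
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class PersonTube:
    """A linked sequence of person boxes (hypothetical container)."""
    boxes: dict[int, Box]   # frame index -> box in that frame
    feature: list[float]    # assumed tube-level embedding


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; a stand-in for a learned tube-text similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tube(tubes: list[PersonTube], text_feature: list[float]) -> PersonTube:
    """Spatial step: choose the tube most similar to the text description."""
    return max(tubes, key=lambda t: cosine(t.feature, text_feature))


def combine(tube: PersonTube, start: int, end: int) -> dict[int, Box]:
    """Fusion step (assumed rule): crop the tube to frames in [start, end]."""
    return {f: b for f, b in sorted(tube.boxes.items()) if start <= f <= end}


# Toy inputs: two candidate tubes and a text embedding (all invented).
tubes = [
    PersonTube({f: (10, 10, 50, 90) for f in range(0, 8)}, [1.0, 0.0]),
    PersonTube({f: (60, 12, 100, 95) for f in range(2, 10)}, [0.0, 1.0]),
]
text_feature = [0.1, 0.9]
segment = (4, 7)  # assumed (start, end) frames from the temporal model

best = select_tube(tubes, text_feature)
result = combine(best, *segment)
```

Here `result` maps each frame inside the predicted segment to the selected person's box. A real system might instead interpolate missing boxes or weight the two models' outputs; the report's abstract does not say which.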
