Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR

Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu
Proceedings of the 4th on Person in Context Workshop, published 2022-07-09. DOI: 10.1145/3552455.3555815 (https://doi.org/10.1145/3552455.3555815)

Abstract

In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution builds on TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR uses a video-text encoder and a space-time decoder to predict the start time, the end time, and the tube of the target person. MMN detects persons in images, links the detections into tubes, extracts features from the person tubes and the text description, and predicts the similarity between them to select the most likely person tube as the grounding result. Finally, our solution refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieved third place.
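The abstract does not spell out how the two outputs are combined, so the following is only a minimal sketch of one plausible reading: pick the person tube with the highest text similarity (the MMN side) and keep only its boxes inside the predicted temporal segment (the TubeDETR side). The names `PersonTube`, `select_tube`, and `fuse_results`, the per-frame box dictionary, and the inclusive frame-index convention are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


@dataclass
class PersonTube:
    """Hypothetical container for one linked person tube."""
    boxes: Dict[int, Box]  # frame index -> bounding box
    similarity: float      # assumed tube-to-text matching score


def select_tube(tubes: List[PersonTube]) -> PersonTube:
    """Spatial side: choose the tube most similar to the text description."""
    return max(tubes, key=lambda tube: tube.similarity)


def fuse_results(tubes: List[PersonTube], start: int, end: int) -> Dict[int, Box]:
    """Keep the best tube's boxes only within the temporal segment [start, end].

    `start` and `end` stand in for the predicted start and end frames and are
    treated as inclusive frame indices (an assumption).
    """
    best = select_tube(tubes)
    return {f: box for f, box in sorted(best.boxes.items()) if start <= f <= end}


# Toy input: two candidate tubes over four frames.
tubes = [
    PersonTube({0: (0, 0, 10, 10), 1: (1, 0, 11, 10),
                2: (2, 0, 12, 10), 3: (3, 0, 13, 10)}, similarity=0.35),
    PersonTube({0: (50, 50, 60, 60), 1: (51, 50, 61, 60),
                2: (52, 50, 62, 60), 3: (53, 50, 63, 60)}, similarity=0.80),
]
result = fuse_results(tubes, start=1, end=2)
print(result)
```

In this sketch the spatial and temporal decisions are fully decoupled: the boxes come only from the selected tube and the segment only from the temporal prediction. The report itself may weight or merge the two models differently.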