Video Instance Segmentation Without Using Mask and Identity Supervision

IF 8.4 · CAS Tier 1, Computer Science · JCR Q1, Computer Science, Information Systems
Ge Li;Jiale Cao;Hanqing Sun;Rao Muhammad Anwer;Jin Xie;Fahad Khan;Yanwei Pang
DOI: 10.1109/TMM.2024.3521668
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 224-235
Publication date: 2024-12-25 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10814054/
Citations: 0

Abstract

Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all the object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these 'mask and identity' annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework that neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identity-free baseline and our mask-identity-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.
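The two learning signals described in the abstract can be illustrated with a schematic sketch: an InfoNCE-style contrastive loss over cross-frame instance queries, and an intra-video query-exchange consistency term. Everything below (the numpy formulation, the sigmoid query-to-mask decoding, dimensions, and variable names) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between the row vectors of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def query_contrast_loss(q1, q2, tau=0.1):
    """InfoNCE-style loss over instance queries from two frames.
    Row i of q1 and row i of q2 are assumed to be the same instance
    (i.e., cross-frame matching has already been applied); matched
    pairs are positives, all other cross-frame pairs are negatives."""
    logits = cosine_sim(q1, q2) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # positives on the diagonal

def exchange_consistency_loss(q_a, q_b, pix_a):
    """Intra-video query exchange: decoding frame A's pixel features with
    frame B's queries of the same instances should yield the same masks.
    Masks are decoded here as sigmoid(query . pixel_feature), a common
    query-based decoding; the paper's decoder may differ."""
    def decode(q):
        return 1.0 / (1.0 + np.exp(-(q @ pix_a.T)))      # (N, HW) soft masks
    return np.mean((decode(q_a) - decode(q_b)) ** 2)

# Toy example: 4 instance queries of dim 8 in each of two frames.
rng = np.random.default_rng(0)
q_frame_a = rng.normal(size=(4, 8))
q_frame_b = q_frame_a + 0.05 * rng.normal(size=(4, 8))   # same instances, slight drift
pix_a = rng.normal(size=(64, 8))                         # 64 "pixels" of frame A

loss_matched = query_contrast_loss(q_frame_a, q_frame_b)
loss_mismatched = query_contrast_loss(q_frame_a, q_frame_b[::-1])
cons = exchange_consistency_loss(q_frame_a, q_frame_b, pix_a)
```

As expected of a contrastive objective, the loss is far lower when queries are correctly matched across frames than when the pairing is scrambled, which is the signal that lets training proceed without identity labels.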
Source Journal

IEEE Transactions on Multimedia (Engineering & Technology - Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Review time: 5.5 months
Aims and scope: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.