Video Instance Segmentation Without Using Mask and Identity Supervision
Ge Li; Jiale Cao; Hanqing Sun; Rao Muhammad Anwer; Jin Xie; Fahad Khan; Yanwei Pang
IEEE Transactions on Multimedia, vol. 27, pp. 224-235. Published 2024-12-25. DOI: 10.1109/TMM.2024.3521668
https://ieeexplore.ieee.org/document/10814054/
Citations: 0
Abstract
Video instance segmentation (VIS) is a challenging vision problem in which the task is to simultaneously detect, segment, and track all object instances in a video. Most existing VIS approaches rely on pixel-level mask supervision within a frame as well as instance-level identity annotation across frames. However, obtaining these ‘mask and identity’ annotations is time-consuming and expensive. We propose the first mask-identity-free VIS framework, which neither utilizes mask annotations nor requires identity supervision. Accordingly, we introduce a query contrast and exchange network (QCEN) comprising instance query contrast and query-exchanged mask learning. The instance query contrast first performs cross-frame instance matching and then conducts query feature contrastive learning. The query-exchanged mask learning exploits both intra-video and inter-video query exchange properties: exchanging queries of an identical instance from different frames within a video results in consistent instance masks, whereas exchanging queries across videos results in all-zero background masks. Extensive experiments on three benchmarks (YouTube-VIS 2019, YouTube-VIS 2021, and OVIS) reveal the merits of the proposed approach, which significantly reduces the performance gap between the identity-free baseline and our mask-identity-free VIS method. On the YouTube-VIS 2019 validation set, our mask-identity-free approach achieves 91.4% of the stronger-supervision-based baseline performance when utilizing the same ImageNet pre-trained model.
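The two-step idea behind instance query contrast (match instance queries across frames, then contrast their features) can be illustrated with a small sketch. This is not the authors' implementation: the greedy cosine-similarity matching, the InfoNCE-style loss, the temperature value, and the names `match_queries` and `query_contrast_loss` are all assumptions chosen for illustration, and the abstract does not specify how QCEN performs either step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two query feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def match_queries(frame_a, frame_b):
    """Greedy one-to-one matching of queries across two frames.

    Assumption: a simple stand-in for cross-frame instance matching;
    pairs are taken in order of decreasing cosine similarity.
    """
    candidates = sorted(
        ((cosine(qa, qb), i, j)
         for i, qa in enumerate(frame_a)
         for j, qb in enumerate(frame_b)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            matches[i] = j
            used_a.add(i)
            used_b.add(j)
    return matches


def query_contrast_loss(frame_a, frame_b, matches, temperature=0.1):
    """InfoNCE-style loss over matched queries (hypothetical formulation).

    For each query in frame_a, its matched query in frame_b is the
    positive and every other query in frame_b is a negative.
    """
    total = 0.0
    for i, j in matches.items():
        logits = [cosine(frame_a[i], qb) / temperature for qb in frame_b]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[j]
    return total / max(len(matches), 1)


# Toy example: two instances whose query order is swapped between frames.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b = [[0.1, 0.9], [0.9, 0.1]]
matches = match_queries(frame_a, frame_b)
print(matches, query_contrast_loss(frame_a, frame_b, matches))
```

In this toy setting the matching pairs each query with its most similar counterpart in the other frame, and the loss is small when matched queries are close in feature space and large when they are not. No identity labels are used at any point, which is the property the abstract emphasizes.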
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.