{"title":"MaskTrack:用于视频对象分割的自动标记和稳定跟踪技术","authors":"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He","doi":"10.1109/TNNLS.2024.3469959","DOIUrl":null,"url":null,"abstract":"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 7","pages":"12052-12065"},"PeriodicalIF":8.9000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation\",\"authors\":\"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He\",\"doi\":\"10.1109/TNNLS.2024.3469959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. 
To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"36 7\",\"pages\":\"12052-12065\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2024-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10726574/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10726574/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation
Abstract:
Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improved performance, segmenting long-term and complex video scenes remains challenging due to the difficulty of stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that, without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% on YouTube-VOS val) and long-term (68.2% on LVOS val) VOS benchmarks. Our method also demonstrates surprisingly strong generalization ability, performing well in visual object tracking (VOT) (65.6% on VOTS2023) and referring VOS (RVOS) (65.2% on Ref-YouTube-VOS) challenges.
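The abstract names the two ingredients (SAM-based dense auto-labeling and stable identity tracking) without detailing the pipeline. Below is a minimal, hypothetical sketch of that general idea, assuming the public segment-anything API (sam_model_registry, SamAutomaticMaskGenerator) and a naive greedy IoU linking rule; the helper names (label_video, mask_iou) and the 0.5 threshold are illustrative assumptions, not the authors' MaskTrack method.

```python
# Illustrative sketch only: per-frame dense SAM masks, linked across frames
# by greedy IoU matching to keep instance IDs stable. Not the paper's pipeline.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 0.0

def label_video(frames, iou_thresh=0.5):
    """Densely segment each frame with SAM (zero-shot, no manual labels),
    then greedily link masks across frames by IoU overlap."""
    tracks = {}    # instance id -> last-seen boolean mask
    next_id = 0
    labeled = []   # per-frame dict: instance id -> mask
    for frame in frames:  # frame: HxWx3 uint8 array
        segs = [m["segmentation"] for m in mask_generator.generate(frame)]
        assigned, used = {}, set()
        for seg in segs:
            # match to the unclaimed track with highest IoU above threshold
            best_id, best_iou = None, iou_thresh
            for tid, prev in tracks.items():
                if tid in used:
                    continue
                iou = mask_iou(seg, prev)
                if iou > best_iou:
                    best_id, best_iou = tid, iou
            if best_id is None:  # unmatched mask starts a new instance
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = seg
            used.add(best_id)
        # keep stale tracks alive so instances can re-match after occlusion
        tracks.update(assigned)
        labeled.append(assigned)
    return labeled
```

Such frame-wise greedy linking is the simplest baseline for identity propagation; the instability it exhibits on long videos with many similar objects is exactly the failure mode the paper's MaskTrack framework is designed to address.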
Journal introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.