{"title":"MaskTrack:用于视频对象分割的自动标记和稳定跟踪技术","authors":"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He","doi":"10.1109/TNNLS.2024.3469959","DOIUrl":null,"url":null,"abstract":"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 7","pages":"12052-12065"},"PeriodicalIF":8.9000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation\",\"authors\":\"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He\",\"doi\":\"10.1109/TNNLS.2024.3469959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. 
To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"36 7\",\"pages\":\"12052-12065\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2024-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10726574/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10726574/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation
Abstract:
Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improved performance, segmenting long-term and complex video scenes remains challenging due to the difficulty of stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that, without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% on YouTube-VOS val) and long-term (68.2% on LVOS val) VOS benchmarks. Our method also demonstrates surprisingly strong generalization ability, performing well in visual object tracking (VOT) (65.6% on VOTS2023) and referring VOS (RVOS) (65.2% on Ref-YouTube-VOS) challenges.
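The abstract names the two ingredients (SAM-based dense auto-labeling and stable identity tracking) without detailing the pipeline. Below is a minimal, hypothetical sketch of that general idea, assuming the public segment-anything API (sam_model_registry, SamAutomaticMaskGenerator) and a naive greedy IoU linking rule; the helper names (label_video, mask_iou) and the 0.5 threshold are illustrative assumptions, not the authors' MaskTrack method.

```python
# Illustrative sketch only: per-frame dense SAM masks, linked across frames
# by greedy IoU matching to keep instance IDs stable. Not the paper's pipeline.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 0.0

def label_video(frames, iou_thresh=0.5):
    """Densely segment each frame with SAM (zero-shot, no manual labels),
    then greedily link masks across frames by IoU overlap."""
    tracks = {}    # instance id -> last-seen boolean mask
    next_id = 0
    labeled = []   # per-frame dict: instance id -> mask
    for frame in frames:  # frame: HxWx3 uint8 array
        segs = [m["segmentation"] for m in mask_generator.generate(frame)]
        assigned, used = {}, set()
        for seg in segs:
            # match to the unclaimed track with highest IoU above threshold
            best_id, best_iou = None, iou_thresh
            for tid, prev in tracks.items():
                if tid in used:
                    continue
                iou = mask_iou(seg, prev)
                if iou > best_iou:
                    best_id, best_iou = tid, iou
            if best_id is None:  # unmatched mask starts a new instance
                best_id, next_id = next_id, next_id + 1
            assigned[best_id] = seg
            used.add(best_id)
        # keep stale tracks alive so instances can re-match after occlusion
        tracks.update(assigned)
        labeled.append(assigned)
    return labeled
```

Such frame-wise greedy linking is the simplest baseline for identity propagation; the instability it exhibits on long videos with many similar objects is exactly the failure mode the paper's MaskTrack framework is designed to address.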
Journal introduction:
The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.