MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation

IF 8.9 | CAS Tier 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He
{"title":"MaskTrack:用于视频对象分割的自动标记和稳定跟踪技术","authors":"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He","doi":"10.1109/TNNLS.2024.3469959","DOIUrl":null,"url":null,"abstract":"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.","PeriodicalId":13303,"journal":{"name":"IEEE transactions on neural networks and learning systems","volume":"36 7","pages":"12052-12065"},"PeriodicalIF":8.9000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MaskTrack: Auto-Labeling and Stable Tracking for Video Object Segmentation\",\"authors\":\"Zhenyu Chen;Lu Zhang;Ping Hu;Huchuan Lu;You He\",\"doi\":\"10.1109/TNNLS.2024.3469959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. 
To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.\",\"PeriodicalId\":13303,\"journal\":{\"name\":\"IEEE transactions on neural networks and learning systems\",\"volume\":\"36 7\",\"pages\":\"12052-12065\"},\"PeriodicalIF\":8.9000,\"publicationDate\":\"2024-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on neural networks and learning systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10726574/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on neural networks and learning systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10726574/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Video object segmentation (VOS) has witnessed notable progress due to the establishment of video training datasets and the introduction of diverse, innovative network architectures. However, video mask annotation is a highly intricate and labor-intensive task, as meticulous frame-by-frame comparisons are needed to ascertain the positions and identities of targets in the subsequent frames. Current VOS benchmarks often annotate only a few instances in each video to save costs, which, however, hinders the model’s understanding of the complete context of the video scenes. To simplify video annotation and achieve efficient dense labeling, we introduce a zero-shot auto-labeling strategy based on the segment anything model (SAM), enabling it to densely annotate video instances without access to any manual annotations. Moreover, although existing VOS methods demonstrate improving performance, segmenting long-term and complex video scenes remains challenging due to the difficulties in stably discriminating and tracking instance identities. To this end, we further introduce a new framework, MaskTrack, which excels in long-term VOS and also exhibits significant performance advantages in distinguishing instances in complex videos with densely packed similar objects. We conduct extensive experiments to demonstrate the effectiveness of the proposed method and show that without introducing image datasets for pretraining, it achieves excellent performance on both short-term (86.2% in YouTube-VOS val) and long-term (68.2% in LVOS val) VOS benchmarks. Our method also surprisingly demonstrates strong generalization ability and performs well in visual object tracking (VOT) (65.6% in VOTS2023) and referring VOS (RVOS) (65.2% in Ref YouTube VOS) challenges.
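The abstract describes the SAM-based zero-shot auto-labeling only at a high level. The sketch below is a minimal illustration rather than the authors' MaskTrack pipeline: it assumes the public segment-anything package, a locally available ViT-H checkpoint (the path sam_vit_h.pth is a placeholder), and a simple greedy mask-IoU heuristic for linking per-frame proposals into instance tracks.

```python
# Minimal sketch (not the authors' method): dense zero-shot mask proposals per
# frame with SAM, linked across consecutive frames by mask IoU so that each
# object keeps a consistent instance identity.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def auto_label_video(frames, checkpoint="sam_vit_h.pth", iou_thresh=0.5):
    """Return, for each frame, a list of (instance_id, boolean_mask) pairs.

    `frames` is an iterable of HxWx3 uint8 RGB arrays; `checkpoint` and
    `iou_thresh` are illustrative assumptions.
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    generator = SamAutomaticMaskGenerator(sam)

    labeled, prev, next_id = [], [], 0
    for frame in frames:
        proposals = [m["segmentation"] for m in generator.generate(frame)]
        current = []
        for mask in proposals:
            # Greedily inherit the identity of the best-overlapping mask
            # from the previous frame; otherwise start a new track.
            best_id, best_iou = None, iou_thresh
            for inst_id, prev_mask in prev:
                iou = mask_iou(mask, prev_mask)
                if iou > best_iou:
                    best_id, best_iou = inst_id, iou
            if best_id is None:
                best_id, next_id = next_id, next_id + 1
            current.append((best_id, mask))
        labeled.append(current)
        prev = current
    return labeled
```

Such greedy frame-to-frame association is exactly where identity drift arises in long or crowded videos, which is the stability problem the MaskTrack framework in the paper is designed to address.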
Source journal
IEEE transactions on neural networks and learning systems
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
CiteScore: 23.80
Self-citation rate: 9.60%
Annual publications: 2102
Review time: 3-8 weeks
Journal description: The focus of IEEE Transactions on Neural Networks and Learning Systems is to present scholarly articles discussing the theory, design, and applications of neural networks as well as other learning systems. The journal primarily highlights technical and scientific research in this domain.