MoBox: Enhancing Video Object Segmentation With Motion-Augmented Box Supervision
Xiaomin Li; Qinghe Wang; Dezhuang Li; Mengmeng Ge; Xu Jia; You He; Huchuan Lu
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 405-417
DOI: 10.1109/TCSVT.2024.3451981 (published 2024-08-29)
https://ieeexplore.ieee.org/document/10659037/
Citations: 0
Abstract
We propose MoBox, a low-cost solution for semi-supervised video object segmentation that requires only bounding boxes as manual annotations for training. Built upon a mature semi-supervised video object segmentation network, we redesign the training losses and employ a more stringent training strategy. Specifically, we introduce a well-designed constraint term that enhances traditional spatial projection by simultaneously leveraging the projections of both the ground-truth box and the predicted mask across two axes, rather than evaluating discrepancies along the x-axis and y-axis independently. To harness the intrinsic properties of videos, and considering the underlying correspondence between motion represented by optical flow and the original image, we incorporate motion coherence information into the color consistency loss as supplementary information and propose a motion discrepancy loss to obtain accurate boundaries. Additionally, to mitigate the ambiguity of weak supervision, we further introduce the pseudo-strict constraint during training, which significantly improves model performance. Our approach yields competitive scores on popular benchmarks, achieving a $\mathcal{J}\&\mathcal{F}$ score of 78.6 on the DAVIS 2017 validation set and an Overall score of 78.0 on the YouTube-VOS 2018 validation set. These results highlight the efficacy of MoBox, demonstrating that a semi-supervised video object segmentation model can be trained effectively using only motion-augmented box supervision and the intrinsic information of videos.
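To make the projection constraint concrete, below is a minimal PyTorch sketch of a two-axis box-projection loss, assuming a BoxInst-style max projection of the predicted mask and the box. This is not the authors' released code: the outer-product coupling is one plausible reading of "leveraging the projections across two axes" rather than scoring each axis independently, and the function name, tensor shapes, and dice-style comparison are illustrative assumptions.

```python
# Hypothetical sketch of a two-axis box-projection constraint (not the paper's
# exact loss). Assumes soft predicted masks and binary box masks of shape (B, H, W).
import torch

def joint_projection_loss(pred_mask: torch.Tensor,
                          box_mask: torch.Tensor,
                          eps: float = 1e-6) -> torch.Tensor:
    """pred_mask: (B, H, W) soft foreground probabilities in [0, 1].
    box_mask:  (B, H, W) binary mask rasterized from the ground-truth box."""
    # Max-project onto each axis: dim=1 collapses rows (y-axis projection onto x),
    # dim=2 collapses columns (x-axis projection onto y).
    pred_x, pred_y = pred_mask.amax(dim=1), pred_mask.amax(dim=2)  # (B, W), (B, H)
    box_x, box_y = box_mask.amax(dim=1), box_mask.amax(dim=2)

    # Couple the two axes in a single term: the outer product of a box's axis
    # projections reconstructs the box region itself, so comparing outer
    # products ties the x- and y-projections together instead of evaluating
    # each axis independently.
    joint_pred = pred_y.unsqueeze(2) * pred_x.unsqueeze(1)  # (B, H, W)
    joint_box = box_y.unsqueeze(2) * box_x.unsqueeze(1)     # equals box_mask for axis-aligned boxes

    # Dice-style comparison over the coupled projections (an illustrative choice).
    p, g = joint_pred.flatten(1), joint_box.flatten(1)
    inter = (p * g).sum(dim=1)
    dice = 1.0 - (2.0 * inter + eps) / ((p * p).sum(dim=1) + (g * g).sum(dim=1) + eps)
    return dice.mean()
```

In training, such a term would sit alongside the motion-augmented color consistency loss and the motion discrepancy loss described in the abstract; the sketch covers only the box-projection component.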
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.