{"title":"Dynamic Anchor Box-based Instance Decoding and Position-aware Instance Association for Online Video Instance Segmentation","authors":"Hyun-Jin Chun, Incheol Kim","doi":"10.5302/j.icros.2023.23.0086","DOIUrl":null,"url":null,"abstract":"Video instance segmentation (VIS) is a vision task that involves simultaneously detecting, classifying, segmenting, and tracking object instances in videos. In this study, we introduce dynamic anchor box and deformable attention for VIS (DAB-D-VIS), a novel transformer-based model for online VIS. To enhance the multilayer transformer-based instance decoding for each video frame, our proposed model uses deformable attention mechanisms that focus on a small set of key sampling points. Additionally, dynamic anchor boxes are employed to explicitly represent the region of candidate instances. These two methods have already been proven to be effective for transformer-based object detection from images. Furthermore, to address the constraints of online VIS, our model incorporates a robust inter-frame instance association method. This method leverages both similarity in the contrastive embedding space and positional difference in the images between two instances. Extensive experiments conducted on the YouTube-VIS benchmark dataset validate the effectiveness of our proposed DAB-D-VIS model.","PeriodicalId":38644,"journal":{"name":"Journal of Institute of Control, Robotics and Systems","volume":"2013 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Institute of Control, Robotics and Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5302/j.icros.2023.23.0086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Mathematics","Score":null,"Total":0}
Abstract
Video instance segmentation (VIS) is a vision task that involves simultaneously detecting, classifying, segmenting, and tracking object instances in videos. In this study, we introduce dynamic anchor box and deformable attention for VIS (DAB-D-VIS), a novel transformer-based model for online VIS. To enhance the multilayer transformer-based instance decoding performed for each video frame, the proposed model uses a deformable attention mechanism that attends to only a small set of key sampling points, and employs dynamic anchor boxes to explicitly represent the regions of candidate instances. Both techniques have previously proven effective for transformer-based object detection in images. Furthermore, to satisfy the constraints of online VIS, our model incorporates a robust inter-frame instance association method that leverages both the similarity of two instances in a contrastive embedding space and their positional difference between frames. Extensive experiments on the YouTube-VIS benchmark dataset validate the effectiveness of the proposed DAB-D-VIS model.
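To make the decoding side concrete, below is a minimal single-query, single-head sketch of deformable attention driven by a dynamic anchor box, in the spirit of Deformable DETR and DAB-DETR. It is an illustration, not the paper's implementation: the projection matrices `W_off`, `W_attn`, and `W_val`, the single-level feature map, and the scaling of offsets by the anchor size are simplifying assumptions (the published detectors use multi-head, multi-scale attention).

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample a feature map feat (H, W, C) at continuous coords (x, y)."""
    H, W, _ = feat.shape
    x, y = np.clip(x, 0, W - 1), np.clip(y, 0, H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[y0, x0] + wx * (1 - wy) * feat[y0, x1]
            + (1 - wx) * wy * feat[y1, x0] + wx * wy * feat[y1, x1])

def deformable_attention(query, feat, anchor, W_off, W_attn, W_val, K=4):
    """Single-head deformable attention for one decoder query.

    query : (C,) query feature
    feat  : (H, W, C) frame feature map
    anchor: (4,) dynamic anchor box (cx, cy, w, h), normalized to [0, 1]
    W_off : (C, 2K) offset projection   (hypothetical learned weights)
    W_attn: (C, K)  attention projection
    W_val : (C, C)  value projection
    K     : number of sampling points per query
    """
    H, Wd, C = feat.shape
    cx, cy, bw, bh = anchor
    # Offsets are predicted from the query and scaled by the anchor size,
    # so the sampled region tracks the current box estimate.
    offsets = (query @ W_off).reshape(K, 2) * np.array([bw, bh])
    logits = query @ W_attn
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over the K points
    out = np.zeros(C)
    for k in range(K):
        px = (cx + offsets[k, 0]) * (Wd - 1)  # continuous pixel coords
        py = (cy + offsets[k, 1]) * (H - 1)
        out += weights[k] * (bilinear_sample(feat, px, py) @ W_val)
    return out
```

Because each query aggregates only K sampled locations instead of the full H x W grid, the attention cost per decoder layer stays linear in the number of queries, which is what makes per-frame instance decoding tractable.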
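The inter-frame association step can likewise be sketched. Assuming the appearance term is a cosine similarity between contrastive embeddings and the positional term decays exponentially with the normalized box-center distance, with a balancing weight `alpha` — all hedged reconstructions, since the abstract does not specify the exact score — matching between consecutive frames reduces to a linear assignment over the combined scores.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_scores(emb_prev, emb_cur, boxes_prev, boxes_cur, alpha=0.5):
    """Score every (previous, current) instance pair.

    emb_*  : (N, D) contrastive embeddings
    boxes_*: (N, 4) boxes as normalized (cx, cy, w, h)
    alpha  : hypothetical weight balancing appearance vs. position
    """
    # Appearance: cosine similarity in the contrastive embedding space.
    e_prev = emb_prev / np.linalg.norm(emb_prev, axis=1, keepdims=True)
    e_cur = emb_cur / np.linalg.norm(emb_cur, axis=1, keepdims=True)
    appearance = e_prev @ e_cur.T                       # (N_prev, N_cur)

    # Position: affinity decays with the distance between box centers.
    centers_prev = boxes_prev[:, None, :2]              # (N_prev, 1, 2)
    centers_cur = boxes_cur[None, :, :2]                # (1, N_cur, 2)
    dist = np.linalg.norm(centers_prev - centers_cur, axis=-1)
    position = np.exp(-dist)                            # (N_prev, N_cur)

    return alpha * appearance + (1.0 - alpha) * position

def associate(emb_prev, emb_cur, boxes_prev, boxes_cur, alpha=0.5):
    """One-to-one matching that maximizes the total association score."""
    scores = association_scores(emb_prev, emb_cur, boxes_prev, boxes_cur, alpha)
    rows, cols = linear_sum_assignment(-scores)  # negate: solver minimizes
    return list(zip(rows.tolist(), cols.tolist()))
```

A full online tracker would additionally reject pairs whose score falls below a threshold and start new tracks for unmatched current-frame instances; those details are omitted from this sketch.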